diff --git a/umn/source/_static/images/en-us_image_0000001072559365.png b/umn/source/_static/images/en-us_image_0000001072559365.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001072559365.png differ diff --git a/umn/source/_static/images/en-us_image_0000001080201158.png b/umn/source/_static/images/en-us_image_0000001080201158.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001080201158.png differ diff --git a/umn/source/_static/images/en-us_image_0000001085773316.png b/umn/source/_static/images/en-us_image_0000001085773316.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001085773316.png differ diff --git a/umn/source/_static/images/en-us_image_0000001086795516.png b/umn/source/_static/images/en-us_image_0000001086795516.png new file mode 100644 index 0000000..e19fa78 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001086795516.png differ diff --git a/umn/source/_static/images/en-us_image_0000001087171010.png b/umn/source/_static/images/en-us_image_0000001087171010.png new file mode 100644 index 0000000..23d0548 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001087171010.png differ diff --git a/umn/source/_static/images/en-us_image_0000001087459671.png b/umn/source/_static/images/en-us_image_0000001087459671.png new file mode 100644 index 0000000..cc85af0 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001087459671.png differ diff --git a/umn/source/_static/images/en-us_image_0000001088608580.png b/umn/source/_static/images/en-us_image_0000001088608580.png new file mode 100644 index 0000000..b4ecd27 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001088608580.png differ diff --git a/umn/source/_static/images/en-us_image_0000001125163135.png b/umn/source/_static/images/en-us_image_0000001125163135.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001125163135.png differ diff --git a/umn/source/_static/images/en-us_image_0000001127057881.png b/umn/source/_static/images/en-us_image_0000001127057881.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001127057881.png differ diff --git a/umn/source/_static/images/en-us_image_0000001133372255.png b/umn/source/_static/images/en-us_image_0000001133372255.png new file mode 100644 index 0000000..283d006 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001133372255.png differ diff --git a/umn/source/_static/images/en-us_image_0000001135452179.png b/umn/source/_static/images/en-us_image_0000001135452179.png new file mode 100644 index 0000000..f3efd3c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001135452179.png differ diff --git a/umn/source/_static/images/en-us_image_0000001135745627.png b/umn/source/_static/images/en-us_image_0000001135745627.png new file mode 100644 index 0000000..69e289e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001135745627.png differ diff --git a/umn/source/_static/images/en-us_image_0000001151336594.png b/umn/source/_static/images/en-us_image_0000001151336594.png new file mode 100644 index 0000000..90cb2a2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001151336594.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001151963015.png b/umn/source/_static/images/en-us_image_0000001151963015.png new file mode 100644 index 0000000..c5cb07b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001151963015.png differ diff --git a/umn/source/_static/images/en-us_image_0000001159690571.png b/umn/source/_static/images/en-us_image_0000001159690571.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001159690571.png differ diff --git a/umn/source/_static/images/en-us_image_0000001159847251.png b/umn/source/_static/images/en-us_image_0000001159847251.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001159847251.png differ diff --git a/umn/source/_static/images/en-us_image_0000001169371695.png b/umn/source/_static/images/en-us_image_0000001169371695.png new file mode 100644 index 0000000..9a286a1 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001169371695.png differ diff --git a/umn/source/_static/images/en-us_image_0000001184290228.png b/umn/source/_static/images/en-us_image_0000001184290228.png new file mode 100644 index 0000000..f0513ac Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001184290228.png differ diff --git a/umn/source/_static/images/en-us_image_0000001194201259.png b/umn/source/_static/images/en-us_image_0000001194201259.png new file mode 100644 index 0000000..9e79934 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001194201259.png differ diff --git a/umn/source/_static/images/en-us_image_0000001194201487.png b/umn/source/_static/images/en-us_image_0000001194201487.png new file mode 100644 index 0000000..883da30 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001194201487.png differ diff --git a/umn/source/_static/images/en-us_image_0000001194317737.png b/umn/source/_static/images/en-us_image_0000001194317737.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001194317737.png differ diff --git a/umn/source/_static/images/en-us_image_0000001195220440.png b/umn/source/_static/images/en-us_image_0000001195220440.png new file mode 100644 index 0000000..e91f8b6 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001195220440.png differ diff --git a/umn/source/_static/images/en-us_image_0000001197336255.png b/umn/source/_static/images/en-us_image_0000001197336255.png new file mode 100644 index 0000000..df8beea Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001197336255.png differ diff --git a/umn/source/_static/images/en-us_image_0000001205479339.png b/umn/source/_static/images/en-us_image_0000001205479339.png new file mode 100644 index 0000000..479bfc7 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001205479339.png differ diff --git a/umn/source/_static/images/en-us_image_0000001214152530.png b/umn/source/_static/images/en-us_image_0000001214152530.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001214152530.png differ diff --git a/umn/source/_static/images/en-us_image_0000001214312492.png b/umn/source/_static/images/en-us_image_0000001214312492.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001214312492.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001214315364.png b/umn/source/_static/images/en-us_image_0000001214315364.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001214315364.png differ diff --git a/umn/source/_static/images/en-us_image_0000001216164294.png b/umn/source/_static/images/en-us_image_0000001216164294.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001216164294.png differ diff --git a/umn/source/_static/images/en-us_image_0000001223688037.png b/umn/source/_static/images/en-us_image_0000001223688037.png new file mode 100644 index 0000000..5c77c55 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001223688037.png differ diff --git a/umn/source/_static/images/en-us_image_0000001226576418.png b/umn/source/_static/images/en-us_image_0000001226576418.png new file mode 100644 index 0000000..d9d93ff Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001226576418.png differ diff --git a/umn/source/_static/images/en-us_image_0000001227056330.png b/umn/source/_static/images/en-us_image_0000001227056330.png new file mode 100644 index 0000000..75e2640 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001227056330.png differ diff --git a/umn/source/_static/images/en-us_image_0000001229609903.png b/umn/source/_static/images/en-us_image_0000001229609903.png new file mode 100644 index 0000000..dd86754 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001229609903.png differ diff --git a/umn/source/_static/images/en-us_image_0000001229690017.png b/umn/source/_static/images/en-us_image_0000001229690017.png new file mode 100644 index 0000000..33440a4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001229690017.png differ diff --git a/umn/source/_static/images/en-us_image_0000001239292351.gif b/umn/source/_static/images/en-us_image_0000001239292351.gif new file mode 100644 index 0000000..4cab727 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001239292351.gif differ diff --git a/umn/source/_static/images/en-us_image_0000001239732331.gif b/umn/source/_static/images/en-us_image_0000001239732331.gif new file mode 100644 index 0000000..4cab727 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001239732331.gif differ diff --git a/umn/source/_static/images/en-us_image_0000001258875319.png b/umn/source/_static/images/en-us_image_0000001258875319.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001258875319.png differ diff --git a/umn/source/_static/images/en-us_image_0000001259115323.png b/umn/source/_static/images/en-us_image_0000001259115323.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001259115323.png differ diff --git a/umn/source/_static/images/en-us_image_0000001259272397.png b/umn/source/_static/images/en-us_image_0000001259272397.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001259272397.png differ diff --git a/umn/source/_static/images/en-us_image_0000001261300062.png b/umn/source/_static/images/en-us_image_0000001261300062.png new file mode 100644 index 0000000..855c353 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001261300062.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001271157721.png b/umn/source/_static/images/en-us_image_0000001271157721.png new file mode 100644 index 0000000..bcde0d1 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001271157721.png differ diff --git a/umn/source/_static/images/en-us_image_0000001271536445.png b/umn/source/_static/images/en-us_image_0000001271536445.png new file mode 100644 index 0000000..16e351e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001271536445.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295738100.png b/umn/source/_static/images/en-us_image_0000001295738100.png new file mode 100644 index 0000000..74c23b8 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738100.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295738104.png b/umn/source/_static/images/en-us_image_0000001295738104.png new file mode 100644 index 0000000..8c65f65 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738104.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295738112.png b/umn/source/_static/images/en-us_image_0000001295738112.png new file mode 100644 index 0000000..e39d384 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738112.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295738144.jpg b/umn/source/_static/images/en-us_image_0000001295738144.jpg new file mode 100644 index 0000000..d6cd61e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738144.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001295738168.png b/umn/source/_static/images/en-us_image_0000001295738168.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738168.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295738236.png b/umn/source/_static/images/en-us_image_0000001295738236.png new file mode 100644 index 0000000..2fab271 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738236.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295738244.png b/umn/source/_static/images/en-us_image_0000001295738244.png new file mode 100644 index 0000000..1689fa5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738244.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295738308.png b/umn/source/_static/images/en-us_image_0000001295738308.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738308.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295738316.png b/umn/source/_static/images/en-us_image_0000001295738316.png new file mode 100644 index 0000000..8b844c4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738316.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295738324.jpg b/umn/source/_static/images/en-us_image_0000001295738324.jpg new file mode 100644 index 0000000..9b75f6f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738324.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001295738372.jpg b/umn/source/_static/images/en-us_image_0000001295738372.jpg new file mode 100644 index 0000000..ca5a091 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738372.jpg differ diff --git 
a/umn/source/_static/images/en-us_image_0000001295738404.png b/umn/source/_static/images/en-us_image_0000001295738404.png new file mode 100644 index 0000000..8ee0fa6 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738404.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295738420.jpg b/umn/source/_static/images/en-us_image_0000001295738420.jpg new file mode 100644 index 0000000..f4cc24a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738420.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001295738508.png b/umn/source/_static/images/en-us_image_0000001295738508.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295738508.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295898004.png b/umn/source/_static/images/en-us_image_0000001295898004.png new file mode 100644 index 0000000..c18b666 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898004.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295898072.png b/umn/source/_static/images/en-us_image_0000001295898072.png new file mode 100644 index 0000000..883eac5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898072.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295898104.png b/umn/source/_static/images/en-us_image_0000001295898104.png new file mode 100644 index 0000000..3c4fa39 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898104.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295898140.png b/umn/source/_static/images/en-us_image_0000001295898140.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898140.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295898204.png b/umn/source/_static/images/en-us_image_0000001295898204.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898204.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295898232.png b/umn/source/_static/images/en-us_image_0000001295898232.png new file mode 100644 index 0000000..6a0422a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898232.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295898276.jpg b/umn/source/_static/images/en-us_image_0000001295898276.jpg new file mode 100644 index 0000000..e169ace Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898276.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001295898280.jpg b/umn/source/_static/images/en-us_image_0000001295898280.jpg new file mode 100644 index 0000000..8b2f028 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898280.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001295898324.jpg b/umn/source/_static/images/en-us_image_0000001295898324.jpg new file mode 100644 index 0000000..8b2f028 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898324.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001295898328.png b/umn/source/_static/images/en-us_image_0000001295898328.png new file mode 100644 index 0000000..e009394 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898328.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001295898368.png b/umn/source/_static/images/en-us_image_0000001295898368.png new file mode 100644 index 0000000..3b3eb16 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898368.png differ diff --git a/umn/source/_static/images/en-us_image_0000001295898372.jpg b/umn/source/_static/images/en-us_image_0000001295898372.jpg new file mode 100644 index 0000000..faea0f2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001295898372.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001296057856.png b/umn/source/_static/images/en-us_image_0000001296057856.png new file mode 100644 index 0000000..389492e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296057856.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296057872.png b/umn/source/_static/images/en-us_image_0000001296057872.png new file mode 100644 index 0000000..73d4ebf Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296057872.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296057976.png b/umn/source/_static/images/en-us_image_0000001296057976.png new file mode 100644 index 0000000..023119a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296057976.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296057980.png b/umn/source/_static/images/en-us_image_0000001296057980.png new file mode 100644 index 0000000..0fdd6f9 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296057980.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296058020.png b/umn/source/_static/images/en-us_image_0000001296058020.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296058020.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296058044.jpg b/umn/source/_static/images/en-us_image_0000001296058044.jpg new file mode 100644 index 0000000..5f58063 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296058044.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001296058048.jpg b/umn/source/_static/images/en-us_image_0000001296058048.jpg new file mode 100644 index 0000000..10b5f9c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296058048.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001296058072.png b/umn/source/_static/images/en-us_image_0000001296058072.png new file mode 100644 index 0000000..4c5a2b1 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296058072.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296058084.png b/umn/source/_static/images/en-us_image_0000001296058084.png new file mode 100644 index 0000000..68bb105 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296058084.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296058088.png b/umn/source/_static/images/en-us_image_0000001296058088.png new file mode 100644 index 0000000..2b6d671 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296058088.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296058144.png b/umn/source/_static/images/en-us_image_0000001296058144.png new file mode 100644 index 0000000..e34efd2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296058144.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001296058164.jpg b/umn/source/_static/images/en-us_image_0000001296058164.jpg new file mode 100644 index 0000000..559cd06 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296058164.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001296058168.jpg b/umn/source/_static/images/en-us_image_0000001296058168.jpg new file mode 100644 index 0000000..73fa4a5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296058168.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001296058188.png b/umn/source/_static/images/en-us_image_0000001296058188.png new file mode 100644 index 0000000..fcd0028 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296058188.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296058400.png b/umn/source/_static/images/en-us_image_0000001296058400.png new file mode 100644 index 0000000..ed7c746 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296058400.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217480.png b/umn/source/_static/images/en-us_image_0000001296217480.png new file mode 100644 index 0000000..b20a28a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217480.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217500.png b/umn/source/_static/images/en-us_image_0000001296217500.png new file mode 100644 index 0000000..c2c03a2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217500.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217532.gif b/umn/source/_static/images/en-us_image_0000001296217532.gif new file mode 100644 index 0000000..ccbe345 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217532.gif differ diff --git a/umn/source/_static/images/en-us_image_0000001296217540.png b/umn/source/_static/images/en-us_image_0000001296217540.png new file mode 100644 index 0000000..6cf112f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217540.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217548.png b/umn/source/_static/images/en-us_image_0000001296217548.png new file mode 100644 index 0000000..e39d384 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217548.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217608.png b/umn/source/_static/images/en-us_image_0000001296217608.png new file mode 100644 index 0000000..0de59f6 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217608.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217612.png b/umn/source/_static/images/en-us_image_0000001296217612.png new file mode 100644 index 0000000..0fdd6f9 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217612.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217628.png b/umn/source/_static/images/en-us_image_0000001296217628.png new file mode 100644 index 0000000..99c2c33 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217628.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217676.png b/umn/source/_static/images/en-us_image_0000001296217676.png new file mode 100644 index 0000000..d52f240 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217676.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001296217700.png b/umn/source/_static/images/en-us_image_0000001296217700.png new file mode 100644 index 0000000..5c46e48 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217700.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217708.png b/umn/source/_static/images/en-us_image_0000001296217708.png new file mode 100644 index 0000000..d3a0df7 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217708.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217716.jpg b/umn/source/_static/images/en-us_image_0000001296217716.jpg new file mode 100644 index 0000000..faea0f2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217716.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001296217736.png b/umn/source/_static/images/en-us_image_0000001296217736.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217736.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217772.png b/umn/source/_static/images/en-us_image_0000001296217772.png new file mode 100644 index 0000000..44617fb Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217772.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217800.jpg b/umn/source/_static/images/en-us_image_0000001296217800.jpg new file mode 100644 index 0000000..8b2f028 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217800.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001296217820.png b/umn/source/_static/images/en-us_image_0000001296217820.png new file mode 100644 index 0000000..06e0624 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217820.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217824.png b/umn/source/_static/images/en-us_image_0000001296217824.png new file mode 100644 index 0000000..2bbb148 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217824.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217832.png b/umn/source/_static/images/en-us_image_0000001296217832.png new file mode 100644 index 0000000..4fb736c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217832.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296217924.png b/umn/source/_static/images/en-us_image_0000001296217924.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296217924.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270774.png b/umn/source/_static/images/en-us_image_0000001296270774.png new file mode 100644 index 0000000..3eb1b74 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270774.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270782.png b/umn/source/_static/images/en-us_image_0000001296270782.png new file mode 100644 index 0000000..857ae32 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270782.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270790.png b/umn/source/_static/images/en-us_image_0000001296270790.png new file mode 100644 index 0000000..764b4d5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270790.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001296270798.png b/umn/source/_static/images/en-us_image_0000001296270798.png new file mode 100644 index 0000000..8d88ad2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270798.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270802.png b/umn/source/_static/images/en-us_image_0000001296270802.png new file mode 100644 index 0000000..2d90166 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270802.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270806.png b/umn/source/_static/images/en-us_image_0000001296270806.png new file mode 100644 index 0000000..b079f2b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270806.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270810.png b/umn/source/_static/images/en-us_image_0000001296270810.png new file mode 100644 index 0000000..3c16908 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270810.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270830.png b/umn/source/_static/images/en-us_image_0000001296270830.png new file mode 100644 index 0000000..0ad908e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270830.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270838.png b/umn/source/_static/images/en-us_image_0000001296270838.png new file mode 100644 index 0000000..5ab5ec2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270838.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270842.png b/umn/source/_static/images/en-us_image_0000001296270842.png new file mode 100644 index 0000000..80dd123 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270842.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270846.png b/umn/source/_static/images/en-us_image_0000001296270846.png new file mode 100644 index 0000000..80dd123 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270846.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270850.png b/umn/source/_static/images/en-us_image_0000001296270850.png new file mode 100644 index 0000000..81c44c4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270850.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270854.png b/umn/source/_static/images/en-us_image_0000001296270854.png new file mode 100644 index 0000000..e619bd4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270854.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270858.png b/umn/source/_static/images/en-us_image_0000001296270858.png new file mode 100644 index 0000000..1faeb19 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270858.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270862.png b/umn/source/_static/images/en-us_image_0000001296270862.png new file mode 100644 index 0000000..877b008 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270862.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296270866.png b/umn/source/_static/images/en-us_image_0000001296270866.png new file mode 100644 index 0000000..2a5c0c2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296270866.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001296365606.png b/umn/source/_static/images/en-us_image_0000001296365606.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296365606.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430738.png b/umn/source/_static/images/en-us_image_0000001296430738.png new file mode 100644 index 0000000..c5a657e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430738.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430750.png b/umn/source/_static/images/en-us_image_0000001296430750.png new file mode 100644 index 0000000..1415914 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430750.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430758.png b/umn/source/_static/images/en-us_image_0000001296430758.png new file mode 100644 index 0000000..b7ac3a0 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430758.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430762.png b/umn/source/_static/images/en-us_image_0000001296430762.png new file mode 100644 index 0000000..d718d57 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430762.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430766.png b/umn/source/_static/images/en-us_image_0000001296430766.png new file mode 100644 index 0000000..6e46656 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430766.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430770.png b/umn/source/_static/images/en-us_image_0000001296430770.png new file mode 100644 index 0000000..e3ce629 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430770.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430774.jpg b/umn/source/_static/images/en-us_image_0000001296430774.jpg new file mode 100644 index 0000000..853f991 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430774.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001296430786.png b/umn/source/_static/images/en-us_image_0000001296430786.png new file mode 100644 index 0000000..e3001da Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430786.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430794.jpg b/umn/source/_static/images/en-us_image_0000001296430794.jpg new file mode 100644 index 0000000..72d243e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430794.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001296430802.png b/umn/source/_static/images/en-us_image_0000001296430802.png new file mode 100644 index 0000000..63824db Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430802.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430806.png b/umn/source/_static/images/en-us_image_0000001296430806.png new file mode 100644 index 0000000..a37c1df Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430806.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430810.png b/umn/source/_static/images/en-us_image_0000001296430810.png new file mode 100644 index 0000000..00f05f8 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430810.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001296430814.png b/umn/source/_static/images/en-us_image_0000001296430814.png new file mode 100644 index 0000000..62d31c4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430814.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430826.png b/umn/source/_static/images/en-us_image_0000001296430826.png new file mode 100644 index 0000000..d23a61d Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430826.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296430834.png b/umn/source/_static/images/en-us_image_0000001296430834.png new file mode 100644 index 0000000..9e72590 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296430834.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296525586.png b/umn/source/_static/images/en-us_image_0000001296525586.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296525586.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590594.png b/umn/source/_static/images/en-us_image_0000001296590594.png new file mode 100644 index 0000000..da11755 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590594.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590598.png b/umn/source/_static/images/en-us_image_0000001296590598.png new file mode 100644 index 0000000..37c64a2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590598.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590602.png b/umn/source/_static/images/en-us_image_0000001296590602.png new file mode 100644 index 0000000..0b2ac8f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590602.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590606.png b/umn/source/_static/images/en-us_image_0000001296590606.png new file mode 100644 index 0000000..ed517bd Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590606.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590610.png b/umn/source/_static/images/en-us_image_0000001296590610.png new file mode 100644 index 0000000..6f9b06a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590610.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590614.png b/umn/source/_static/images/en-us_image_0000001296590614.png new file mode 100644 index 0000000..e44b6a9 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590614.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590618.png b/umn/source/_static/images/en-us_image_0000001296590618.png new file mode 100644 index 0000000..e619bd4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590618.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590622.png b/umn/source/_static/images/en-us_image_0000001296590622.png new file mode 100644 index 0000000..36f8033 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590622.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590626.png b/umn/source/_static/images/en-us_image_0000001296590626.png new file mode 100644 index 0000000..84bf3e8 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590626.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001296590630.png b/umn/source/_static/images/en-us_image_0000001296590630.png new file mode 100644 index 0000000..7c82a23 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590630.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590634.png b/umn/source/_static/images/en-us_image_0000001296590634.png new file mode 100644 index 0000000..a1d1d30 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590634.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590638.png b/umn/source/_static/images/en-us_image_0000001296590638.png new file mode 100644 index 0000000..f7c043f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590638.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590642.png b/umn/source/_static/images/en-us_image_0000001296590642.png new file mode 100644 index 0000000..5120f0d Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590642.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590646.jpg b/umn/source/_static/images/en-us_image_0000001296590646.jpg new file mode 100644 index 0000000..0188634 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590646.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001296590650.png b/umn/source/_static/images/en-us_image_0000001296590650.png new file mode 100644 index 0000000..0c6220c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590650.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590654.png b/umn/source/_static/images/en-us_image_0000001296590654.png new file mode 100644 index 0000000..0b3091d Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590654.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590658.png b/umn/source/_static/images/en-us_image_0000001296590658.png new file mode 100644 index 0000000..32593f3 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590658.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590662.png b/umn/source/_static/images/en-us_image_0000001296590662.png new file mode 100644 index 0000000..76e7b65 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590662.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590666.png b/umn/source/_static/images/en-us_image_0000001296590666.png new file mode 100644 index 0000000..a37c1df Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590666.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590670.png b/umn/source/_static/images/en-us_image_0000001296590670.png new file mode 100644 index 0000000..0b3091d Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590670.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590678.png b/umn/source/_static/images/en-us_image_0000001296590678.png new file mode 100644 index 0000000..8642e66 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590678.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590682.png b/umn/source/_static/images/en-us_image_0000001296590682.png new file mode 100644 index 0000000..ec1190a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590682.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001296590686.png b/umn/source/_static/images/en-us_image_0000001296590686.png new file mode 100644 index 0000000..7d403ce Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590686.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296590694.png b/umn/source/_static/images/en-us_image_0000001296590694.png new file mode 100644 index 0000000..eb6b1fd Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296590694.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750206.png b/umn/source/_static/images/en-us_image_0000001296750206.png new file mode 100644 index 0000000..a738783 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750206.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750218.png b/umn/source/_static/images/en-us_image_0000001296750218.png new file mode 100644 index 0000000..f14a57a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750218.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750222.png b/umn/source/_static/images/en-us_image_0000001296750222.png new file mode 100644 index 0000000..a645105 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750222.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750230.png b/umn/source/_static/images/en-us_image_0000001296750230.png new file mode 100644 index 0000000..77bb122 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750230.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750238.png b/umn/source/_static/images/en-us_image_0000001296750238.png new file mode 100644 index 0000000..4cd6c24 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750238.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750242.png b/umn/source/_static/images/en-us_image_0000001296750242.png new file mode 100644 index 0000000..ac0b007 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750242.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750246.png b/umn/source/_static/images/en-us_image_0000001296750246.png new file mode 100644 index 0000000..6351839 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750246.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750250.png b/umn/source/_static/images/en-us_image_0000001296750250.png new file mode 100644 index 0000000..2d6eeef Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750250.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750254.png b/umn/source/_static/images/en-us_image_0000001296750254.png new file mode 100644 index 0000000..0645eec Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750254.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750262.png b/umn/source/_static/images/en-us_image_0000001296750262.png new file mode 100644 index 0000000..4a981e5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750262.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750266.png b/umn/source/_static/images/en-us_image_0000001296750266.png new file mode 100644 index 0000000..6e83474 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750266.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001296750270.png b/umn/source/_static/images/en-us_image_0000001296750270.png new file mode 100644 index 0000000..62d31c4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750270.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750274.png b/umn/source/_static/images/en-us_image_0000001296750274.png new file mode 100644 index 0000000..62fc501 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750274.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750278.png b/umn/source/_static/images/en-us_image_0000001296750278.png new file mode 100644 index 0000000..cc66f90 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750278.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750282.png b/umn/source/_static/images/en-us_image_0000001296750282.png new file mode 100644 index 0000000..62fc501 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750282.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750294.png b/umn/source/_static/images/en-us_image_0000001296750294.png new file mode 100644 index 0000000..49d6aa8 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750294.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750298.png b/umn/source/_static/images/en-us_image_0000001296750298.png new file mode 100644 index 0000000..e57493b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750298.png differ diff --git a/umn/source/_static/images/en-us_image_0000001296750310.png b/umn/source/_static/images/en-us_image_0000001296750310.png new file mode 100644 index 0000000..ab7c064 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001296750310.png differ diff --git a/umn/source/_static/images/en-us_image_0000001297278204.png b/umn/source/_static/images/en-us_image_0000001297278204.png new file mode 100644 index 0000000..f29fdec Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001297278204.png differ diff --git a/umn/source/_static/images/en-us_image_0000001297838112.png b/umn/source/_static/images/en-us_image_0000001297838112.png new file mode 100644 index 0000000..68eeeeb Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001297838112.png differ diff --git a/umn/source/_static/images/en-us_image_0000001297841644.png b/umn/source/_static/images/en-us_image_0000001297841644.png new file mode 100644 index 0000000..f6678bd Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001297841644.png differ diff --git a/umn/source/_static/images/en-us_image_0000001298155472.png b/umn/source/_static/images/en-us_image_0000001298155472.png new file mode 100644 index 0000000..bd61d1d Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001298155472.png differ diff --git a/umn/source/_static/images/en-us_image_0000001318119582.png b/umn/source/_static/images/en-us_image_0000001318119582.png new file mode 100644 index 0000000..7295849 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001318119582.png differ diff --git a/umn/source/_static/images/en-us_image_0000001318122266.png b/umn/source/_static/images/en-us_image_0000001318122266.png new file mode 100644 index 0000000..d3d8e98 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001318122266.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001318123498.png b/umn/source/_static/images/en-us_image_0000001318123498.png new file mode 100644 index 0000000..e104f78 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001318123498.png differ diff --git a/umn/source/_static/images/en-us_image_0000001318157588.png b/umn/source/_static/images/en-us_image_0000001318157588.png new file mode 100644 index 0000000..e104f78 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001318157588.png differ diff --git a/umn/source/_static/images/en-us_image_0000001318160568.png b/umn/source/_static/images/en-us_image_0000001318160568.png new file mode 100644 index 0000000..cb1f54b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001318160568.png differ diff --git a/umn/source/_static/images/en-us_image_0000001318441686.png b/umn/source/_static/images/en-us_image_0000001318441686.png new file mode 100644 index 0000000..035756c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001318441686.png differ diff --git a/umn/source/_static/images/en-us_image_0000001318563432.png b/umn/source/_static/images/en-us_image_0000001318563432.png new file mode 100644 index 0000000..42c8770 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001318563432.png differ diff --git a/umn/source/_static/images/en-us_image_0000001318636944.png b/umn/source/_static/images/en-us_image_0000001318636944.png new file mode 100644 index 0000000..e9124b2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001318636944.png differ diff --git a/umn/source/_static/images/en-us_image_0000001337953138.png b/umn/source/_static/images/en-us_image_0000001337953138.png new file mode 100644 index 0000000..cbdd69c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001337953138.png differ diff --git a/umn/source/_static/images/en-us_image_0000001338429394.png b/umn/source/_static/images/en-us_image_0000001338429394.png new file mode 100644 index 0000000..c7cd1b2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001338429394.png differ diff --git a/umn/source/_static/images/en-us_image_0000001348737865.png b/umn/source/_static/images/en-us_image_0000001348737865.png new file mode 100644 index 0000000..b20a28a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348737865.png differ diff --git a/umn/source/_static/images/en-us_image_0000001348737917.png b/umn/source/_static/images/en-us_image_0000001348737917.png new file mode 100644 index 0000000..97ec819 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348737917.png differ diff --git a/umn/source/_static/images/en-us_image_0000001348737925.png b/umn/source/_static/images/en-us_image_0000001348737925.png new file mode 100644 index 0000000..b20a28a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348737925.png differ diff --git a/umn/source/_static/images/en-us_image_0000001348737945.png b/umn/source/_static/images/en-us_image_0000001348737945.png new file mode 100644 index 0000000..44bdf4b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348737945.png differ diff --git a/umn/source/_static/images/en-us_image_0000001348738033.png b/umn/source/_static/images/en-us_image_0000001348738033.png new file mode 100644 index 0000000..ebda681 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348738033.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001348738077.gif b/umn/source/_static/images/en-us_image_0000001348738077.gif new file mode 100644 index 0000000..50c2214 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348738077.gif differ diff --git a/umn/source/_static/images/en-us_image_0000001348738105.png b/umn/source/_static/images/en-us_image_0000001348738105.png new file mode 100644 index 0000000..7bf8b2e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348738105.png differ diff --git a/umn/source/_static/images/en-us_image_0000001348738129.png b/umn/source/_static/images/en-us_image_0000001348738129.png new file mode 100644 index 0000000..8b844c4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348738129.png differ diff --git a/umn/source/_static/images/en-us_image_0000001348738141.jpg b/umn/source/_static/images/en-us_image_0000001348738141.jpg new file mode 100644 index 0000000..559cd06 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348738141.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001348738185.jpg b/umn/source/_static/images/en-us_image_0000001348738185.jpg new file mode 100644 index 0000000..559cd06 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348738185.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001348738189.jpg b/umn/source/_static/images/en-us_image_0000001348738189.jpg new file mode 100644 index 0000000..e96ea7b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348738189.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001348738213.png b/umn/source/_static/images/en-us_image_0000001348738213.png new file mode 100644 index 0000000..05c5068 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348738213.png differ diff --git a/umn/source/_static/images/en-us_image_0000001348738221.png b/umn/source/_static/images/en-us_image_0000001348738221.png new file mode 100644 index 0000000..680a0c7 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348738221.png differ diff --git a/umn/source/_static/images/en-us_image_0000001348738349.png b/umn/source/_static/images/en-us_image_0000001348738349.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001348738349.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057681.png b/umn/source/_static/images/en-us_image_0000001349057681.png new file mode 100644 index 0000000..d7e8b74 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057681.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057753.jpg b/umn/source/_static/images/en-us_image_0000001349057753.jpg new file mode 100644 index 0000000..a262825 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057753.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001349057773.png b/umn/source/_static/images/en-us_image_0000001349057773.png new file mode 100644 index 0000000..449c8df Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057773.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057789.png b/umn/source/_static/images/en-us_image_0000001349057789.png new file mode 100644 index 0000000..cf89bac Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057789.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001349057797.png b/umn/source/_static/images/en-us_image_0000001349057797.png new file mode 100644 index 0000000..1f18dec Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057797.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057853.png b/umn/source/_static/images/en-us_image_0000001349057853.png new file mode 100644 index 0000000..5198194 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057853.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057865.png b/umn/source/_static/images/en-us_image_0000001349057865.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057865.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057877.png b/umn/source/_static/images/en-us_image_0000001349057877.png new file mode 100644 index 0000000..3eefcea Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057877.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057881.png b/umn/source/_static/images/en-us_image_0000001349057881.png new file mode 100644 index 0000000..bdddaf2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057881.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057889.png b/umn/source/_static/images/en-us_image_0000001349057889.png new file mode 100644 index 0000000..cf89bac Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057889.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057897.png b/umn/source/_static/images/en-us_image_0000001349057897.png new file mode 100644 index 0000000..3eefcea Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057897.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057901.png b/umn/source/_static/images/en-us_image_0000001349057901.png new file mode 100644 index 0000000..2b6d671 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057901.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057905.png b/umn/source/_static/images/en-us_image_0000001349057905.png new file mode 100644 index 0000000..cf89bac Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057905.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057929.png b/umn/source/_static/images/en-us_image_0000001349057929.png new file mode 100644 index 0000000..b20a28a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057929.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057937.jpg b/umn/source/_static/images/en-us_image_0000001349057937.jpg new file mode 100644 index 0000000..10b5f9c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057937.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001349057965.png b/umn/source/_static/images/en-us_image_0000001349057965.png new file mode 100644 index 0000000..d52f240 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057965.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349057985.jpg b/umn/source/_static/images/en-us_image_0000001349057985.jpg new file mode 100644 index 0000000..ca5a091 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349057985.jpg differ diff --git 
a/umn/source/_static/images/en-us_image_0000001349058029.png b/umn/source/_static/images/en-us_image_0000001349058029.png new file mode 100644 index 0000000..fe2e502 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349058029.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110441.png b/umn/source/_static/images/en-us_image_0000001349110441.png new file mode 100644 index 0000000..86bdc3a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110441.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110445.jpg b/umn/source/_static/images/en-us_image_0000001349110445.jpg new file mode 100644 index 0000000..a7f5ed6 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110445.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001349110449.png b/umn/source/_static/images/en-us_image_0000001349110449.png new file mode 100644 index 0000000..3f6cf02 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110449.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110457.png b/umn/source/_static/images/en-us_image_0000001349110457.png new file mode 100644 index 0000000..d0cdfdb Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110457.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110469.png b/umn/source/_static/images/en-us_image_0000001349110469.png new file mode 100644 index 0000000..1e23aa1 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110469.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110473.png b/umn/source/_static/images/en-us_image_0000001349110473.png new file mode 100644 index 0000000..f8e2e45 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110473.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110485.png b/umn/source/_static/images/en-us_image_0000001349110485.png new file mode 100644 index 0000000..dfc6001 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110485.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110489.png b/umn/source/_static/images/en-us_image_0000001349110489.png new file mode 100644 index 0000000..b87965b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110489.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110493.png b/umn/source/_static/images/en-us_image_0000001349110493.png new file mode 100644 index 0000000..cb798eb Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110493.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110501.png b/umn/source/_static/images/en-us_image_0000001349110501.png new file mode 100644 index 0000000..8f635d1 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110501.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110505.png b/umn/source/_static/images/en-us_image_0000001349110505.png new file mode 100644 index 0000000..81c44c4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110505.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110509.png b/umn/source/_static/images/en-us_image_0000001349110509.png new file mode 100644 index 0000000..32593f3 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110509.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001349110513.png b/umn/source/_static/images/en-us_image_0000001349110513.png new file mode 100644 index 0000000..e693dad Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110513.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110517.png b/umn/source/_static/images/en-us_image_0000001349110517.png new file mode 100644 index 0000000..8d88ad2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110517.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110525.png b/umn/source/_static/images/en-us_image_0000001349110525.png new file mode 100644 index 0000000..441a126 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110525.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110533.png b/umn/source/_static/images/en-us_image_0000001349110533.png new file mode 100644 index 0000000..09fb619 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110533.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349110589.png b/umn/source/_static/images/en-us_image_0000001349110589.png new file mode 100644 index 0000000..43bd2e7 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349110589.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137565.png b/umn/source/_static/images/en-us_image_0000001349137565.png new file mode 100644 index 0000000..389492e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137565.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137569.png b/umn/source/_static/images/en-us_image_0000001349137569.png new file mode 100644 index 0000000..fecc622 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137569.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137577.png b/umn/source/_static/images/en-us_image_0000001349137577.png new file mode 100644 index 0000000..8fb6995 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137577.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137625.png b/umn/source/_static/images/en-us_image_0000001349137625.png new file mode 100644 index 0000000..cf89bac Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137625.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137661.png b/umn/source/_static/images/en-us_image_0000001349137661.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137661.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137681.png b/umn/source/_static/images/en-us_image_0000001349137681.png new file mode 100644 index 0000000..e39d384 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137681.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137689.png b/umn/source/_static/images/en-us_image_0000001349137689.png new file mode 100644 index 0000000..0de59f6 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137689.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137697.png b/umn/source/_static/images/en-us_image_0000001349137697.png new file mode 100644 index 0000000..c10e634 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137697.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001349137705.png b/umn/source/_static/images/en-us_image_0000001349137705.png new file mode 100644 index 0000000..c18b666 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137705.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137749.png b/umn/source/_static/images/en-us_image_0000001349137749.png new file mode 100644 index 0000000..5198194 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137749.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137781.png b/umn/source/_static/images/en-us_image_0000001349137781.png new file mode 100644 index 0000000..cf89bac Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137781.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137801.jpg b/umn/source/_static/images/en-us_image_0000001349137801.jpg new file mode 100644 index 0000000..889594a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137801.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001349137813.png b/umn/source/_static/images/en-us_image_0000001349137813.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137813.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137821.png b/umn/source/_static/images/en-us_image_0000001349137821.png new file mode 100644 index 0000000..3781afa Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137821.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137829.jpg b/umn/source/_static/images/en-us_image_0000001349137829.jpg new file mode 100644 index 0000000..73fa4a5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137829.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001349137889.png b/umn/source/_static/images/en-us_image_0000001349137889.png new file mode 100644 index 0000000..20f42d9 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137889.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349137917.png b/umn/source/_static/images/en-us_image_0000001349137917.png new file mode 100644 index 0000000..d492d95 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349137917.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190301.png b/umn/source/_static/images/en-us_image_0000001349190301.png new file mode 100644 index 0000000..1f2dfe3 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190301.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190317.png b/umn/source/_static/images/en-us_image_0000001349190317.png new file mode 100644 index 0000000..6fe1d63 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190317.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190321.png b/umn/source/_static/images/en-us_image_0000001349190321.png new file mode 100644 index 0000000..d8fdbd4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190321.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190329.png b/umn/source/_static/images/en-us_image_0000001349190329.png new file mode 100644 index 0000000..8fd7d47 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190329.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001349190337.jpg b/umn/source/_static/images/en-us_image_0000001349190337.jpg new file mode 100644 index 0000000..b33c16a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190337.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001349190341.png b/umn/source/_static/images/en-us_image_0000001349190341.png new file mode 100644 index 0000000..b2c41a3 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190341.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190345.png b/umn/source/_static/images/en-us_image_0000001349190345.png new file mode 100644 index 0000000..40bad8c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190345.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190349.png b/umn/source/_static/images/en-us_image_0000001349190349.png new file mode 100644 index 0000000..4b37e12 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190349.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190357.png b/umn/source/_static/images/en-us_image_0000001349190357.png new file mode 100644 index 0000000..6f045f0 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190357.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190361.jpg b/umn/source/_static/images/en-us_image_0000001349190361.jpg new file mode 100644 index 0000000..840c13b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190361.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001349190369.png b/umn/source/_static/images/en-us_image_0000001349190369.png new file mode 100644 index 0000000..f40f742 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190369.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190373.png b/umn/source/_static/images/en-us_image_0000001349190373.png new file mode 100644 index 0000000..3d732b5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190373.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190381.png b/umn/source/_static/images/en-us_image_0000001349190381.png new file mode 100644 index 0000000..63824db Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190381.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190385.png b/umn/source/_static/images/en-us_image_0000001349190385.png new file mode 100644 index 0000000..5ab5ec2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190385.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190389.png b/umn/source/_static/images/en-us_image_0000001349190389.png new file mode 100644 index 0000000..c0b9645 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190389.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190397.png b/umn/source/_static/images/en-us_image_0000001349190397.png new file mode 100644 index 0000000..e7d0717 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190397.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349190409.png b/umn/source/_static/images/en-us_image_0000001349190409.png new file mode 100644 index 0000000..eafbe30 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349190409.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001349257145.png b/umn/source/_static/images/en-us_image_0000001349257145.png new file mode 100644 index 0000000..249ed3f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257145.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257165.png b/umn/source/_static/images/en-us_image_0000001349257165.png new file mode 100644 index 0000000..55c252f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257165.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257205.png b/umn/source/_static/images/en-us_image_0000001349257205.png new file mode 100644 index 0000000..cf89bac Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257205.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257217.png b/umn/source/_static/images/en-us_image_0000001349257217.png new file mode 100644 index 0000000..075d751 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257217.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257245.png b/umn/source/_static/images/en-us_image_0000001349257245.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257245.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257269.png b/umn/source/_static/images/en-us_image_0000001349257269.png new file mode 100644 index 0000000..cd0b130 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257269.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257273.png b/umn/source/_static/images/en-us_image_0000001349257273.png new file mode 100644 index 0000000..023119a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257273.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257277.png b/umn/source/_static/images/en-us_image_0000001349257277.png new file mode 100644 index 0000000..1f18dec Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257277.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257293.png b/umn/source/_static/images/en-us_image_0000001349257293.png new file mode 100644 index 0000000..90d36ab Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257293.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257345.jpg b/umn/source/_static/images/en-us_image_0000001349257345.jpg new file mode 100644 index 0000000..e169ace Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257345.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001349257353.png b/umn/source/_static/images/en-us_image_0000001349257353.png new file mode 100644 index 0000000..f1c0e6b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257353.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257365.png b/umn/source/_static/images/en-us_image_0000001349257365.png new file mode 100644 index 0000000..0313312 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257365.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257369.png b/umn/source/_static/images/en-us_image_0000001349257369.png new file mode 100644 index 0000000..cf89bac Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257369.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001349257373.png b/umn/source/_static/images/en-us_image_0000001349257373.png new file mode 100644 index 0000000..149f0ce Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257373.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257377.png b/umn/source/_static/images/en-us_image_0000001349257377.png new file mode 100644 index 0000000..a6de9d4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257377.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257413.jpg b/umn/source/_static/images/en-us_image_0000001349257413.jpg new file mode 100644 index 0000000..6fca453 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257413.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001349257417.jpg b/umn/source/_static/images/en-us_image_0000001349257417.jpg new file mode 100644 index 0000000..ca5a091 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257417.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001349257461.jpg b/umn/source/_static/images/en-us_image_0000001349257461.jpg new file mode 100644 index 0000000..73fa4a5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257461.jpg differ diff --git a/umn/source/_static/images/en-us_image_0000001349257469.png b/umn/source/_static/images/en-us_image_0000001349257469.png new file mode 100644 index 0000000..20f42d9 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257469.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257477.png b/umn/source/_static/images/en-us_image_0000001349257477.png new file mode 100644 index 0000000..00f92c7 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257477.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349257505.png b/umn/source/_static/images/en-us_image_0000001349257505.png new file mode 100644 index 0000000..a6df12b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349257505.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309893.png b/umn/source/_static/images/en-us_image_0000001349309893.png new file mode 100644 index 0000000..8b74c13 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309893.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309905.png b/umn/source/_static/images/en-us_image_0000001349309905.png new file mode 100644 index 0000000..ab7c064 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309905.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309913.png b/umn/source/_static/images/en-us_image_0000001349309913.png new file mode 100644 index 0000000..d3cb0d3 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309913.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309925.png b/umn/source/_static/images/en-us_image_0000001349309925.png new file mode 100644 index 0000000..4fde740 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309925.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309929.png b/umn/source/_static/images/en-us_image_0000001349309929.png new file mode 100644 index 0000000..0db7126 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309929.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001349309933.png b/umn/source/_static/images/en-us_image_0000001349309933.png new file mode 100644 index 0000000..a122f6a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309933.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309941.png b/umn/source/_static/images/en-us_image_0000001349309941.png new file mode 100644 index 0000000..f1d96ff Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309941.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309945.png b/umn/source/_static/images/en-us_image_0000001349309945.png new file mode 100644 index 0000000..a122f6a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309945.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309949.png b/umn/source/_static/images/en-us_image_0000001349309949.png new file mode 100644 index 0000000..0f34b69 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309949.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309953.png b/umn/source/_static/images/en-us_image_0000001349309953.png new file mode 100644 index 0000000..e075af1 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309953.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309957.png b/umn/source/_static/images/en-us_image_0000001349309957.png new file mode 100644 index 0000000..e693dad Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309957.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309965.png b/umn/source/_static/images/en-us_image_0000001349309965.png new file mode 100644 index 0000000..b87965b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309965.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309973.png b/umn/source/_static/images/en-us_image_0000001349309973.png new file mode 100644 index 0000000..aa42878 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309973.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309981.png b/umn/source/_static/images/en-us_image_0000001349309981.png new file mode 100644 index 0000000..ab6322c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309981.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349309985.png b/umn/source/_static/images/en-us_image_0000001349309985.png new file mode 100644 index 0000000..b6ec7a6 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349309985.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349310065.png b/umn/source/_static/images/en-us_image_0000001349310065.png new file mode 100644 index 0000000..d81612b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349310065.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390605.png b/umn/source/_static/images/en-us_image_0000001349390605.png new file mode 100644 index 0000000..6f39be6 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390605.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390609.png b/umn/source/_static/images/en-us_image_0000001349390609.png new file mode 100644 index 0000000..065b7ed Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390609.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001349390613.png b/umn/source/_static/images/en-us_image_0000001349390613.png new file mode 100644 index 0000000..2cebb8a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390613.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390617.png b/umn/source/_static/images/en-us_image_0000001349390617.png new file mode 100644 index 0000000..04d163d Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390617.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390625.png b/umn/source/_static/images/en-us_image_0000001349390625.png new file mode 100644 index 0000000..d14c054 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390625.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390629.png b/umn/source/_static/images/en-us_image_0000001349390629.png new file mode 100644 index 0000000..e693dad Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390629.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390633.png b/umn/source/_static/images/en-us_image_0000001349390633.png new file mode 100644 index 0000000..2757dee Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390633.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390637.png b/umn/source/_static/images/en-us_image_0000001349390637.png new file mode 100644 index 0000000..f23352b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390637.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390641.png b/umn/source/_static/images/en-us_image_0000001349390641.png new file mode 100644 index 0000000..4e549df Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390641.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390649.png b/umn/source/_static/images/en-us_image_0000001349390649.png new file mode 100644 index 0000000..7bf6578 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390649.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390653.png b/umn/source/_static/images/en-us_image_0000001349390653.png new file mode 100644 index 0000000..acb26ca Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390653.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390661.png b/umn/source/_static/images/en-us_image_0000001349390661.png new file mode 100644 index 0000000..63fd834 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390661.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390665.png b/umn/source/_static/images/en-us_image_0000001349390665.png new file mode 100644 index 0000000..00b8492 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390665.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390669.png b/umn/source/_static/images/en-us_image_0000001349390669.png new file mode 100644 index 0000000..dc39684 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390669.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390673.png b/umn/source/_static/images/en-us_image_0000001349390673.png new file mode 100644 index 0000000..00f05f8 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390673.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001349390677.png b/umn/source/_static/images/en-us_image_0000001349390677.png new file mode 100644 index 0000000..dc39684 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390677.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390685.png b/umn/source/_static/images/en-us_image_0000001349390685.png new file mode 100644 index 0000000..6e46656 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390685.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390693.png b/umn/source/_static/images/en-us_image_0000001349390693.png new file mode 100644 index 0000000..698cb06 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390693.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390697.png b/umn/source/_static/images/en-us_image_0000001349390697.png new file mode 100644 index 0000000..240b987 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390697.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390705.png b/umn/source/_static/images/en-us_image_0000001349390705.png new file mode 100644 index 0000000..d943a33 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390705.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349390773.png b/umn/source/_static/images/en-us_image_0000001349390773.png new file mode 100644 index 0000000..ea37090 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349390773.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349585581.png b/umn/source/_static/images/en-us_image_0000001349585581.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349585581.png differ diff --git a/umn/source/_static/images/en-us_image_0000001349825057.png b/umn/source/_static/images/en-us_image_0000001349825057.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001349825057.png differ diff --git a/umn/source/_static/images/en-us_image_0000001351040425.png b/umn/source/_static/images/en-us_image_0000001351040425.png new file mode 100644 index 0000000..b442074 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001351040425.png differ diff --git a/umn/source/_static/images/en-us_image_0000001369746545.png b/umn/source/_static/images/en-us_image_0000001369746545.png new file mode 100644 index 0000000..887b53f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369746545.png differ diff --git a/umn/source/_static/images/en-us_image_0000001369765657.png b/umn/source/_static/images/en-us_image_0000001369765657.png new file mode 100644 index 0000000..e104f78 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369765657.png differ diff --git a/umn/source/_static/images/en-us_image_0000001369765661.png b/umn/source/_static/images/en-us_image_0000001369765661.png new file mode 100644 index 0000000..e104f78 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369765661.png differ diff --git a/umn/source/_static/images/en-us_image_0000001369864353.png b/umn/source/_static/images/en-us_image_0000001369864353.png new file mode 100644 index 0000000..42c8770 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369864353.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001369886993.png b/umn/source/_static/images/en-us_image_0000001369886993.png new file mode 100644 index 0000000..7a656dc Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369886993.png differ diff --git a/umn/source/_static/images/en-us_image_0000001369925585.png b/umn/source/_static/images/en-us_image_0000001369925585.png new file mode 100644 index 0000000..78d2dc0 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369925585.png differ diff --git a/umn/source/_static/images/en-us_image_0000001369944573.png b/umn/source/_static/images/en-us_image_0000001369944573.png new file mode 100644 index 0000000..5d7cf99 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369944573.png differ diff --git a/umn/source/_static/images/en-us_image_0000001369953797.png b/umn/source/_static/images/en-us_image_0000001369953797.png new file mode 100644 index 0000000..5a9db02 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369953797.png differ diff --git a/umn/source/_static/images/en-us_image_0000001369960209.png b/umn/source/_static/images/en-us_image_0000001369960209.png new file mode 100644 index 0000000..fb998e2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369960209.png differ diff --git a/umn/source/_static/images/en-us_image_0000001369965777.png b/umn/source/_static/images/en-us_image_0000001369965777.png new file mode 100644 index 0000000..42c8770 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369965777.png differ diff --git a/umn/source/_static/images/en-us_image_0000001369965781.png b/umn/source/_static/images/en-us_image_0000001369965781.png new file mode 100644 index 0000000..e104f78 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369965781.png differ diff --git a/umn/source/_static/images/en-us_image_0000001369965785.png b/umn/source/_static/images/en-us_image_0000001369965785.png new file mode 100644 index 0000000..e104f78 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001369965785.png differ diff --git a/umn/source/_static/images/en-us_image_0000001370061921.png b/umn/source/_static/images/en-us_image_0000001370061921.png new file mode 100644 index 0000000..cc240e2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001370061921.png differ diff --git a/umn/source/_static/images/en-us_image_0000001370085637.png b/umn/source/_static/images/en-us_image_0000001370085637.png new file mode 100644 index 0000000..e104f78 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001370085637.png differ diff --git a/umn/source/_static/images/en-us_image_0000001374635732.png b/umn/source/_static/images/en-us_image_0000001374635732.png new file mode 100644 index 0000000..a18fb35 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001374635732.png differ diff --git a/umn/source/_static/images/en-us_image_0000001375852797.png b/umn/source/_static/images/en-us_image_0000001375852797.png new file mode 100644 index 0000000..4a9a2c7 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001375852797.png differ diff --git a/umn/source/_static/images/en-us_image_0000001375901064.png b/umn/source/_static/images/en-us_image_0000001375901064.png new file mode 100644 index 0000000..edd3bbc Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001375901064.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001376041769.png b/umn/source/_static/images/en-us_image_0000001376041769.png new file mode 100644 index 0000000..4a9a2c7 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001376041769.png differ diff --git a/umn/source/_static/images/en-us_image_0000001383088002.png b/umn/source/_static/images/en-us_image_0000001383088002.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001383088002.png differ diff --git a/umn/source/_static/images/en-us_image_0000001388353450.png b/umn/source/_static/images/en-us_image_0000001388353450.png new file mode 100644 index 0000000..6ed555b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001388353450.png differ diff --git a/umn/source/_static/images/en-us_image_0000001388541980.png b/umn/source/_static/images/en-us_image_0000001388541980.png new file mode 100644 index 0000000..256b2b0 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001388541980.png differ diff --git a/umn/source/_static/images/en-us_image_0000001388629905.png b/umn/source/_static/images/en-us_image_0000001388629905.png new file mode 100644 index 0000000..f865cf8 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001388629905.png differ diff --git a/umn/source/_static/images/en-us_image_0000001388630241.png b/umn/source/_static/images/en-us_image_0000001388630241.png new file mode 100644 index 0000000..245ba13 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001388630241.png differ diff --git a/umn/source/_static/images/en-us_image_0000001388681282.png b/umn/source/_static/images/en-us_image_0000001388681282.png new file mode 100644 index 0000000..944b529 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001388681282.png differ diff --git a/umn/source/_static/images/en-us_image_0000001388835338.png b/umn/source/_static/images/en-us_image_0000001388835338.png new file mode 100644 index 0000000..20ab52f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001388835338.png differ diff --git a/umn/source/_static/images/en-us_image_0000001388996278.png b/umn/source/_static/images/en-us_image_0000001388996278.png new file mode 100644 index 0000000..80a1146 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001388996278.png differ diff --git a/umn/source/_static/images/en-us_image_0000001390455252.png b/umn/source/_static/images/en-us_image_0000001390455252.png new file mode 100644 index 0000000..597dfa2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390455252.png differ diff --git a/umn/source/_static/images/en-us_image_0000001390459444.png b/umn/source/_static/images/en-us_image_0000001390459444.png new file mode 100644 index 0000000..7089957 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390459444.png differ diff --git a/umn/source/_static/images/en-us_image_0000001390459688.png b/umn/source/_static/images/en-us_image_0000001390459688.png new file mode 100644 index 0000000..68850cc Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390459688.png differ diff --git a/umn/source/_static/images/en-us_image_0000001390618644.png b/umn/source/_static/images/en-us_image_0000001390618644.png new file mode 100644 index 0000000..af544e7 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390618644.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001390618824.png b/umn/source/_static/images/en-us_image_0000001390618824.png new file mode 100644 index 0000000..2edb746 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390618824.png differ diff --git a/umn/source/_static/images/en-us_image_0000001390618884.png b/umn/source/_static/images/en-us_image_0000001390618884.png new file mode 100644 index 0000000..7089957 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390618884.png differ diff --git a/umn/source/_static/images/en-us_image_0000001390619040.png b/umn/source/_static/images/en-us_image_0000001390619040.png new file mode 100644 index 0000000..7089957 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390619040.png differ diff --git a/umn/source/_static/images/en-us_image_0000001390874236.png b/umn/source/_static/images/en-us_image_0000001390874236.png new file mode 100644 index 0000000..5cce832 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390874236.png differ diff --git a/umn/source/_static/images/en-us_image_0000001390878044.png b/umn/source/_static/images/en-us_image_0000001390878044.png new file mode 100644 index 0000000..0a59b4d Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390878044.png differ diff --git a/umn/source/_static/images/en-us_image_0000001390934292.png b/umn/source/_static/images/en-us_image_0000001390934292.png new file mode 100644 index 0000000..5803e83 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390934292.png differ diff --git a/umn/source/_static/images/en-us_image_0000001390936180.png b/umn/source/_static/images/en-us_image_0000001390936180.png new file mode 100644 index 0000000..f349b3b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390936180.png differ diff --git a/umn/source/_static/images/en-us_image_0000001390938104.png b/umn/source/_static/images/en-us_image_0000001390938104.png new file mode 100644 index 0000000..1336c74 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001390938104.png differ diff --git a/umn/source/_static/images/en-us_image_0000001405224197.png b/umn/source/_static/images/en-us_image_0000001405224197.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001405224197.png differ diff --git a/umn/source/_static/images/en-us_image_0000001410107141.png b/umn/source/_static/images/en-us_image_0000001410107141.png new file mode 100644 index 0000000..241fee3 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001410107141.png differ diff --git a/umn/source/_static/images/en-us_image_0000001426500589.png b/umn/source/_static/images/en-us_image_0000001426500589.png new file mode 100644 index 0000000..241fee3 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001426500589.png differ diff --git a/umn/source/_static/images/en-us_image_0000001438033333.png b/umn/source/_static/images/en-us_image_0000001438033333.png new file mode 100644 index 0000000..7681615 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001438033333.png differ diff --git a/umn/source/_static/images/en-us_image_0000001438841461.png b/umn/source/_static/images/en-us_image_0000001438841461.png new file mode 100644 index 0000000..2d0f541 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001438841461.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001438954277.png b/umn/source/_static/images/en-us_image_0000001438954277.png new file mode 100644 index 0000000..2d0f541 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001438954277.png differ diff --git a/umn/source/_static/images/en-us_image_0000001439442217.png b/umn/source/_static/images/en-us_image_0000001439442217.png new file mode 100644 index 0000000..a9dc4b5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001439442217.png differ diff --git a/umn/source/_static/images/en-us_image_0000001439562477.png b/umn/source/_static/images/en-us_image_0000001439562477.png new file mode 100644 index 0000000..c457e71 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001439562477.png differ diff --git a/umn/source/_static/images/en-us_image_0000001439594513.png b/umn/source/_static/images/en-us_image_0000001439594513.png new file mode 100644 index 0000000..f6c5d1a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001439594513.png differ diff --git a/umn/source/_static/images/en-us_image_0000001440367085.png b/umn/source/_static/images/en-us_image_0000001440367085.png new file mode 100644 index 0000000..54aafb5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001440367085.png differ diff --git a/umn/source/_static/images/en-us_image_0000001440400425.png b/umn/source/_static/images/en-us_image_0000001440400425.png new file mode 100644 index 0000000..640c55d Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001440400425.png differ diff --git a/umn/source/_static/images/en-us_image_0000001440610749.png b/umn/source/_static/images/en-us_image_0000001440610749.png new file mode 100644 index 0000000..668439a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001440610749.png differ diff --git a/umn/source/_static/images/en-us_image_0000001440726389.png b/umn/source/_static/images/en-us_image_0000001440726389.png new file mode 100644 index 0000000..c38ba9e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001440726389.png differ diff --git a/umn/source/_static/images/en-us_image_0000001440858201.png b/umn/source/_static/images/en-us_image_0000001440858201.png new file mode 100644 index 0000000..c1ab408 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001440858201.png differ diff --git a/umn/source/_static/images/en-us_image_0000001440858217.png b/umn/source/_static/images/en-us_image_0000001440858217.png new file mode 100644 index 0000000..c1ab408 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001440858217.png differ diff --git a/umn/source/_static/images/en-us_image_0000001440858625.png b/umn/source/_static/images/en-us_image_0000001440858625.png new file mode 100644 index 0000000..9939061 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001440858625.png differ diff --git a/umn/source/_static/images/en-us_image_0000001440974397.png b/umn/source/_static/images/en-us_image_0000001440974397.png new file mode 100644 index 0000000..5b72b61 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001440974397.png differ diff --git a/umn/source/_static/images/en-us_image_0000001440977805.png b/umn/source/_static/images/en-us_image_0000001440977805.png new file mode 100644 index 0000000..c1ab408 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001440977805.png differ diff --git 
a/umn/source/_static/images/en-us_image_0000001440977873.png b/umn/source/_static/images/en-us_image_0000001440977873.png new file mode 100644 index 0000000..c1ab408 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001440977873.png differ diff --git a/umn/source/_static/images/en-us_image_0000001440978021.png b/umn/source/_static/images/en-us_image_0000001440978021.png new file mode 100644 index 0000000..c1ab408 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001440978021.png differ diff --git a/umn/source/_static/images/en-us_image_0000001441097913.png b/umn/source/_static/images/en-us_image_0000001441097913.png new file mode 100644 index 0000000..af544e7 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001441097913.png differ diff --git a/umn/source/_static/images/en-us_image_0000001441097977.png b/umn/source/_static/images/en-us_image_0000001441097977.png new file mode 100644 index 0000000..2a1fddc Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001441097977.png differ diff --git a/umn/source/_static/images/en-us_image_0000001441098753.png b/umn/source/_static/images/en-us_image_0000001441098753.png new file mode 100644 index 0000000..fe595cc Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001441098753.png differ diff --git a/umn/source/_static/images/en-us_image_0000001441155405.png b/umn/source/_static/images/en-us_image_0000001441155405.png new file mode 100644 index 0000000..4e289fa Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001441155405.png differ diff --git a/umn/source/_static/images/en-us_image_0000001441218249.png b/umn/source/_static/images/en-us_image_0000001441218249.png new file mode 100644 index 0000000..7089957 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001441218249.png differ diff --git a/umn/source/_static/images/en-us_image_0000001441218685.png b/umn/source/_static/images/en-us_image_0000001441218685.png new file mode 100644 index 0000000..d89f3b9 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001441218685.png differ diff --git a/umn/source/_static/images/en-us_image_0263895369.png b/umn/source/_static/images/en-us_image_0263895369.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895369.png differ diff --git a/umn/source/_static/images/en-us_image_0263895376.png b/umn/source/_static/images/en-us_image_0263895376.png new file mode 100644 index 0000000..269d4c9 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895376.png differ diff --git a/umn/source/_static/images/en-us_image_0263895382.png b/umn/source/_static/images/en-us_image_0263895382.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895382.png differ diff --git a/umn/source/_static/images/en-us_image_0263895386.png b/umn/source/_static/images/en-us_image_0263895386.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895386.png differ diff --git a/umn/source/_static/images/en-us_image_0263895407.png b/umn/source/_static/images/en-us_image_0263895407.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895407.png differ diff --git a/umn/source/_static/images/en-us_image_0263895412.png 
b/umn/source/_static/images/en-us_image_0263895412.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895412.png differ diff --git a/umn/source/_static/images/en-us_image_0263895445.png b/umn/source/_static/images/en-us_image_0263895445.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895445.png differ diff --git a/umn/source/_static/images/en-us_image_0263895453.png b/umn/source/_static/images/en-us_image_0263895453.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895453.png differ diff --git a/umn/source/_static/images/en-us_image_0263895513.png b/umn/source/_static/images/en-us_image_0263895513.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895513.png differ diff --git a/umn/source/_static/images/en-us_image_0263895526.png b/umn/source/_static/images/en-us_image_0263895526.png new file mode 100644 index 0000000..430b26e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895526.png differ diff --git a/umn/source/_static/images/en-us_image_0263895532.png b/umn/source/_static/images/en-us_image_0263895532.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895532.png differ diff --git a/umn/source/_static/images/en-us_image_0263895540.png b/umn/source/_static/images/en-us_image_0263895540.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895540.png differ diff --git a/umn/source/_static/images/en-us_image_0263895551.png b/umn/source/_static/images/en-us_image_0263895551.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895551.png differ diff --git a/umn/source/_static/images/en-us_image_0263895574.png b/umn/source/_static/images/en-us_image_0263895574.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895574.png differ diff --git a/umn/source/_static/images/en-us_image_0263895577.png b/umn/source/_static/images/en-us_image_0263895577.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895577.png differ diff --git a/umn/source/_static/images/en-us_image_0263895589.png b/umn/source/_static/images/en-us_image_0263895589.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895589.png differ diff --git a/umn/source/_static/images/en-us_image_0263895594.png b/umn/source/_static/images/en-us_image_0263895594.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895594.png differ diff --git a/umn/source/_static/images/en-us_image_0263895607.png b/umn/source/_static/images/en-us_image_0263895607.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895607.png differ diff --git a/umn/source/_static/images/en-us_image_0263895617.png b/umn/source/_static/images/en-us_image_0263895617.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895617.png differ diff --git 
a/umn/source/_static/images/en-us_image_0263895663.png b/umn/source/_static/images/en-us_image_0263895663.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895663.png differ diff --git a/umn/source/_static/images/en-us_image_0263895680.png b/umn/source/_static/images/en-us_image_0263895680.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895680.png differ diff --git a/umn/source/_static/images/en-us_image_0263895683.png b/umn/source/_static/images/en-us_image_0263895683.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895683.png differ diff --git a/umn/source/_static/images/en-us_image_0263895733.png b/umn/source/_static/images/en-us_image_0263895733.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895733.png differ diff --git a/umn/source/_static/images/en-us_image_0263895749.png b/umn/source/_static/images/en-us_image_0263895749.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895749.png differ diff --git a/umn/source/_static/images/en-us_image_0263895751.png b/umn/source/_static/images/en-us_image_0263895751.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895751.png differ diff --git a/umn/source/_static/images/en-us_image_0263895771.png b/umn/source/_static/images/en-us_image_0263895771.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895771.png differ diff --git a/umn/source/_static/images/en-us_image_0263895776.png b/umn/source/_static/images/en-us_image_0263895776.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895776.png differ diff --git a/umn/source/_static/images/en-us_image_0263895789.png b/umn/source/_static/images/en-us_image_0263895789.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895789.png differ diff --git a/umn/source/_static/images/en-us_image_0263895796.png b/umn/source/_static/images/en-us_image_0263895796.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895796.png differ diff --git a/umn/source/_static/images/en-us_image_0263895802.png b/umn/source/_static/images/en-us_image_0263895802.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895802.png differ diff --git a/umn/source/_static/images/en-us_image_0263895811.png b/umn/source/_static/images/en-us_image_0263895811.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895811.png differ diff --git a/umn/source/_static/images/en-us_image_0263895818.jpg b/umn/source/_static/images/en-us_image_0263895818.jpg new file mode 100644 index 0000000..269c975 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895818.jpg differ diff --git a/umn/source/_static/images/en-us_image_0263895827.png b/umn/source/_static/images/en-us_image_0263895827.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895827.png differ diff 
--git a/umn/source/_static/images/en-us_image_0263895859.png b/umn/source/_static/images/en-us_image_0263895859.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895859.png differ diff --git a/umn/source/_static/images/en-us_image_0263895883.png b/umn/source/_static/images/en-us_image_0263895883.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263895883.png differ diff --git a/umn/source/_static/images/en-us_image_0263899217.png b/umn/source/_static/images/en-us_image_0263899217.png new file mode 100644 index 0000000..d0b004f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899217.png differ diff --git a/umn/source/_static/images/en-us_image_0263899218.png b/umn/source/_static/images/en-us_image_0263899218.png new file mode 100644 index 0000000..f7d769f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899218.png differ diff --git a/umn/source/_static/images/en-us_image_0263899222.png b/umn/source/_static/images/en-us_image_0263899222.png new file mode 100644 index 0000000..cf28606 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899222.png differ diff --git a/umn/source/_static/images/en-us_image_0263899235.png b/umn/source/_static/images/en-us_image_0263899235.png new file mode 100644 index 0000000..7a576d0 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899235.png differ diff --git a/umn/source/_static/images/en-us_image_0263899238.png b/umn/source/_static/images/en-us_image_0263899238.png new file mode 100644 index 0000000..e837946 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899238.png differ diff --git a/umn/source/_static/images/en-us_image_0263899257.png b/umn/source/_static/images/en-us_image_0263899257.png new file mode 100644 index 0000000..f63e57c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899257.png differ diff --git a/umn/source/_static/images/en-us_image_0263899268.png b/umn/source/_static/images/en-us_image_0263899268.png new file mode 100644 index 0000000..189cc44 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899268.png differ diff --git a/umn/source/_static/images/en-us_image_0263899271.png b/umn/source/_static/images/en-us_image_0263899271.png new file mode 100644 index 0000000..b220f28 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899271.png differ diff --git a/umn/source/_static/images/en-us_image_0263899280.png b/umn/source/_static/images/en-us_image_0263899280.png new file mode 100644 index 0000000..2763efb Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899280.png differ diff --git a/umn/source/_static/images/en-us_image_0263899283.png b/umn/source/_static/images/en-us_image_0263899283.png new file mode 100644 index 0000000..90cb2a2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899283.png differ diff --git a/umn/source/_static/images/en-us_image_0263899288.png b/umn/source/_static/images/en-us_image_0263899288.png new file mode 100644 index 0000000..3dc4101 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899288.png differ diff --git a/umn/source/_static/images/en-us_image_0263899289.png b/umn/source/_static/images/en-us_image_0263899289.png new file mode 100644 index 0000000..a1533d6 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899289.png differ 
diff --git a/umn/source/_static/images/en-us_image_0263899291.png b/umn/source/_static/images/en-us_image_0263899291.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899291.png differ diff --git a/umn/source/_static/images/en-us_image_0263899293.png b/umn/source/_static/images/en-us_image_0263899293.png new file mode 100644 index 0000000..a2bd0b9 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899293.png differ diff --git a/umn/source/_static/images/en-us_image_0263899299.png b/umn/source/_static/images/en-us_image_0263899299.png new file mode 100644 index 0000000..c15acae Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899299.png differ diff --git a/umn/source/_static/images/en-us_image_0263899304.png b/umn/source/_static/images/en-us_image_0263899304.png new file mode 100644 index 0000000..7a576d0 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899304.png differ diff --git a/umn/source/_static/images/en-us_image_0263899311.png b/umn/source/_static/images/en-us_image_0263899311.png new file mode 100644 index 0000000..f2663b3 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899311.png differ diff --git a/umn/source/_static/images/en-us_image_0263899316.png b/umn/source/_static/images/en-us_image_0263899316.png new file mode 100644 index 0000000..31dd12f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899316.png differ diff --git a/umn/source/_static/images/en-us_image_0263899322.png b/umn/source/_static/images/en-us_image_0263899322.png new file mode 100644 index 0000000..e837946 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899322.png differ diff --git a/umn/source/_static/images/en-us_image_0263899323.png b/umn/source/_static/images/en-us_image_0263899323.png new file mode 100644 index 0000000..3939730 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899323.png differ diff --git a/umn/source/_static/images/en-us_image_0263899329.png b/umn/source/_static/images/en-us_image_0263899329.png new file mode 100644 index 0000000..3dc4101 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899329.png differ diff --git a/umn/source/_static/images/en-us_image_0263899334.png b/umn/source/_static/images/en-us_image_0263899334.png new file mode 100644 index 0000000..6cb0014 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899334.png differ diff --git a/umn/source/_static/images/en-us_image_0263899339.png b/umn/source/_static/images/en-us_image_0263899339.png new file mode 100644 index 0000000..a2bd0b9 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899339.png differ diff --git a/umn/source/_static/images/en-us_image_0263899343.png b/umn/source/_static/images/en-us_image_0263899343.png new file mode 100644 index 0000000..ff1c635 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899343.png differ diff --git a/umn/source/_static/images/en-us_image_0263899363.png b/umn/source/_static/images/en-us_image_0263899363.png new file mode 100644 index 0000000..a2bd0b9 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899363.png differ diff --git a/umn/source/_static/images/en-us_image_0263899376.png b/umn/source/_static/images/en-us_image_0263899376.png new file mode 100644 index 0000000..bf999ea Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899376.png 
differ diff --git a/umn/source/_static/images/en-us_image_0263899383.png b/umn/source/_static/images/en-us_image_0263899383.png new file mode 100644 index 0000000..7b087d4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899383.png differ diff --git a/umn/source/_static/images/en-us_image_0263899384.png b/umn/source/_static/images/en-us_image_0263899384.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899384.png differ diff --git a/umn/source/_static/images/en-us_image_0263899392.png b/umn/source/_static/images/en-us_image_0263899392.png new file mode 100644 index 0000000..6cb0014 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899392.png differ diff --git a/umn/source/_static/images/en-us_image_0263899401.png b/umn/source/_static/images/en-us_image_0263899401.png new file mode 100644 index 0000000..6cb0014 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899401.png differ diff --git a/umn/source/_static/images/en-us_image_0263899403.png b/umn/source/_static/images/en-us_image_0263899403.png new file mode 100644 index 0000000..fc77b19 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899403.png differ diff --git a/umn/source/_static/images/en-us_image_0263899406.png b/umn/source/_static/images/en-us_image_0263899406.png new file mode 100644 index 0000000..6fa6320 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899406.png differ diff --git a/umn/source/_static/images/en-us_image_0263899409.png b/umn/source/_static/images/en-us_image_0263899409.png new file mode 100644 index 0000000..bf43e20 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899409.png differ diff --git a/umn/source/_static/images/en-us_image_0263899411.png b/umn/source/_static/images/en-us_image_0263899411.png new file mode 100644 index 0000000..90cb2a2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899411.png differ diff --git a/umn/source/_static/images/en-us_image_0263899420.png b/umn/source/_static/images/en-us_image_0263899420.png new file mode 100644 index 0000000..244891e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899420.png differ diff --git a/umn/source/_static/images/en-us_image_0263899424.png b/umn/source/_static/images/en-us_image_0263899424.png new file mode 100644 index 0000000..f2663b3 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899424.png differ diff --git a/umn/source/_static/images/en-us_image_0263899436.png b/umn/source/_static/images/en-us_image_0263899436.png new file mode 100644 index 0000000..5a5ab71 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899436.png differ diff --git a/umn/source/_static/images/en-us_image_0263899446.png b/umn/source/_static/images/en-us_image_0263899446.png new file mode 100644 index 0000000..f016a8c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899446.png differ diff --git a/umn/source/_static/images/en-us_image_0263899452.png b/umn/source/_static/images/en-us_image_0263899452.png new file mode 100644 index 0000000..7b087d4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899452.png differ diff --git a/umn/source/_static/images/en-us_image_0263899453.png b/umn/source/_static/images/en-us_image_0263899453.png new file mode 100644 index 0000000..5af109f Binary files /dev/null and 
b/umn/source/_static/images/en-us_image_0263899453.png differ diff --git a/umn/source/_static/images/en-us_image_0263899454.png b/umn/source/_static/images/en-us_image_0263899454.png new file mode 100644 index 0000000..2763efb Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899454.png differ diff --git a/umn/source/_static/images/en-us_image_0263899456.png b/umn/source/_static/images/en-us_image_0263899456.png new file mode 100644 index 0000000..5398ae0 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899456.png differ diff --git a/umn/source/_static/images/en-us_image_0263899458.png b/umn/source/_static/images/en-us_image_0263899458.png new file mode 100644 index 0000000..bf999ea Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899458.png differ diff --git a/umn/source/_static/images/en-us_image_0263899471.png b/umn/source/_static/images/en-us_image_0263899471.png new file mode 100644 index 0000000..2fb0abd Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899471.png differ diff --git a/umn/source/_static/images/en-us_image_0263899476.png b/umn/source/_static/images/en-us_image_0263899476.png new file mode 100644 index 0000000..a2bd0b9 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899476.png differ diff --git a/umn/source/_static/images/en-us_image_0263899493.png b/umn/source/_static/images/en-us_image_0263899493.png new file mode 100644 index 0000000..6d6b073 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899493.png differ diff --git a/umn/source/_static/images/en-us_image_0263899495.png b/umn/source/_static/images/en-us_image_0263899495.png new file mode 100644 index 0000000..b6496e0 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899495.png differ diff --git a/umn/source/_static/images/en-us_image_0263899496.png b/umn/source/_static/images/en-us_image_0263899496.png new file mode 100644 index 0000000..f7d769f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899496.png differ diff --git a/umn/source/_static/images/en-us_image_0263899498.png b/umn/source/_static/images/en-us_image_0263899498.png new file mode 100644 index 0000000..f7d769f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899498.png differ diff --git a/umn/source/_static/images/en-us_image_0263899504.png b/umn/source/_static/images/en-us_image_0263899504.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899504.png differ diff --git a/umn/source/_static/images/en-us_image_0263899520.png b/umn/source/_static/images/en-us_image_0263899520.png new file mode 100644 index 0000000..df8beea Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899520.png differ diff --git a/umn/source/_static/images/en-us_image_0263899528.png b/umn/source/_static/images/en-us_image_0263899528.png new file mode 100644 index 0000000..3dc4101 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899528.png differ diff --git a/umn/source/_static/images/en-us_image_0263899531.png b/umn/source/_static/images/en-us_image_0263899531.png new file mode 100644 index 0000000..9c40b10 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899531.png differ diff --git a/umn/source/_static/images/en-us_image_0263899538.png b/umn/source/_static/images/en-us_image_0263899538.png new file mode 100644 index 0000000..a4a1f3b Binary files /dev/null 
and b/umn/source/_static/images/en-us_image_0263899538.png differ diff --git a/umn/source/_static/images/en-us_image_0263899546.png b/umn/source/_static/images/en-us_image_0263899546.png new file mode 100644 index 0000000..131087a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899546.png differ diff --git a/umn/source/_static/images/en-us_image_0263899555.png b/umn/source/_static/images/en-us_image_0263899555.png new file mode 100644 index 0000000..fdc1de4 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899555.png differ diff --git a/umn/source/_static/images/en-us_image_0263899556.png b/umn/source/_static/images/en-us_image_0263899556.png new file mode 100644 index 0000000..72ba087 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899556.png differ diff --git a/umn/source/_static/images/en-us_image_0263899570.png b/umn/source/_static/images/en-us_image_0263899570.png new file mode 100644 index 0000000..f63e57c Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899570.png differ diff --git a/umn/source/_static/images/en-us_image_0263899582.png b/umn/source/_static/images/en-us_image_0263899582.png new file mode 100644 index 0000000..1199dfc Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899582.png differ diff --git a/umn/source/_static/images/en-us_image_0263899589.png b/umn/source/_static/images/en-us_image_0263899589.png new file mode 100644 index 0000000..b3f1439 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899589.png differ diff --git a/umn/source/_static/images/en-us_image_0263899593.png b/umn/source/_static/images/en-us_image_0263899593.png new file mode 100644 index 0000000..188277e Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899593.png differ diff --git a/umn/source/_static/images/en-us_image_0263899597.png b/umn/source/_static/images/en-us_image_0263899597.png new file mode 100644 index 0000000..e9124b2 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899597.png differ diff --git a/umn/source/_static/images/en-us_image_0263899610.png b/umn/source/_static/images/en-us_image_0263899610.png new file mode 100644 index 0000000..2fb0abd Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899610.png differ diff --git a/umn/source/_static/images/en-us_image_0263899616.png b/umn/source/_static/images/en-us_image_0263899616.png new file mode 100644 index 0000000..48f827b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899616.png differ diff --git a/umn/source/_static/images/en-us_image_0263899617.png b/umn/source/_static/images/en-us_image_0263899617.png new file mode 100644 index 0000000..18df814 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899617.png differ diff --git a/umn/source/_static/images/en-us_image_0263899621.png b/umn/source/_static/images/en-us_image_0263899621.png new file mode 100644 index 0000000..bf999ea Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899621.png differ diff --git a/umn/source/_static/images/en-us_image_0263899635.png b/umn/source/_static/images/en-us_image_0263899635.png new file mode 100644 index 0000000..bf999ea Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899635.png differ diff --git a/umn/source/_static/images/en-us_image_0263899637.png b/umn/source/_static/images/en-us_image_0263899637.png new file mode 100644 index 0000000..31dd12f Binary files 
/dev/null and b/umn/source/_static/images/en-us_image_0263899637.png differ diff --git a/umn/source/_static/images/en-us_image_0263899641.png b/umn/source/_static/images/en-us_image_0263899641.png new file mode 100644 index 0000000..cccdb27 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899641.png differ diff --git a/umn/source/_static/images/en-us_image_0263899644.png b/umn/source/_static/images/en-us_image_0263899644.png new file mode 100644 index 0000000..60720ff Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899644.png differ diff --git a/umn/source/_static/images/en-us_image_0263899649.png b/umn/source/_static/images/en-us_image_0263899649.png new file mode 100644 index 0000000..d4ef748 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899649.png differ diff --git a/umn/source/_static/images/en-us_image_0263899656.png b/umn/source/_static/images/en-us_image_0263899656.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899656.png differ diff --git a/umn/source/_static/images/en-us_image_0263899662.png b/umn/source/_static/images/en-us_image_0263899662.png new file mode 100644 index 0000000..189cc44 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899662.png differ diff --git a/umn/source/_static/images/en-us_image_0263899673.png b/umn/source/_static/images/en-us_image_0263899673.png new file mode 100644 index 0000000..df8beea Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899673.png differ diff --git a/umn/source/_static/images/en-us_image_0263899675.png b/umn/source/_static/images/en-us_image_0263899675.png new file mode 100644 index 0000000..48f827b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0263899675.png differ diff --git a/umn/source/_static/images/en-us_image_0264281176.png b/umn/source/_static/images/en-us_image_0264281176.png new file mode 100644 index 0000000..3eefcea Binary files /dev/null and b/umn/source/_static/images/en-us_image_0264281176.png differ diff --git a/umn/source/_static/images/en-us_image_0265768517.png b/umn/source/_static/images/en-us_image_0265768517.png new file mode 100644 index 0000000..9ccebe3 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0265768517.png differ diff --git a/umn/source/_static/images/en-us_image_0267694670.jpg b/umn/source/_static/images/en-us_image_0267694670.jpg new file mode 100644 index 0000000..ded4813 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0267694670.jpg differ diff --git a/umn/source/_static/images/en-us_image_0268284607.png b/umn/source/_static/images/en-us_image_0268284607.png new file mode 100644 index 0000000..4199adb Binary files /dev/null and b/umn/source/_static/images/en-us_image_0268284607.png differ diff --git a/umn/source/_static/images/en-us_image_0268298534.png b/umn/source/_static/images/en-us_image_0268298534.png new file mode 100644 index 0000000..2bde5e3 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0268298534.png differ diff --git a/umn/source/_static/images/en-us_image_0269383808.png b/umn/source/_static/images/en-us_image_0269383808.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383808.png differ diff --git a/umn/source/_static/images/en-us_image_0269383809.png b/umn/source/_static/images/en-us_image_0269383809.png new file mode 100644 index 0000000..61ba034 Binary 
files /dev/null and b/umn/source/_static/images/en-us_image_0269383809.png differ diff --git a/umn/source/_static/images/en-us_image_0269383810.png b/umn/source/_static/images/en-us_image_0269383810.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383810.png differ diff --git a/umn/source/_static/images/en-us_image_0269383814.png b/umn/source/_static/images/en-us_image_0269383814.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383814.png differ diff --git a/umn/source/_static/images/en-us_image_0269383815.png b/umn/source/_static/images/en-us_image_0269383815.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383815.png differ diff --git a/umn/source/_static/images/en-us_image_0269383816.png b/umn/source/_static/images/en-us_image_0269383816.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383816.png differ diff --git a/umn/source/_static/images/en-us_image_0269383817.png b/umn/source/_static/images/en-us_image_0269383817.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383817.png differ diff --git a/umn/source/_static/images/en-us_image_0269383818.png b/umn/source/_static/images/en-us_image_0269383818.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383818.png differ diff --git a/umn/source/_static/images/en-us_image_0269383822.png b/umn/source/_static/images/en-us_image_0269383822.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383822.png differ diff --git a/umn/source/_static/images/en-us_image_0269383823.png b/umn/source/_static/images/en-us_image_0269383823.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383823.png differ diff --git a/umn/source/_static/images/en-us_image_0269383824.png b/umn/source/_static/images/en-us_image_0269383824.png new file mode 100644 index 0000000..1d12a23 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383824.png differ diff --git a/umn/source/_static/images/en-us_image_0269383826.png b/umn/source/_static/images/en-us_image_0269383826.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383826.png differ diff --git a/umn/source/_static/images/en-us_image_0269383828.png b/umn/source/_static/images/en-us_image_0269383828.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383828.png differ diff --git a/umn/source/_static/images/en-us_image_0269383829.png b/umn/source/_static/images/en-us_image_0269383829.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383829.png differ diff --git a/umn/source/_static/images/en-us_image_0269383830.png b/umn/source/_static/images/en-us_image_0269383830.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383830.png differ diff --git a/umn/source/_static/images/en-us_image_0269383831.png b/umn/source/_static/images/en-us_image_0269383831.png new file mode 100644 index 0000000..61ba034 
Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383831.png differ diff --git a/umn/source/_static/images/en-us_image_0269383832.png b/umn/source/_static/images/en-us_image_0269383832.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383832.png differ diff --git a/umn/source/_static/images/en-us_image_0269383834.png b/umn/source/_static/images/en-us_image_0269383834.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383834.png differ diff --git a/umn/source/_static/images/en-us_image_0269383843.png b/umn/source/_static/images/en-us_image_0269383843.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383843.png differ diff --git a/umn/source/_static/images/en-us_image_0269383844.png b/umn/source/_static/images/en-us_image_0269383844.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383844.png differ diff --git a/umn/source/_static/images/en-us_image_0269383845.png b/umn/source/_static/images/en-us_image_0269383845.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383845.png differ diff --git a/umn/source/_static/images/en-us_image_0269383846.png b/umn/source/_static/images/en-us_image_0269383846.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383846.png differ diff --git a/umn/source/_static/images/en-us_image_0269383850.png b/umn/source/_static/images/en-us_image_0269383850.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383850.png differ diff --git a/umn/source/_static/images/en-us_image_0269383851.png b/umn/source/_static/images/en-us_image_0269383851.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383851.png differ diff --git a/umn/source/_static/images/en-us_image_0269383852.png b/umn/source/_static/images/en-us_image_0269383852.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383852.png differ diff --git a/umn/source/_static/images/en-us_image_0269383855.jpg b/umn/source/_static/images/en-us_image_0269383855.jpg new file mode 100644 index 0000000..5876faf Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383855.jpg differ diff --git a/umn/source/_static/images/en-us_image_0269383856.png b/umn/source/_static/images/en-us_image_0269383856.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383856.png differ diff --git a/umn/source/_static/images/en-us_image_0269383857.png b/umn/source/_static/images/en-us_image_0269383857.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383857.png differ diff --git a/umn/source/_static/images/en-us_image_0269383872.png b/umn/source/_static/images/en-us_image_0269383872.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383872.png differ diff --git a/umn/source/_static/images/en-us_image_0269383873.png b/umn/source/_static/images/en-us_image_0269383873.png new file mode 100644 index 
0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383873.png differ diff --git a/umn/source/_static/images/en-us_image_0269383875.png b/umn/source/_static/images/en-us_image_0269383875.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383875.png differ diff --git a/umn/source/_static/images/en-us_image_0269383876.png b/umn/source/_static/images/en-us_image_0269383876.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383876.png differ diff --git a/umn/source/_static/images/en-us_image_0269383877.png b/umn/source/_static/images/en-us_image_0269383877.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383877.png differ diff --git a/umn/source/_static/images/en-us_image_0269383878.png b/umn/source/_static/images/en-us_image_0269383878.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383878.png differ diff --git a/umn/source/_static/images/en-us_image_0269383880.png b/umn/source/_static/images/en-us_image_0269383880.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383880.png differ diff --git a/umn/source/_static/images/en-us_image_0269383881.png b/umn/source/_static/images/en-us_image_0269383881.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383881.png differ diff --git a/umn/source/_static/images/en-us_image_0269383882.png b/umn/source/_static/images/en-us_image_0269383882.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383882.png differ diff --git a/umn/source/_static/images/en-us_image_0269383883.png b/umn/source/_static/images/en-us_image_0269383883.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383883.png differ diff --git a/umn/source/_static/images/en-us_image_0269383884.png b/umn/source/_static/images/en-us_image_0269383884.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383884.png differ diff --git a/umn/source/_static/images/en-us_image_0269383889.png b/umn/source/_static/images/en-us_image_0269383889.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383889.png differ diff --git a/umn/source/_static/images/en-us_image_0269383890.png b/umn/source/_static/images/en-us_image_0269383890.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383890.png differ diff --git a/umn/source/_static/images/en-us_image_0269383906.png b/umn/source/_static/images/en-us_image_0269383906.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383906.png differ diff --git a/umn/source/_static/images/en-us_image_0269383907.png b/umn/source/_static/images/en-us_image_0269383907.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383907.png differ diff --git a/umn/source/_static/images/en-us_image_0269383908.png b/umn/source/_static/images/en-us_image_0269383908.png new file mode 100644 
index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383908.png differ diff --git a/umn/source/_static/images/en-us_image_0269383909.png b/umn/source/_static/images/en-us_image_0269383909.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383909.png differ diff --git a/umn/source/_static/images/en-us_image_0269383915.png b/umn/source/_static/images/en-us_image_0269383915.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383915.png differ diff --git a/umn/source/_static/images/en-us_image_0269383916.png b/umn/source/_static/images/en-us_image_0269383916.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383916.png differ diff --git a/umn/source/_static/images/en-us_image_0269383917.png b/umn/source/_static/images/en-us_image_0269383917.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383917.png differ diff --git a/umn/source/_static/images/en-us_image_0269383918.png b/umn/source/_static/images/en-us_image_0269383918.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383918.png differ diff --git a/umn/source/_static/images/en-us_image_0269383919.png b/umn/source/_static/images/en-us_image_0269383919.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383919.png differ diff --git a/umn/source/_static/images/en-us_image_0269383920.png b/umn/source/_static/images/en-us_image_0269383920.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383920.png differ diff --git a/umn/source/_static/images/en-us_image_0269383921.jpg b/umn/source/_static/images/en-us_image_0269383921.jpg new file mode 100644 index 0000000..8821f24 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383921.jpg differ diff --git a/umn/source/_static/images/en-us_image_0269383922.png b/umn/source/_static/images/en-us_image_0269383922.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383922.png differ diff --git a/umn/source/_static/images/en-us_image_0269383923.png b/umn/source/_static/images/en-us_image_0269383923.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383923.png differ diff --git a/umn/source/_static/images/en-us_image_0269383924.png b/umn/source/_static/images/en-us_image_0269383924.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383924.png differ diff --git a/umn/source/_static/images/en-us_image_0269383925.png b/umn/source/_static/images/en-us_image_0269383925.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383925.png differ diff --git a/umn/source/_static/images/en-us_image_0269383926.png b/umn/source/_static/images/en-us_image_0269383926.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383926.png differ diff --git a/umn/source/_static/images/en-us_image_0269383927.png b/umn/source/_static/images/en-us_image_0269383927.png new file mode 
100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383927.png differ diff --git a/umn/source/_static/images/en-us_image_0269383928.png b/umn/source/_static/images/en-us_image_0269383928.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383928.png differ diff --git a/umn/source/_static/images/en-us_image_0269383929.png b/umn/source/_static/images/en-us_image_0269383929.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383929.png differ diff --git a/umn/source/_static/images/en-us_image_0269383930.png b/umn/source/_static/images/en-us_image_0269383930.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383930.png differ diff --git a/umn/source/_static/images/en-us_image_0269383932.png b/umn/source/_static/images/en-us_image_0269383932.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383932.png differ diff --git a/umn/source/_static/images/en-us_image_0269383933.png b/umn/source/_static/images/en-us_image_0269383933.png new file mode 100644 index 0000000..aac0a1a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383933.png differ diff --git a/umn/source/_static/images/en-us_image_0269383934.png b/umn/source/_static/images/en-us_image_0269383934.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383934.png differ diff --git a/umn/source/_static/images/en-us_image_0269383936.png b/umn/source/_static/images/en-us_image_0269383936.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383936.png differ diff --git a/umn/source/_static/images/en-us_image_0269383940.png b/umn/source/_static/images/en-us_image_0269383940.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383940.png differ diff --git a/umn/source/_static/images/en-us_image_0269383942.png b/umn/source/_static/images/en-us_image_0269383942.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383942.png differ diff --git a/umn/source/_static/images/en-us_image_0269383943.png b/umn/source/_static/images/en-us_image_0269383943.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383943.png differ diff --git a/umn/source/_static/images/en-us_image_0269383945.png b/umn/source/_static/images/en-us_image_0269383945.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383945.png differ diff --git a/umn/source/_static/images/en-us_image_0269383946.png b/umn/source/_static/images/en-us_image_0269383946.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383946.png differ diff --git a/umn/source/_static/images/en-us_image_0269383949.png b/umn/source/_static/images/en-us_image_0269383949.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383949.png differ diff --git a/umn/source/_static/images/en-us_image_0269383950.gif b/umn/source/_static/images/en-us_image_0269383950.gif new file 
mode 100644 index 0000000..4cab727 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383950.gif differ diff --git a/umn/source/_static/images/en-us_image_0269383952.png b/umn/source/_static/images/en-us_image_0269383952.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383952.png differ diff --git a/umn/source/_static/images/en-us_image_0269383953.png b/umn/source/_static/images/en-us_image_0269383953.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383953.png differ diff --git a/umn/source/_static/images/en-us_image_0269383956.png b/umn/source/_static/images/en-us_image_0269383956.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383956.png differ diff --git a/umn/source/_static/images/en-us_image_0269383957.png b/umn/source/_static/images/en-us_image_0269383957.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383957.png differ diff --git a/umn/source/_static/images/en-us_image_0269383958.png b/umn/source/_static/images/en-us_image_0269383958.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383958.png differ diff --git a/umn/source/_static/images/en-us_image_0269383960.png b/umn/source/_static/images/en-us_image_0269383960.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383960.png differ diff --git a/umn/source/_static/images/en-us_image_0269383961.png b/umn/source/_static/images/en-us_image_0269383961.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383961.png differ diff --git a/umn/source/_static/images/en-us_image_0269383962.png b/umn/source/_static/images/en-us_image_0269383962.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383962.png differ diff --git a/umn/source/_static/images/en-us_image_0269383963.png b/umn/source/_static/images/en-us_image_0269383963.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383963.png differ diff --git a/umn/source/_static/images/en-us_image_0269383964.png b/umn/source/_static/images/en-us_image_0269383964.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383964.png differ diff --git a/umn/source/_static/images/en-us_image_0269383966.png b/umn/source/_static/images/en-us_image_0269383966.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383966.png differ diff --git a/umn/source/_static/images/en-us_image_0269383967.png b/umn/source/_static/images/en-us_image_0269383967.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383967.png differ diff --git a/umn/source/_static/images/en-us_image_0269383968.png b/umn/source/_static/images/en-us_image_0269383968.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383968.png differ diff --git a/umn/source/_static/images/en-us_image_0269383969.png b/umn/source/_static/images/en-us_image_0269383969.png new 
file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383969.png differ diff --git a/umn/source/_static/images/en-us_image_0269383970.png b/umn/source/_static/images/en-us_image_0269383970.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383970.png differ diff --git a/umn/source/_static/images/en-us_image_0269383972.png b/umn/source/_static/images/en-us_image_0269383972.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269383972.png differ diff --git a/umn/source/_static/images/en-us_image_0269417342.png b/umn/source/_static/images/en-us_image_0269417342.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417342.png differ diff --git a/umn/source/_static/images/en-us_image_0269417343.png b/umn/source/_static/images/en-us_image_0269417343.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417343.png differ diff --git a/umn/source/_static/images/en-us_image_0269417344.png b/umn/source/_static/images/en-us_image_0269417344.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417344.png differ diff --git a/umn/source/_static/images/en-us_image_0269417365.png b/umn/source/_static/images/en-us_image_0269417365.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417365.png differ diff --git a/umn/source/_static/images/en-us_image_0269417366.png b/umn/source/_static/images/en-us_image_0269417366.png new file mode 100644 index 0000000..cf2d1d5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417366.png differ diff --git a/umn/source/_static/images/en-us_image_0269417367.png b/umn/source/_static/images/en-us_image_0269417367.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417367.png differ diff --git a/umn/source/_static/images/en-us_image_0269417368.png b/umn/source/_static/images/en-us_image_0269417368.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417368.png differ diff --git a/umn/source/_static/images/en-us_image_0269417369.png b/umn/source/_static/images/en-us_image_0269417369.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417369.png differ diff --git a/umn/source/_static/images/en-us_image_0269417370.png b/umn/source/_static/images/en-us_image_0269417370.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417370.png differ diff --git a/umn/source/_static/images/en-us_image_0269417373.png b/umn/source/_static/images/en-us_image_0269417373.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417373.png differ diff --git a/umn/source/_static/images/en-us_image_0269417374.png b/umn/source/_static/images/en-us_image_0269417374.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417374.png differ diff --git a/umn/source/_static/images/en-us_image_0269417375.png b/umn/source/_static/images/en-us_image_0269417375.png 
new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417375.png differ diff --git a/umn/source/_static/images/en-us_image_0269417376.png b/umn/source/_static/images/en-us_image_0269417376.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417376.png differ diff --git a/umn/source/_static/images/en-us_image_0269417377.png b/umn/source/_static/images/en-us_image_0269417377.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417377.png differ diff --git a/umn/source/_static/images/en-us_image_0269417379.png b/umn/source/_static/images/en-us_image_0269417379.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417379.png differ diff --git a/umn/source/_static/images/en-us_image_0269417380.png b/umn/source/_static/images/en-us_image_0269417380.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417380.png differ diff --git a/umn/source/_static/images/en-us_image_0269417381.png b/umn/source/_static/images/en-us_image_0269417381.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417381.png differ diff --git a/umn/source/_static/images/en-us_image_0269417382.png b/umn/source/_static/images/en-us_image_0269417382.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417382.png differ diff --git a/umn/source/_static/images/en-us_image_0269417383.png b/umn/source/_static/images/en-us_image_0269417383.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417383.png differ diff --git a/umn/source/_static/images/en-us_image_0269417384.png b/umn/source/_static/images/en-us_image_0269417384.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417384.png differ diff --git a/umn/source/_static/images/en-us_image_0269417385.png b/umn/source/_static/images/en-us_image_0269417385.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417385.png differ diff --git a/umn/source/_static/images/en-us_image_0269417388.png b/umn/source/_static/images/en-us_image_0269417388.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417388.png differ diff --git a/umn/source/_static/images/en-us_image_0269417389.png b/umn/source/_static/images/en-us_image_0269417389.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417389.png differ diff --git a/umn/source/_static/images/en-us_image_0269417390.png b/umn/source/_static/images/en-us_image_0269417390.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417390.png differ diff --git a/umn/source/_static/images/en-us_image_0269417391.png b/umn/source/_static/images/en-us_image_0269417391.png new file mode 100644 index 0000000..ff19773 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417391.png differ diff --git a/umn/source/_static/images/en-us_image_0269417392.png 
b/umn/source/_static/images/en-us_image_0269417392.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417392.png differ diff --git a/umn/source/_static/images/en-us_image_0269417393.png b/umn/source/_static/images/en-us_image_0269417393.png new file mode 100644 index 0000000..a1abf6a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417393.png differ diff --git a/umn/source/_static/images/en-us_image_0269417394.png b/umn/source/_static/images/en-us_image_0269417394.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417394.png differ diff --git a/umn/source/_static/images/en-us_image_0269417395.png b/umn/source/_static/images/en-us_image_0269417395.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417395.png differ diff --git a/umn/source/_static/images/en-us_image_0269417396.png b/umn/source/_static/images/en-us_image_0269417396.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417396.png differ diff --git a/umn/source/_static/images/en-us_image_0269417397.png b/umn/source/_static/images/en-us_image_0269417397.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417397.png differ diff --git a/umn/source/_static/images/en-us_image_0269417398.png b/umn/source/_static/images/en-us_image_0269417398.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417398.png differ diff --git a/umn/source/_static/images/en-us_image_0269417399.png b/umn/source/_static/images/en-us_image_0269417399.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417399.png differ diff --git a/umn/source/_static/images/en-us_image_0269417400.png b/umn/source/_static/images/en-us_image_0269417400.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417400.png differ diff --git a/umn/source/_static/images/en-us_image_0269417401.png b/umn/source/_static/images/en-us_image_0269417401.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417401.png differ diff --git a/umn/source/_static/images/en-us_image_0269417402.png b/umn/source/_static/images/en-us_image_0269417402.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417402.png differ diff --git a/umn/source/_static/images/en-us_image_0269417403.png b/umn/source/_static/images/en-us_image_0269417403.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417403.png differ diff --git a/umn/source/_static/images/en-us_image_0269417404.png b/umn/source/_static/images/en-us_image_0269417404.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417404.png differ diff --git a/umn/source/_static/images/en-us_image_0269417405.png b/umn/source/_static/images/en-us_image_0269417405.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417405.png differ diff --git 
a/umn/source/_static/images/en-us_image_0269417406.png b/umn/source/_static/images/en-us_image_0269417406.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417406.png differ diff --git a/umn/source/_static/images/en-us_image_0269417409.png b/umn/source/_static/images/en-us_image_0269417409.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417409.png differ diff --git a/umn/source/_static/images/en-us_image_0269417410.png b/umn/source/_static/images/en-us_image_0269417410.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417410.png differ diff --git a/umn/source/_static/images/en-us_image_0269417413.png b/umn/source/_static/images/en-us_image_0269417413.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417413.png differ diff --git a/umn/source/_static/images/en-us_image_0269417414.png b/umn/source/_static/images/en-us_image_0269417414.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417414.png differ diff --git a/umn/source/_static/images/en-us_image_0269417415.png b/umn/source/_static/images/en-us_image_0269417415.png new file mode 100644 index 0000000..d6a0d7a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417415.png differ diff --git a/umn/source/_static/images/en-us_image_0269417416.png b/umn/source/_static/images/en-us_image_0269417416.png new file mode 100644 index 0000000..955f1b6 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417416.png differ diff --git a/umn/source/_static/images/en-us_image_0269417417.png b/umn/source/_static/images/en-us_image_0269417417.png new file mode 100644 index 0000000..08b0ca5 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417417.png differ diff --git a/umn/source/_static/images/en-us_image_0269417418.png b/umn/source/_static/images/en-us_image_0269417418.png new file mode 100644 index 0000000..d52dbee Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417418.png differ diff --git a/umn/source/_static/images/en-us_image_0269417419.png b/umn/source/_static/images/en-us_image_0269417419.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417419.png differ diff --git a/umn/source/_static/images/en-us_image_0269417420.png b/umn/source/_static/images/en-us_image_0269417420.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417420.png differ diff --git a/umn/source/_static/images/en-us_image_0269417421.png b/umn/source/_static/images/en-us_image_0269417421.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417421.png differ diff --git a/umn/source/_static/images/en-us_image_0269417422.png b/umn/source/_static/images/en-us_image_0269417422.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417422.png differ diff --git a/umn/source/_static/images/en-us_image_0269417423.png b/umn/source/_static/images/en-us_image_0269417423.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417423.png differ diff 
--git a/umn/source/_static/images/en-us_image_0269417427.png b/umn/source/_static/images/en-us_image_0269417427.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417427.png differ diff --git a/umn/source/_static/images/en-us_image_0269417428.png b/umn/source/_static/images/en-us_image_0269417428.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417428.png differ diff --git a/umn/source/_static/images/en-us_image_0269417429.png b/umn/source/_static/images/en-us_image_0269417429.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417429.png differ diff --git a/umn/source/_static/images/en-us_image_0269417439.png b/umn/source/_static/images/en-us_image_0269417439.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417439.png differ diff --git a/umn/source/_static/images/en-us_image_0269417447.png b/umn/source/_static/images/en-us_image_0269417447.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417447.png differ diff --git a/umn/source/_static/images/en-us_image_0269417449.png b/umn/source/_static/images/en-us_image_0269417449.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417449.png differ diff --git a/umn/source/_static/images/en-us_image_0269417455.png b/umn/source/_static/images/en-us_image_0269417455.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417455.png differ diff --git a/umn/source/_static/images/en-us_image_0269417456.png b/umn/source/_static/images/en-us_image_0269417456.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417456.png differ diff --git a/umn/source/_static/images/en-us_image_0269417458.png b/umn/source/_static/images/en-us_image_0269417458.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417458.png differ diff --git a/umn/source/_static/images/en-us_image_0269417459.png b/umn/source/_static/images/en-us_image_0269417459.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417459.png differ diff --git a/umn/source/_static/images/en-us_image_0269417460.png b/umn/source/_static/images/en-us_image_0269417460.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417460.png differ diff --git a/umn/source/_static/images/en-us_image_0269417461.png b/umn/source/_static/images/en-us_image_0269417461.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417461.png differ diff --git a/umn/source/_static/images/en-us_image_0269417462.png b/umn/source/_static/images/en-us_image_0269417462.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417462.png differ diff --git a/umn/source/_static/images/en-us_image_0269417463.png b/umn/source/_static/images/en-us_image_0269417463.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417463.png differ 
diff --git a/umn/source/_static/images/en-us_image_0269417464.png b/umn/source/_static/images/en-us_image_0269417464.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417464.png differ diff --git a/umn/source/_static/images/en-us_image_0269417465.png b/umn/source/_static/images/en-us_image_0269417465.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417465.png differ diff --git a/umn/source/_static/images/en-us_image_0269417466.png b/umn/source/_static/images/en-us_image_0269417466.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417466.png differ diff --git a/umn/source/_static/images/en-us_image_0269417468.png b/umn/source/_static/images/en-us_image_0269417468.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417468.png differ diff --git a/umn/source/_static/images/en-us_image_0269417469.png b/umn/source/_static/images/en-us_image_0269417469.png new file mode 100644 index 0000000..10c6287 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417469.png differ diff --git a/umn/source/_static/images/en-us_image_0269417473.png b/umn/source/_static/images/en-us_image_0269417473.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417473.png differ diff --git a/umn/source/_static/images/en-us_image_0269417499.png b/umn/source/_static/images/en-us_image_0269417499.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417499.png differ diff --git a/umn/source/_static/images/en-us_image_0269417500.png b/umn/source/_static/images/en-us_image_0269417500.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417500.png differ diff --git a/umn/source/_static/images/en-us_image_0269417501.png b/umn/source/_static/images/en-us_image_0269417501.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417501.png differ diff --git a/umn/source/_static/images/en-us_image_0269417502.png b/umn/source/_static/images/en-us_image_0269417502.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417502.png differ diff --git a/umn/source/_static/images/en-us_image_0269417503.png b/umn/source/_static/images/en-us_image_0269417503.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417503.png differ diff --git a/umn/source/_static/images/en-us_image_0269417504.png b/umn/source/_static/images/en-us_image_0269417504.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417504.png differ diff --git a/umn/source/_static/images/en-us_image_0269417505.png b/umn/source/_static/images/en-us_image_0269417505.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417505.png differ diff --git a/umn/source/_static/images/en-us_image_0269417506.png b/umn/source/_static/images/en-us_image_0269417506.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417506.png 
differ diff --git a/umn/source/_static/images/en-us_image_0269417510.png b/umn/source/_static/images/en-us_image_0269417510.png new file mode 100644 index 0000000..d6da780 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417510.png differ diff --git a/umn/source/_static/images/en-us_image_0269417534.png b/umn/source/_static/images/en-us_image_0269417534.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417534.png differ diff --git a/umn/source/_static/images/en-us_image_0269417535.png b/umn/source/_static/images/en-us_image_0269417535.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417535.png differ diff --git a/umn/source/_static/images/en-us_image_0269417537.png b/umn/source/_static/images/en-us_image_0269417537.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417537.png differ diff --git a/umn/source/_static/images/en-us_image_0269417538.png b/umn/source/_static/images/en-us_image_0269417538.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417538.png differ diff --git a/umn/source/_static/images/en-us_image_0269417539.png b/umn/source/_static/images/en-us_image_0269417539.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417539.png differ diff --git a/umn/source/_static/images/en-us_image_0269417540.png b/umn/source/_static/images/en-us_image_0269417540.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417540.png differ diff --git a/umn/source/_static/images/en-us_image_0269417541.png b/umn/source/_static/images/en-us_image_0269417541.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417541.png differ diff --git a/umn/source/_static/images/en-us_image_0269417542.png b/umn/source/_static/images/en-us_image_0269417542.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417542.png differ diff --git a/umn/source/_static/images/en-us_image_0269417543.gif b/umn/source/_static/images/en-us_image_0269417543.gif new file mode 100644 index 0000000..7242870 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417543.gif differ diff --git a/umn/source/_static/images/en-us_image_0269417544.gif b/umn/source/_static/images/en-us_image_0269417544.gif new file mode 100644 index 0000000..7242870 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417544.gif differ diff --git a/umn/source/_static/images/en-us_image_0269417545.png b/umn/source/_static/images/en-us_image_0269417545.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417545.png differ diff --git a/umn/source/_static/images/en-us_image_0269417546.png b/umn/source/_static/images/en-us_image_0269417546.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417546.png differ diff --git a/umn/source/_static/images/en-us_image_0269417547.png b/umn/source/_static/images/en-us_image_0269417547.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and 
b/umn/source/_static/images/en-us_image_0269417547.png differ diff --git a/umn/source/_static/images/en-us_image_0269417548.png b/umn/source/_static/images/en-us_image_0269417548.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417548.png differ diff --git a/umn/source/_static/images/en-us_image_0269417549.png b/umn/source/_static/images/en-us_image_0269417549.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269417549.png differ diff --git a/umn/source/_static/images/en-us_image_0269623978.png b/umn/source/_static/images/en-us_image_0269623978.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269623978.png differ diff --git a/umn/source/_static/images/en-us_image_0269624001.png b/umn/source/_static/images/en-us_image_0269624001.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0269624001.png differ diff --git a/umn/source/_static/images/en-us_image_0270938821.png b/umn/source/_static/images/en-us_image_0270938821.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0270938821.png differ diff --git a/umn/source/_static/images/en-us_image_0276137853.png b/umn/source/_static/images/en-us_image_0276137853.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0276137853.png differ diff --git a/umn/source/_static/images/en-us_image_0276137857.png b/umn/source/_static/images/en-us_image_0276137857.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0276137857.png differ diff --git a/umn/source/_static/images/en-us_image_0276137858.png b/umn/source/_static/images/en-us_image_0276137858.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0276137858.png differ diff --git a/umn/source/_static/images/en-us_image_0276137859.png b/umn/source/_static/images/en-us_image_0276137859.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0276137859.png differ diff --git a/umn/source/_static/images/en-us_image_0276801805.png b/umn/source/_static/images/en-us_image_0276801805.png new file mode 100644 index 0000000..70d749f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0276801805.png differ diff --git a/umn/source/_static/images/en-us_image_0278119935.png b/umn/source/_static/images/en-us_image_0278119935.png new file mode 100644 index 0000000..b6496e0 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0278119935.png differ diff --git a/umn/source/_static/images/en-us_image_0279536633.png b/umn/source/_static/images/en-us_image_0279536633.png new file mode 100644 index 0000000..69c940a Binary files /dev/null and b/umn/source/_static/images/en-us_image_0279536633.png differ diff --git a/umn/source/_static/images/en-us_image_0293101307.png b/umn/source/_static/images/en-us_image_0293101307.png new file mode 100644 index 0000000..c991b63 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293101307.png differ diff --git a/umn/source/_static/images/en-us_image_0293131436.png b/umn/source/_static/images/en-us_image_0293131436.png new file mode 100644 index 0000000..d6c0b6e Binary files /dev/null 
and b/umn/source/_static/images/en-us_image_0293131436.png differ diff --git a/umn/source/_static/images/en-us_image_0293180907.png b/umn/source/_static/images/en-us_image_0293180907.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293180907.png differ diff --git a/umn/source/_static/images/en-us_image_0293234930.png b/umn/source/_static/images/en-us_image_0293234930.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293234930.png differ diff --git a/umn/source/_static/images/en-us_image_0293235730.png b/umn/source/_static/images/en-us_image_0293235730.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293235730.png differ diff --git a/umn/source/_static/images/en-us_image_0293242788.png b/umn/source/_static/images/en-us_image_0293242788.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293242788.png differ diff --git a/umn/source/_static/images/en-us_image_0293245149.png b/umn/source/_static/images/en-us_image_0293245149.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293245149.png differ diff --git a/umn/source/_static/images/en-us_image_0293246465.png b/umn/source/_static/images/en-us_image_0293246465.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293246465.png differ diff --git a/umn/source/_static/images/en-us_image_0293246731.png b/umn/source/_static/images/en-us_image_0293246731.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293246731.png differ diff --git a/umn/source/_static/images/en-us_image_0293247048.png b/umn/source/_static/images/en-us_image_0293247048.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293247048.png differ diff --git a/umn/source/_static/images/en-us_image_0293247437.png b/umn/source/_static/images/en-us_image_0293247437.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293247437.png differ diff --git a/umn/source/_static/images/en-us_image_0293267262.png b/umn/source/_static/images/en-us_image_0293267262.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293267262.png differ diff --git a/umn/source/_static/images/en-us_image_0293268152.png b/umn/source/_static/images/en-us_image_0293268152.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293268152.png differ diff --git a/umn/source/_static/images/en-us_image_0293269028.png b/umn/source/_static/images/en-us_image_0293269028.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293269028.png differ diff --git a/umn/source/_static/images/en-us_image_0293269047.png b/umn/source/_static/images/en-us_image_0293269047.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0293269047.png differ diff --git a/umn/source/_static/images/en-us_image_0293269551.png b/umn/source/_static/images/en-us_image_0293269551.png new file mode 100644 index 0000000..61ba034 Binary files 
/dev/null and b/umn/source/_static/images/en-us_image_0293269551.png differ diff --git a/umn/source/_static/images/en-us_image_0295310731.png b/umn/source/_static/images/en-us_image_0295310731.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0295310731.png differ diff --git a/umn/source/_static/images/en-us_image_0295554634.png b/umn/source/_static/images/en-us_image_0295554634.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0295554634.png differ diff --git a/umn/source/_static/images/en-us_image_0295706662.png b/umn/source/_static/images/en-us_image_0295706662.png new file mode 100644 index 0000000..61ba034 Binary files /dev/null and b/umn/source/_static/images/en-us_image_0295706662.png differ diff --git a/umn/source/accessing_manager/accessing_fusioninsight_manager_mrs_3.x_or_later.rst b/umn/source/accessing_manager/accessing_fusioninsight_manager_mrs_3.x_or_later.rst new file mode 100644 index 0000000..fbde94d --- /dev/null +++ b/umn/source/accessing_manager/accessing_fusioninsight_manager_mrs_3.x_or_later.rst @@ -0,0 +1,107 @@ +:original_name: mrs_01_0129.html + +.. _mrs_01_0129: + +Accessing FusionInsight Manager (MRS 3.\ *x* or Later) +====================================================== + +Scenario +-------- + +In MRS 3.\ *x* or later, FusionInsight Manager is used to monitor, configure, and manage clusters. After the cluster is installed, you can use the account to log in to FusionInsight Manager. + +.. note:: + + If you cannot log in to the web UI of a component, access FusionInsight Manager by referring to :ref:`Accessing FusionInsight Manager from an ECS `. + +Accessing FusionInsight Manager Using EIP +----------------------------------------- + +#. Log in to the MRS management console. + +#. In the navigation pane, choose **Clusters** > **Active Clusters**. Click the target cluster name to access the cluster details page. + +#. Click **Access Manager** next to **MRS Manager**. In the displayed dialog box, configure the EIP information. + + a. If no EIP is bound during MRS cluster creation, select an available EIP from the drop-down list on the right of **EIP**. If you have bound an EIP when creating a cluster, go to :ref:`3.b `. + + .. note:: + + If no EIP is available, click **Manage EIP** to create one. Then, select the created EIP from the drop-down list on the right of **EIP**. + + b. .. _mrs_01_0129__li59591846143810: + + Select the security group to which the security group rule to be added belongs. The security group is configured when the cluster is created. + + c. Add a security group rule. By default, the filled-in rule is used to access the EIP. To enable multiple IP address segments to access Manager, see steps :ref:`6 ` to :ref:`9 `. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + + d. Select the information to be confirmed and click **OK**. + +#. Click **OK**. The Manager login page is displayed. + +#. Enter the default username **admin** and the password set during cluster creation, and click **Log In**. The Manager page is displayed. + +#. .. _mrs_01_0129__en-us_topic_0035209594_li1049410469610: + + On the MRS management console, choose **Clusters** > **Active Clusters**. Click the target cluster name to access the cluster details page. + + .. 
note:: + + To grant other users the permission to access Manager, perform :ref:`6 ` to :ref:`9 ` to add the users' public IP addresses to the trusted IP address range. + +#. Click **Add Security Group Rule** on the right of **EIP**. + +#. On the **Add Security Group Rule** page, add the IP address segment for users to access the public network and select **I confirm that public network IP/port is a trusted public IP address. I understand that using 0.0.0.0/0 poses security risks**. + + By default, the IP address used for accessing the public network is filled. You can change the IP address segment as required. To enable multiple IP address segments, repeat steps :ref:`6 ` to :ref:`9 `. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + +#. .. _mrs_01_0129__en-us_topic_0035209594_li035723593115: + + Click **OK**. + +.. _mrs_01_0129__section20880102283115: + +Accessing FusionInsight Manager from an ECS +------------------------------------------- + +#. On the MRS management console, click **Clusters**. + +#. On the **Active Clusters** page, click the name of the specified cluster. + + Record the **AZ**, **VPC**, **MRS Manager**, and **Security Group** of the cluster. + +#. On the homepage of the management console, choose **Service List** > **Elastic Cloud Server** to switch to the ECS management console and create an ECS. + + - The **AZ**, **VPC**, and **Security Group** of the ECS must be the same as those of the cluster to be accessed. + - Select a Windows public image. For example, a standard image **Windows Server 2012 R2 Standard 64bit(40GB)**. + - For details about other configuration parameters, see **Elastic Cloud Server > User Guide > Getting Started > Creating and Logging In to a Windows ECS**. + + .. note:: + + If the security group of the ECS is different from **Default Security Group** of the Master node, you can modify the configuration using either of the following methods: + + - Change the security group of the ECS to the default security group of the Master node. For details, see **Elastic Cloud Server** > **User Guide** > **Security Group** > **Changing a Security Group**. + - Add two security group rules to the security groups of the Master and Core nodes to enable the ECS to access the cluster. Set **Protocol** to **TCP** and **Ports** of the two security group rules to **28443** and **20009**, respectively. For details, see **Virtual Private Cloud > User Guide > Security > Security Group > Adding a Security Group Rule**. + +#. On the VPC management console, apply for an EIP and bind it to the ECS. + + For details, see **Virtual Private Cloud** > **User Guide** > **Elastic IP** > **Assigning an EIP and Binding It to an ECS**. + +#. Log in to the ECS. + + The Windows system account, password, EIP, and the security group rules are required for logging in to the ECS. For details, see **Elastic Cloud Server > User Guide > Instances > Logging In to a Windows ECS**. + +#. On the Windows remote desktop, use your browser to access Manager. + + For example, you can use Internet Explorer 11 in the Windows 2012 OS. + + The address for accessing Manager is the address of the **MRS Manager** page. Enter the name and password of the cluster user, for example, user **admin**. If the Manager page cannot be opened, see the connectivity check at the end of this section. + + .. note:: + + - If you access Manager with other cluster usernames, change the password upon your first access. The new password must meet the requirements of the current password complexity policies. For details, contact the administrator. + - By default, a user is locked after inputting an incorrect password five consecutive times. The user is automatically unlocked after 5 minutes. + +#. Log out of FusionInsight Manager. To log out of Manager, move the cursor to |image1| in the upper right corner and click **Log Out**.
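+
+If the Manager page cannot be opened from the ECS, the security group rules for the two Manager-related ports (**28443** and **20009**) are the most common cause. The following is a minimal connectivity check, not part of the console procedure above, assuming a Linux host in the same VPC and security group (for example, a cluster node) with bash and curl available; on the Windows ECS itself, an equivalent check can be done with the **Test-NetConnection** PowerShell cmdlet. **192.168.0.112** is a hypothetical cluster manager IP address and must be replaced with the address of your cluster.
+
+.. code-block:: bash
+
+   # Hypothetical address; replace it with the cluster manager IP address of your cluster.
+   MGR_IP=192.168.0.112
+
+   # Check that the Manager web port answers HTTP requests.
+   # -k skips certificate validation in case Manager uses a self-signed certificate.
+   curl -k -I "https://${MGR_IP}:28443/web"
+
+   # Plain TCP reachability check for both ports opened by the security group rules.
+   for port in 28443 20009; do
+       timeout 5 bash -c "cat < /dev/null > /dev/tcp/${MGR_IP}/${port}" \
+           && echo "Port ${port} reachable" \
+           || echo "Port ${port} NOT reachable"
+   done
+
+If both ports are unreachable, re-check the security group rules of the Master nodes; if the ports are reachable but the page still does not load, the problem is more likely in the browser or certificate settings.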
+ +.. |image1| image:: /_static/images/en-us_image_0000001349257413.jpg diff --git a/umn/source/accessing_manager/accessing_mrs_manager_mrs_2.1.0_or_earlier.rst b/umn/source/accessing_manager/accessing_mrs_manager_mrs_2.1.0_or_earlier.rst new file mode 100644 index 0000000..9aa5085 --- /dev/null +++ b/umn/source/accessing_manager/accessing_mrs_manager_mrs_2.1.0_or_earlier.rst @@ -0,0 +1,150 @@ +:original_name: mrs_01_0102.html + +.. _mrs_01_0102: + +Accessing MRS Manager (MRS 2.1.0 or Earlier) +============================================ + +Scenario +-------- + +MRS uses FusionInsight Manager to monitor, configure, and manage clusters. You can access FusionInsight Manager by clicking **Access Manager** on the **Dashboard** tab page of your MRS cluster and entering username **admin** and the password configured during cluster creation on the login page that is displayed. + +.. _mrs_01_0102__en-us_topic_0035209594_section1511920110246: + +Accessing FusionInsight Manager Using an EIP +-------------------------------------------- + +#. Log in to the MRS management console. + +#. In the navigation pane, choose **Clusters** > **Active Clusters**. Click the target cluster name to access the cluster details page. + +#. Click **Access Manager** next to **MRS Manager**. In the displayed dialog box, set **Access Mode** to **EIP**. For details about **Direct Connect**, see :ref:`Access Through Direct Connect `. + + a. If no EIP is bound during MRS cluster creation, select an available EIP from the drop-down list on the right of **EIP**. If you have bound an EIP when creating a cluster, go to :ref:`3.b `. + + .. note:: + + If no EIP is available, click **Manage EIP** to create one. Then, select the created EIP from the drop-down list on the right of **EIP**. + + b. .. _mrs_01_0102__li1362714487427: + + Select the security group to which the security group rule to be added belongs. The security group is configured when the cluster is created. + + c. Add a security group rule. By default, your public IP address used for accessing port 9022 is filled in the rule. To enable multiple IP address segments to access MRS Manager, see :ref:`6 ` to :ref:`9 `. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + + .. note:: + + - It is normal that the automatically generated public IP address is different from the local IP address and no action is required. + - If port 9022 is a Knox port, you need to enable the permission of port 9022 to access Knox for accessing MRS Manager. + + d. Select the checkbox stating that **I confirm that xx.xx.xx.xx is a trusted public IP address and MRS Manager can be accessed using this IP address**. + +#. Click **OK**. The MRS Manager login page is displayed. + +#. Enter the default username **admin** and the password set during cluster creation, and click **Log In**. The MRS Manager page is displayed. + +#. .. _mrs_01_0102__en-us_topic_0035209594_li1049410469610: + + On the MRS management console, choose **Clusters** > **Active Clusters**, and click the target cluster name to access the cluster details page. + + .. 
note:: + + To assign MRS Manager access permissions to other users, follow instructions from :ref:`6 ` to :ref:`9 ` to add the users' public IP addresses to the trusted range. + +#. Click **Add Security Group Rule** on the right of **EIP**. + +#. On the **Add Security Group Rule** page, add the IP address segment for users to access the public network and select **I confirm that the authorized object is a trusted public IP address range. Do not use 0.0.0.0/0. Otherwise, security risks may arise**. + + By default, the IP address used for accessing the public network is filled. You can change the IP address segment as required. To enable multiple IP address segments, repeat steps :ref:`6 ` to :ref:`9 `. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + +#. .. _mrs_01_0102__en-us_topic_0035209594_li035723593115: + + Click **OK**. + +Accessing MRS Manager Using an ECS +---------------------------------- + +#. On the MRS management console, click **Clusters**. + +#. On the **Active Clusters** page, click the name of the specified cluster. + + Record the **AZ**, **VPC**, and **Security Group** of the cluster. + +#. On the ECS management console, create an ECS. + + - The **AZ**, **VPC**, and **Security Group** of the ECS must be the same as those of the cluster to be accessed. + - Select a Windows public image. For example, select the enterprise image **Enterprise_Windows_STD_2012R2_20170316-0(80GB)**. + - For details about other configuration parameters, see **Elastic Cloud Server > User Guide > Getting Started > Creating and Logging In to a Windows ECS**. + + .. note:: + + If the security group of the ECS is different from **Default Security Group** of the MRS cluster, you can modify the configuration using either of the following methods: + + - Change the default security group of the ECS to the security group of the MRS cluster. For details, see **Elastic Cloud Server** > **User Guide** > **Security Group** > **Changing a Security Group**. + - Add two security group rules to the security groups of the Master and Core nodes to enable the ECS to access the cluster. Set **Protocol** to **TCP** and **ports** of the two security group rules to **28443** and **20009**, respectively. For details, see **Virtual Private Cloud** > **User Guide** > **Security** > **Security Group** > **Adding a Security Group Rule**. + +#. On the VPC management console, apply for an EIP and bind it to the ECS. + + For details, see **Virtual Private Cloud** > **User Guide** > **Elastic IP** > **Assigning an EIP and Binding It to an ECS**. + +#. Log in to the ECS. + + The Windows system account, password, EIP, and the security group rules are required for logging in to the ECS. For details, see **Elastic Cloud Server > User Guide > Instances > Logging In to a Windows ECS**. + +#. On the Windows remote desktop, use your browser to access Manager. + + For example, you can use Internet Explorer 11 in the Windows 2012 OS. + + The Manager access address is in the format of **https://**\ *Cluster Manager IP Address*\ **:28443/web**. Enter the name and password of the MRS cluster user, for example, user **admin**. + + .. note:: + + - To obtain the cluster manager IP address, remotely log in to the Master2 node, and run the **ifconfig** command. In the command output, **eth0:wsom** indicates the cluster manager IP address. Record the value of **inet**. 
If the cluster manager IP address cannot be queried on the Master2 node, switch to the Master1 node to query and record the cluster manager IP address. If there is only one Master node, query and record the cluster manager IP address of the Master node. + - If you access MRS Manager with other MRS cluster usernames, change the password upon your first access. The new password must meet the requirements of the current password complexity policies. + - By default, a user is locked after inputting an incorrect password five consecutive times. The user is automatically unlocked after 5 minutes. + +#. Log out of FusionInsight Manager. To log out of Manager, move the cursor to |image1| in the upper right corner and click **Log Out**. + +Changing an EIP for a Cluster +----------------------------- + +#. On the MRS management console, choose **Clusters** > **Active Clusters**, and click the target cluster name to access the cluster details page. + +#. View EIPs. + +#. Log in to the VPC management console. + +#. Choose **Elastic IP and Bandwidth** > **EIPs**. + +#. Search for the EIP bound to the MRS cluster and click **Unbind** in the **Operation** column to unbind the EIP from the MRS cluster. + + |image2| + +#. Log in to the MRS management console, choose **Clusters** > **Active Clusters**, and click the target cluster name to access the cluster details page. + + **EIP** on the cluster details page is displayed as **Unbound**. + +#. Click **Access Manager** next to **MRS Manager**. In the displayed dialog box, set **Access Mode** to **EIP**. + +#. Select a new EIP from the EIP drop-down list and configure other parameters. For details, see :ref:`Accessing FusionInsight Manager Using an EIP `. + +Granting the Permission to Access MRS Manager to Other Users +------------------------------------------------------------ + +#. .. _mrs_01_0102__li1750491811399: + + On the MRS management console, choose **Clusters** > **Active Clusters**, and click the target cluster name to access the cluster details page. + +#. Click **Add Security Group Rule** on the right of **EIP**. + +#. On the **Add Security Group Rule** page, add the IP address segment for users to access the public network and select **I confirm that the authorized object is a trusted public IP address range. Do not use 0.0.0.0/0. Otherwise, security risks may arise**. + + By default, the IP address used for accessing the public network is filled. You can change the IP address segment as required. To enable multiple IP address segments, repeat steps :ref:`1 ` to :ref:`4 `. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + +#. .. _mrs_01_0102__li55051218183912: + + Click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0000001349137801.jpg +.. |image2| image:: /_static/images/en-us_image_0000001390878044.png diff --git a/umn/source/accessing_manager/index.rst b/umn/source/accessing_manager/index.rst new file mode 100644 index 0000000..50b9895 --- /dev/null +++ b/umn/source/accessing_manager/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_0128.html + +.. _mrs_01_0128: + +Accessing Manager +================= + +- :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) ` +- :ref:`Accessing MRS Manager (MRS 2.1.0 or Earlier) ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + accessing_fusioninsight_manager_mrs_3.x_or_later + accessing_mrs_manager_mrs_2.1.0_or_earlier diff --git a/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/access_through_direct_connect.rst b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/access_through_direct_connect.rst new file mode 100644 index 0000000..d81520f --- /dev/null +++ b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/access_through_direct_connect.rst @@ -0,0 +1,43 @@ +:original_name: mrs_01_0645.html + +.. _mrs_01_0645: + +Access Through Direct Connect +============================= + +MRS allows you to access MRS clusters using Direct Connect. Direct Connect is a high-speed, low-latency, stable, and secure dedicated network connection that connects your local data center to an online cloud VPC. It extends online cloud services and existing IT facilities to build a flexible, scalable hybrid cloud computing environment. + +Prerequisites +------------- + +Direct Connect is available, and the connection between the local data center and the online VPC has been established. + +Accessing an MRS Cluster Using Direct Connect +--------------------------------------------- + +#. Log in to the MRS console. + +#. Click the name of the cluster to enter its details page. + +#. On the **Dashboard** tab page of the cluster details page, click **Access Manager** next to **MRS Manager**. + +#. Set **Access Mode** to **Direct Connect** and select **I confirm that the network between the local PC and the floating IP address is connected and that MRS Manager is accessible using the Direct Connect connection**. + + The floating IP address is automatically allocated by MRS to access MRS Manager. Before using Direct Connect to access MRS Manager, ensure that the connection between the local data center and the online VPC has been established. + +#. Click **OK**. The MRS Manager login page is displayed. Enter the username **admin** and the password set during cluster creation. + +Switching the MRS Manager Access Mode +------------------------------------- + +To facilitate user operations, the browser cache records the selected Manager access mode. To change the access mode, perform the following steps: + +#. Log in to the MRS console. +#. Click the name of the cluster to enter its details page. +#. On the **Dashboard** tab page of the cluster details page, click |image1| next to **MRS Manager**. +#. On the displayed page, set **Access Mode**. + + - To change **EIP** to **Direct Connect**, ensure that the network for direct connections is available, set **Access Mode** to **Direct Connect**, and select **I confirm that the network between the local PC and the floating IP address is connected and that MRS Manager is accessible using the Direct Connect connection**. Click **OK**. + - To change **Direct Connect** to **EIP**, set **Access Mode** to **EIP** and configure the EIP by referring to :ref:`Accessing FusionInsight Manager Using an EIP `. If a public IP address has been configured for the cluster, click **OK** to access MRS Manager using an EIP. + +.. 
|image1| image:: /_static/images/en-us_image_0000001295738236.png diff --git a/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/access_using_a_windows_ecs.rst b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/access_using_a_windows_ecs.rst new file mode 100644 index 0000000..2654bc5 --- /dev/null +++ b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/access_using_a_windows_ecs.rst @@ -0,0 +1,76 @@ +:original_name: mrs_01_0647.html + +.. _mrs_01_0647: + +Access Using a Windows ECS +========================== + +MRS allows you to access the web UIs of open-source components through a Windows ECS. This method is complex and is recommended for MRS clusters that do not support the EIP function (security clusters with Kerberos authentication enabled in versions earlier than MRS 1.9.2). + +#. On the MRS management console, click **Clusters**. + +#. On the **Active Clusters** page, click the name of the specified cluster. + + On the cluster details page, record the **AZ**, **VPC**, **Cluster Manager IP Address**, and **Security Group** of the cluster. + + .. note:: + + To obtain the cluster manager IP address, remotely log in to the Master2 node, and run the **ifconfig** command. In the command output, **eth0:wsom** indicates the cluster manager IP address. Record the value of **inet**. If the cluster manager IP address cannot be queried on the Master2 node, switch to the Master1 node to query and record the cluster manager IP address. If there is only one Master node, query and record the cluster manager IP address of the Master node. + +#. On the ECS management console, create an ECS. + + - The **AZ**, **VPC**, and **Security Group** of the ECS must be the same as those of the cluster to be accessed. + - Select a Windows public image. For example, select the enterprise image **Enterprise_Windows_STD_2012R2_20170316-0(80GB)**. + - For details about other configuration parameters, see **Elastic Cloud Server > User Guide > Getting Started > Creating and Logging In to a Windows ECS**. + + .. note:: + + If the security group of the ECS is different from **Security Group** of the MRS cluster, you can modify the configuration using either of the following methods: + + - Change the security group of the ECS to the security group of the MRS cluster. For details, see **Elastic Cloud Server** > **User Guide** > **Security Group** > **Changing a Security Group**. + - Add two security group rules to the security groups of the Master and Core nodes to enable the ECS to access the cluster. Set **Protocol** to **TCP** and **ports** of the two security group rules to **28443** and **20009**, respectively. For details, see **Virtual Private Cloud** > **User Guide** > **Security** > **Security Group** > **Adding a Security Group Rule**. + +#. On the VPC management console, apply for an EIP and bind it to the ECS. + + For details, see **Virtual Private Cloud** > **User Guide** > **Elastic IP** > **Assigning an EIP and Binding It to an ECS**. + +#. Log in to the ECS. + + The Windows system account, password, EIP, and the security group rules are required for logging in to the ECS. For details, see **Elastic Cloud Server > User Guide > Instances > Logging In to a Windows ECS**. + +#. On the Windows remote desktop, use your browser to access Manager. + + For example, you can use Internet Explorer 11 in the Windows 2012 OS. 
+ + The MRS Manager access address is in the format of **https://Cluster Manager IP Address:28443/web**. Enter the name and password of the MRS cluster user, for example, user **admin**. + + .. note:: + + - To obtain the cluster manager IP address, remotely log in to the Master2 node, and run the **ifconfig** command. In the command output, **eth0:wsom** indicates the cluster manager IP address. Record the value of **inet**. If the cluster manager IP address cannot be queried on the Master2 node, switch to the Master1 node to query and record the cluster manager IP address. If there is only one Master node, query and record the cluster manager IP address of the Master node. + - If you access MRS Manager with other MRS cluster usernames, change the password upon your first access. The new password must meet the requirements of the current password complexity policies. + - By default, a user is locked after inputting an incorrect password five consecutive times. The user is automatically unlocked after 5 minutes. + +#. Visit the web UIs of the open-source components by referring to the addresses listed in :ref:`Web UIs of Open Source Components `. + +Related Tasks +------------- + +**Configuring the Mapping Between Cluster Node Names and IP Addresses** + +#. Log in to MRS Manager, and choose **Host Management**. + + Record the host names and management IP addresses of all nodes in the cluster. + +#. In the work environment, use Notepad to open the **hosts** file and add the mapping between node names and IP addresses to the file. + + Fill in one row for each mapping relationship, as shown in the following figure. + + .. code-block:: + + 192.168.4.127 node-core-Jh3ER + 192.168.4.225 node-master2-PaWVE + 192.168.4.19 node-core-mtZ81 + 192.168.4.33 node-master1-zbYN8 + 192.168.4.233 node-core-7KoGY + + Save the modifications. diff --git a/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/creating_an_ssh_channel_for_connecting_to_an_mrs_cluster_and_configuring_the_browser.rst b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/creating_an_ssh_channel_for_connecting_to_an_mrs_cluster_and_configuring_the_browser.rst new file mode 100644 index 0000000..79a979e --- /dev/null +++ b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/creating_an_ssh_channel_for_connecting_to_an_mrs_cluster_and_configuring_the_browser.rst @@ -0,0 +1,138 @@ +:original_name: mrs_01_0363.html + +.. _mrs_01_0363: + +Creating an SSH Channel for Connecting to an MRS Cluster and Configuring the Browser +==================================================================================== + +Scenario +-------- + +Users and an MRS cluster are in different networks. As a result, an SSH channel needs to be created to send users' requests for accessing websites to the MRS cluster and dynamically forward them to the target websites. + +The MAC system does not support this function. For details about how to access MRS, see :ref:`EIP-based Access `. + +Prerequisites +------------- + +- You have prepared an SSH client for creating the SSH channel, for example, the Git open-source SSH client. You have downloaded and installed the client. +- You have created a cluster and prepared a key file in PEM format. +- Users can access the Internet on the local PC. + +Procedure +--------- + +#. Log in to the MRS management console and choose **Clusters** > **Active Clusters**. + +#. Click the specified MRS cluster name. 
+ + Record the security group of the cluster. + +#. Add an inbound rule to the security group of the Master node to allow data access to the IP address of the MRS cluster through port **22**. + + For details, see **Virtual Private Cloud** > **User Guide** > **Security** > **Security Group** > **Adding a Security Group Rule**. + +#. Query the primary management node of the cluster. For details, see :ref:`Determining Active and Standby Management Nodes of Manager `. + +#. Bind an elastic IP address to the primary management node. + + For details, see **Virtual Private Cloud** > **User Guide** > **Elastic IP** > **Assigning an EIP and Binding It to an ECS**. + +#. Start Git Bash locally and run the following command to log in to the active management node of the cluster: **ssh root@\ Elastic IP address** or **ssh -i** **Path of the key file** **root@**\ **Elastic IP address**. + +#. Run the following command to view data forwarding configurations: + + **cat /etc/sysctl.conf \| grep net.ipv4.ip_forward** + + - If **net.ipv4.ip_forward=1** is displayed, the forwarding function has been configured. Go to :ref:`9 `. + + - If **net.ipv4.ip_forward=0** is displayed, the forwarding function has not been configured. Go to :ref:`8 `. + + - If **net.ipv4.ip_forward** fails to be queried, this parameter has not been configured. Run the following command and then go to :ref:`9 `: + + echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf + +#. .. _mrs_01_0363__l116ba5c37fe940bc8218c6e3989bfa2a: + + Modify forwarding configurations on the node. + + a. Run the following command to switch to user **root**: + + **sudo su - root** + + b. Run the following commands to modify forwarding configurations: + + **echo 1 > /proc/sys/net/ipv4/ip_forward** + + **sed -i "s/net.ipv4.ip_forward=0/net.ipv4.ip_forward = 1/g" /etc/sysctl.conf** + + **sysctl -w net.ipv4.ip_forward=1** + + c. Run the following command to modify the **sshd** configuration file: + + **vi /etc/ssh/sshd_config** + + Press **I** to enter the edit mode. Locate **AllowTcpForwarding** and **GatewayPorts** and delete comment tags. Modify them as follows. Save the changes and exit. + + .. code-block:: + + AllowTcpForwarding yes + GatewayPorts yes + + d. Run the following command to restart the sshd service: + + **service sshd restart** + +#. .. _mrs_01_0363__l443dc566475c459399c4e15787485276: + + Run the following command to view the floating IP address: + + **ifconfig** + + In the command output, **eth0:FI_HUE** indicates the floating IP address of Hue and **eth0:wsom** specifies the floating IP address of Manager. Record the value of **inet**. + + Run the **exit** command to exit. + +#. .. _mrs_01_0363__lcf3e5d4b24e645bdbb9a0100a0dca09d: + + Run the following command on the local PC to create an SSH channel supporting dynamic port forwarding: + + **ssh -i** **Path of the key file -v -ND** **Local port** **root@\ Elastic IP address** or **ssh -v -ND Local port root@Elastic IP address**. After running the command, enter the password you set when you create the cluster. + + In the command, set **Local port** to the user's local port that is not occupied. Port **8157** is recommended. + + After the SSH channel is created, add **-D** to the command and run the command to start the dynamic port forwarding function. By default, the dynamic port forwarding function enables a SOCKS proxy process and monitors the user's local port. Port data will be forwarded to the primary management node using the SSH channel. + +#. 
Run the following command to configure the browser proxy. + + a. Go to the Google Chrome client installation directory on the local PC. + + b. .. _mrs_01_0363__l2d621fbf73a04b28a135923e3a74a4f3: + + Press **Shift** and right-click the blank area, choose **Open Command Window Here** and enter the following command: + + **chrome --proxy-server="socks5://localhost:8157" --host-resolver-rules="MAP \* 0.0.0.0 , EXCLUDE localhost" --user-data-dir=c:/tmppath --proxy-bypass-list="*google*com,*gstatic.com,*gvt*.com,*:80"** + + .. note:: + + - In the preceding command, **8157** is the local proxy port configured in :ref:`10 `. + - If the local OS is Windows 10, start the Windows OS, click **Start** and enter **cmd**. In the displayed CLI, run the command in :ref:`11.b `. If this method fails, click **Start**, enter the command in the search box, and run the command in :ref:`11.b `. + +#. In the address box of the browser, enter the address for accessing Manager. + + Address format: **https://**\ *Floating IP address of FusionInsight Manager*\ **:28443/web** + + The username and password of the MRS cluster need to be entered for accessing clusters with Kerberos authentication enabled, for example, user **admin**. They are not required for accessing clusters with Kerberos authentication disabled. + + When accessing Manager for the first time, you must add the address to the trusted site list. + +#. Prepare the website access address. + + a. Obtain the website address format and the role instance according to :ref:`Web UIs `. + b. Click **Services**. + c. Click the specified service name, for example, HDFS. + d. Click **Instance** and view **Service IP Address** of **NameNode(Active)**. + +#. In the address bar of the browser, enter the website address to access it. + +#. When logging out of the website, terminate and close the SSH tunnel. diff --git a/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/eip-based_access.rst b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/eip-based_access.rst new file mode 100644 index 0000000..40ecba8 --- /dev/null +++ b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/eip-based_access.rst @@ -0,0 +1,29 @@ +:original_name: mrs_01_0646.html + +.. _mrs_01_0646: + +EIP-based Access +================ + +You can bind an EIP to a cluster to access the web UIs of the open-source components managed in the MRS cluster. This method is simple and easy to use and is recommended for accessing the web UIs of the open-source components. + +Binding an EIP to a Cluster and Adding a Security Group Rule +------------------------------------------------------------ + +#. On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users. After the IAM users are synchronized, the **Components** tab is available. +#. Click **Access Manager** on the right of **MRS Manager**. +#. The page for accessing MRS Manager is displayed. Bind an EIP and add a security group rule. Perform the following operations only when you access the web UIs of the open-source components of the cluster for the first time. + + a. Select an available EIP from the EIP drop-down list to bind it. If there is no available EIP, click **Manage EIP** to create an EIP. If an EIP has been bound during cluster creation, skip this step. + b. Select the security group to which the security group rule to be added belongs. The security group is configured when the group is created. + c. 
Add a security group rule. By default, your public IP address used for accessing port 9022 is filled in the rule. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + + .. note:: + + - It is normal that the automatically generated public IP address is different from the local IP address and no action is required. + - If port 9022 is a Knox port, you need to enable the permission of port 9022 to access Knox for accessing MRS components. + + d. Select the checkbox stating that **I confirm that xx.xx.xx.xx is a trusted public IP address and MRS Manager can be accessed using this IP address**. + e. Click **OK**. The login page is displayed. Enter the username **admin** and the password set during cluster creation. + +#. Log in to FusionInsight Manager and choose **Cluster** > **Services** > **HDFS**. On the displayed page, click **NameNode(**\ *Host name*\ **, active)** to access the HDFS web UI. The HDFS NameNode is used as an example. For details about the web UIs of other components, see :ref:`Web UIs of Open Source Components `. diff --git a/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/index.rst b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/index.rst new file mode 100644 index 0000000..09306e4 --- /dev/null +++ b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/index.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0644.html + +.. _mrs_01_0644: + +Accessing Web Pages of Open Source Components Managed in MRS Clusters +===================================================================== + +- :ref:`Web UIs of Open Source Components ` +- :ref:`List of Open Source Component Ports ` +- :ref:`Access Through Direct Connect ` +- :ref:`EIP-based Access ` +- :ref:`Access Using a Windows ECS ` +- :ref:`Creating an SSH Channel for Connecting to an MRS Cluster and Configuring the Browser ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + web_uis_of_open_source_components + list_of_open_source_component_ports + access_through_direct_connect + eip-based_access + access_using_a_windows_ecs + creating_an_ssh_channel_for_connecting_to_an_mrs_cluster_and_configuring_the_browser diff --git a/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/list_of_open_source_component_ports.rst b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/list_of_open_source_component_ports.rst new file mode 100644 index 0000000..d02f976 --- /dev/null +++ b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/list_of_open_source_component_ports.rst @@ -0,0 +1,664 @@ +:original_name: mrs_01_0504.html + +.. _mrs_01_0504: + +List of Open Source Component Ports +=================================== + +Common HBase Ports +------------------ + +The protocol type of all ports in the table is TCP (for MRS 1.6.3 or later). 
+ ++--------------------------------+------------------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (Versions Later than MRS 1.6.3) | | ++================================+========================+=================================+============================================================================================================================+ +| hbase.master.port | 21300 | 16000 | HMaster RPC port. This port is used to connect the HBase client to HMaster. | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------------+------------------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| hbase.master.info.port | 21301 | 16010 | HMaster HTTPS port. This port is used by the remote web client to connect to the HMaster UI. | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------------+------------------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| hbase.regionserver.port | 21302 | 16020 | RegoinServer (RS) RPC port. This port is used to connect the HBase client to RegionServer. | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------------+------------------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| hbase.regionserver.info.port | 21303 | 16030 | HTTPS port of the Region server. This port is used by the remote web client to connect to the RegionServer UI. | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------------+------------------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| hbase.thrift.info.port | 21304 | 9095 | Thrift Server listening port of Thrift Server | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Listening when the client is connected | +| | | | | +| | | | .. 
note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------------+------------------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| hbase.regionserver.thrift.port | 21305 | 9090 | Thrift Server listening port of RegionServer | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Listening when the client is connected to the RegionServer | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------------+------------------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| hbase.rest.info.port | 21308 | 8085 | Port of the RegionServer RESTServer native web page | ++--------------------------------+------------------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| ``-`` | 21309 | 21309 | REST port of RegionServer RESTServer | ++--------------------------------+------------------------+---------------------------------+----------------------------------------------------------------------------------------------------------------------------+ + +Common HDFS Ports +----------------- + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). + ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++============================+========================+=============================================+============================================================================================================================+ +| dfs.namenode.rpc.port | 25000 | - 9820 (versions earlier than MRS 3.\ *x*) | NameNode RPC port | +| | | - 8020 (MRS 3.\ *x* and later) | | +| | | | This port is used for: | +| | | | | +| | | | 1. Communication between the HDFS client and NameNode | +| | | | | +| | | | 2. Connection between the DataNode and NameNode | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. 
| +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.namenode.http.port | 25002 | 9870 | HDFS HTTP port (NameNode) | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | 1. Point-to-point NameNode checkpoint operations. | +| | | | | +| | | | 2. Connecting the remote web client to the NameNode UI | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.namenode.https.port | 25003 | 9871 | HDFS HTTPS port (NameNode) | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | 1. Point-to-point NameNode checkpoint operations | +| | | | | +| | | | 2. Connecting the remote web client to the NameNode UI | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.datanode.ipc.port | 25008 | 9867 | IPC server port of DataNode | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connection between the client and DataNode to perform RPC operations. | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.datanode.port | 25009 | 9866 | DataNode data transmission port | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | 1. Transmitting data from HDFS client from or to the DataNode | +| | | | | +| | | | 2. Point-to-point DataNode data transmission | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. 
| +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.datanode.http.port | 25010 | 9864 | DataNode HTTP port | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connecting to the DataNode from the remote web client in security mode | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.datanode.https.port | 25011 | 9865 | HTTPS port of DataNode | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connecting to the DataNode from the remote web client in security mode | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.JournalNode.rpc.port | 25012 | 8485 | RPC port of JournalNode | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Client communication to access multiple types of information | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.journalnode.http.port | 25013 | 8480 | JournalNode HTTP port | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connecting to the JournalNode from the remote web client in security mode | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. 
| +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.journalnode.https.port | 25014 | 8481 | HTTPS port of JournalNode | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connecting to the JournalNode from the remote web client in security mode | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| httpfs.http.port | 25018 | 14000 | Listening port of the HttpFS HTTP server | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connecting to the HttpFS from the remote REST API | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ + +Common Hive Ports +----------------- + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). 
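+
+For example, reachability of the HiveServer2 Thrift port listed in the table below can be checked with Beeline from a node where the Hive client is installed. This is only a sketch: the host name is a placeholder, and on a security cluster the JDBC URL additionally needs Kerberos options.
+
+.. code-block:: bash
+
+   # Hypothetical check: replace <hiveserver_host> with an actual HiveServer node.
+   # The default port is 10000 for MRS 1.7.0 or later (21066 for MRS 1.6.3 or earlier).
+   beeline -u "jdbc:hive2://<hiveserver_host>:10000/default"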
+ ++--------------------------+------------------------+----------------------+--------------------------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++==========================+========================+======================+====================================================================================================================+ +| templeton.port | 21055 | 9111 | Port used for WebHCat to provide the REST service | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Communication between the WebHCat client and WebHCat server | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------+------------------------+----------------------+--------------------------------------------------------------------------------------------------------------------+ +| hive.server2.thrift.port | 21066 | 10000 | Port for HiveServer to provide Thrift services | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Communication between the HiveServer and HiveServer client | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------+------------------------+----------------------+--------------------------------------------------------------------------------------------------------------------+ +| hive.metastore.port | 21088 | 9083 | Port for MetaStore to provide Thrift services | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Communication between the MetaStore client and MetaStore, that is, communication between HiveServer and MetaStore. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------+------------------------+----------------------+--------------------------------------------------------------------------------------------------------------------+ +| hive.server2.webui.port | ``-`` | 10002 | Web UI port of Hive | +| | | | | +| | | | This port is used for: HTTPS/HTTP communication between Web requests and the Hive UI server | +| | | | | +| | | | This port is supported in MRS 1.9.x or later. | ++--------------------------+------------------------+----------------------+--------------------------------------------------------------------------------------------------------------------+ + +Common Hue Ports +---------------- + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). + ++-----------------+------------------------+----------------------+--------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++=================+========================+======================+================================================================================+ +| HTTP_PORT | 21200 | 8888 | Port for Hue to provide HTTPS services | +| | | | | +| | | | This port is used to provide web services in HTTPS mode, which can be changed. 
| +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++-----------------+------------------------+----------------------+--------------------------------------------------------------------------------+ + +Common Kafka Ports +------------------ + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). + ++-----------------+------------------------+----------------------+-------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++=================+========================+======================+=================================================================================================+ +| port | 21005 | 9092 | Port for a broker to receive data and obtain services | ++-----------------+------------------------+----------------------+-------------------------------------------------------------------------------------------------+ +| ssl.port | 21008 | 9093 | SSL port used by a broker to receive data and obtain services | ++-----------------+------------------------+----------------------+-------------------------------------------------------------------------------------------------+ +| sasl.port | 21007 | 21007 | SASL security authentication port provided by a broker, which provides the secure Kafka service | ++-----------------+------------------------+----------------------+-------------------------------------------------------------------------------------------------+ +| sasl-ssl.port | 21009 | 21009 | Port used by a broker to provide encrypted service based on the SASL and SSL protocols | ++-----------------+------------------------+----------------------+-------------------------------------------------------------------------------------------------+ + +Common Loader Ports +------------------- + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). + ++-------------------+------------------------+----------------------+--------------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++===================+========================+======================+======================================================================================+ +| LOADER_HTTPS_PORT | 21351 | 21351 | This port is used to provide REST APIs for configuration and running of Loader jobs. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++-------------------+------------------------+----------------------+--------------------------------------------------------------------------------------+ + +Common Manager Ports +-------------------- + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). 
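+
+For example, whether the Manager HTTPS port in the table below is open can be checked with a plain HTTPS request from a host that can reach the cluster network. This is only a sketch: the floating IP address is a placeholder, and **-k** is used because the default certificate is self-signed.
+
+.. code-block:: bash
+
+   # Hypothetical check: replace <manager_floating_ip> with the floating IP address of Manager.
+   curl -k -I https://<manager_floating_ip>:28443/web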
+ ++-----------------+------------------------+----------------------------------------------------+----------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or later/Versions Earlier Than MRS 3.x) | | ++=================+========================+====================================================+================================================================+ +| ``-`` | 8080 | 8080 | Port provided by WebService for user access | +| | | | | +| | | | This port is used to access the web UI over HTTP. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++-----------------+------------------------+----------------------------------------------------+----------------------------------------------------------------+ +| ``-`` | 28443 | 28443 | Port provided by WebService for user access | +| | | | | +| | | | This port is used to access the web UI over HTTPS. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++-----------------+------------------------+----------------------------------------------------+----------------------------------------------------------------+ + +Common MapReduce Ports +---------------------- + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). + ++----------------------------------------+------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++========================================+========================+======================+============================================================================================================================+ +| mapreduce.jobhistory.webapp.port | 26012 | 19888 | Web HTTP port of the JobHistory server | +| | | | | +| | | | This port is used for: viewing the web page of the JobHistory server | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------------------+------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------+ +| mapreduce.jobhistory.port | 26013 | 10020 | Port of the JobHistory server | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | 1. Task data restoration in the MapReduce client | +| | | | | +| | | | 2. Obtaining task report in the Job client | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. 
| +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------------------+------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------+ +| mapreduce.jobhistory.webapp.https.port | 26014 | 19890 | Web HTTPS port of the JobHistory server | +| | | | | +| | | | This port is used to view the web page of the JobHistory server. | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------------------+------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------+ + +Common Spark Ports +------------------ + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). + ++--------------------------+------------------------+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++==========================+========================+======================+==================================================================================================================================================================================================================================================================+ +| hive.server2.thrift.port | 22550 | 22550 | JDBC thrift port | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Socket communication between Spark2.1.0 CLI/JDBC client and server | +| | | | | +| | | | .. note:: | +| | | | | +| | | | If **hive.server2.thrift.port** is occupied, an exception indicating that the port is occupied is reported. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------+------------------------+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| spark.ui.port | 22950 | 4040 | Web UI port of JDBC | +| | | | | +| | | | This port is used for: HTTPS/HTTP communication between Web requests and the JDBC Server Web UI server | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The system verifies the port configuration. If the port is invalid, the value of the port plus 1 is used till the calculated value is valid. (A maximum number of 16 attempts are allowed. The number of attempts is specified by **spark.port.maxRetries**.) 
| +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------+------------------------+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| spark.history.ui.port | 22500 | 18080 | JobHistory Web UI port | +| | | | | +| | | | This port is used for: HTTPS/HTTP communication between Web requests and Spark2.1.0 History Server | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The system verifies the port configuration. If the port is invalid, the value of the port plus 1 is used till the calculated value is valid. (A maximum number of 16 attempts are allowed. The number of attempts is specified by **spark.port.maxRetries**.) | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++--------------------------+------------------------+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Common Storm Ports +------------------ + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). + ++------------------------+------------------------+----------------------+---------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++========================+========================+======================+===========================================================================+ +| nimbus.thrift.port | 29200 | 6627 | Port for Nimbus to provide thrift services | ++------------------------+------------------------+----------------------+---------------------------------------------------------------------------+ +| supervisor.slots.ports | 29200-29499 | 6700,6701,6702,6703 | Port for receiving service requests that are forwarded from other servers | ++------------------------+------------------------+----------------------+---------------------------------------------------------------------------+ +| logviewer.https.port | 29248 | 29248 | Port for LogViewer to provide HTTPS services | ++------------------------+------------------------+----------------------+---------------------------------------------------------------------------+ +| ui.https.port | 29243 | 29243 | Port for Storm UI to provide HTTPS services (ui.https.port) | ++------------------------+------------------------+----------------------+---------------------------------------------------------------------------+ + +Common Yarn Ports +----------------- + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). 
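+
+For example, on a normal (non-security) cluster the ResourceManager web HTTP port in the table below also serves the YARN REST API, so a quick check can be made with curl. This is only a sketch: the host name is a placeholder, and a security cluster additionally requires authentication.
+
+.. code-block:: bash
+
+   # Hypothetical check: replace <resourcemanager_host> with the active ResourceManager node.
+   # The default port is 8088 for MRS 1.7.0 or later (26000 for MRS 1.6.3 or earlier).
+   curl http://<resourcemanager_host>:8088/ws/v1/cluster/info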
+ ++----------------------------------------+------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++========================================+========================+======================+============================================================================================================================+ +| yarn.resourcemanager.webapp.port | 26000 | 8088 | Web HTTP port of the ResourceManager service | ++----------------------------------------+------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------+ +| yarn.resourcemanager.webapp.https.port | 26001 | 8090 | Web HTTPS port of the ResourceManager service | +| | | | | +| | | | This port is used to access the Resource Manager web applications in security mode. | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------------------+------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------+ +| yarn.nodemanager.webapp.port | 26006 | 8042 | NodeManager Web HTTP port | ++----------------------------------------+------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------+ +| yarn.nodemanager.webapp.https.port | 26010 | 8044 | NodeManager Web HTTPS port | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Accessing the NodeManager web application in security mode | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------------------+------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------+ + +Common ZooKeeper Ports +---------------------- + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). + ++-----------------+------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++=================+========================+======================+============================================================================================================================+ +| clientPort | 24002 | 2181 | ZooKeeper client port | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connection between the ZooKeeper client and server. | +| | | | | +| | | | .. 
note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++-----------------+------------------------+----------------------+----------------------------------------------------------------------------------------------------------------------------+ + +Common Kerberos Ports +--------------------- + +The protocol type of all ports in the table is UDP (for MRS 1.7.0 or later). + ++-----------------+------------------------+----------------------+------------------------------------------------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++=================+========================+======================+==========================================================================================================================================+ +| kdc_ports | 21732 | 21732 | Kerberos server port | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Performing Kerberos authentication for components. This parameter may be used during the configuration of mutual trust between clusters. | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++-----------------+------------------------+----------------------+------------------------------------------------------------------------------------------------------------------------------------------+ + +Common OpenTSDB Ports +--------------------- + +The protocol type of the port in the table is TCP. + ++-----------------------+-----------------------+-------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Port Description | ++=======================+=======================+=================================================================================================+ +| tsd.network.port | 4242 | Web UI port of OpenTSDB | +| | | | +| | | This port is used for: HTTPS/HTTP communication between web requests and the OpenTSDB UI server | ++-----------------------+-----------------------+-------------------------------------------------------------------------------------------------+ + +Common Tez Ports +---------------- + +The protocol type of the port in the table is TCP. + +=========== ============ ================== +Parameter Default Port Port Description +=========== ============ ================== +tez.ui.port 28888 Web UI port of Tez +=========== ============ ================== + +Common KafkaManager Ports +------------------------- + +The protocol type of the port in the table is TCP. + +================== ============ =========================== +Parameter Default Port Port Description +================== ============ =========================== +kafka_manager_port 9099 Web UI port of KafkaManager +================== ============ =========================== + +Common Presto Ports +------------------- + +The protocol type of the port in the table is TCP. 
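+
+For example, the open-source Presto coordinator exposes a status endpoint on its HTTP port, which can be used to confirm that the coordinator port in the table below is reachable. This is only a sketch: the host name is a placeholder.
+
+.. code-block:: bash
+
+   # Hypothetical check: replace <coordinator_host> with the Presto coordinator node.
+   curl http://<coordinator_host>:7520/v1/info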
+ ++------------------------+--------------+---------------------------------------------------------------------------+ +| Parameter | Default Port | Port Description | ++========================+==============+===========================================================================+ +| http-server.http.port | 7520 | HTTP port for Presto coordinator to provide services to external systems | ++------------------------+--------------+---------------------------------------------------------------------------+ +| http-server.https.port | 7521 | HTTPS port for Presto coordinator to provide services to external systems | ++------------------------+--------------+---------------------------------------------------------------------------+ +| http-server.http.port | 7530 | HTTP port for Presto worker to provide services to external systems | ++------------------------+--------------+---------------------------------------------------------------------------+ +| http-server.https.port | 7531 | HTTPS port for Presto worker to provide services to external systems | ++------------------------+--------------+---------------------------------------------------------------------------+ + +Common Flink Ports +------------------ + +The protocol type of the port in the table is TCP. + ++-----------------------+-----------------------+--------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Port Description | ++=======================+=======================+==================================================================================================+ +| jobmanager.web.port | 32261-32325 | Web UI port of Flink | +| | | | +| | | This port is used for: HTTP/HTTPS communication between the client web requests and Flink server | ++-----------------------+-----------------------+--------------------------------------------------------------------------------------------------+ + +Common ClickHouse Ports +----------------------- + +The protocol type of the port in the table is TCP. + ++-----------------+--------------+------------------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Port Description | ++=================+==============+============================================================================================================+ +| tcp_port | 9000 | TCP port for accessing the service client | ++-----------------+--------------+------------------------------------------------------------------------------------------------------------+ +| http_port | 8123 | HTTP port for accessing the service client | ++-----------------+--------------+------------------------------------------------------------------------------------------------------------+ +| https_port | 8443 | HTTPS port for accessing the service client | ++-----------------+--------------+------------------------------------------------------------------------------------------------------------+ +| tcp_port_secure | 9440 | TCP With SSL port for accessing the service client. This port is enabled only in security mode by default. | ++-----------------+--------------+------------------------------------------------------------------------------------------------------------+ + +Common Impala Ports +------------------- + +The protocol type of the port in the table is TCP. 
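+
+For example, the impala-shell port in the table below can be tested directly from a node where the Impala client is installed. This is only a sketch: the host name is a placeholder, and a security cluster requires additional Kerberos or LDAP options.
+
+.. code-block:: bash
+
+   # Hypothetical check: replace <impalad_host> with a node running an Impalad instance.
+   impala-shell -i <impalad_host>:21000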
+ ++-----------------+--------------+------------------------------------------------------------------------------+ +| Parameter | Default Port | Port Description | ++=================+==============+==============================================================================+ +| --beeswax_port | 21000 | Port for impala-shell communication | ++-----------------+--------------+------------------------------------------------------------------------------+ +| --hs2_port | 21050 | Port for Impala application communication | ++-----------------+--------------+------------------------------------------------------------------------------+ +| --hs2_http_port | 28000 | Port used by Impala to provide the HiveServer2 protocol for external systems | ++-----------------+--------------+------------------------------------------------------------------------------+ diff --git a/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/web_uis_of_open_source_components.rst b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/web_uis_of_open_source_components.rst new file mode 100644 index 0000000..d72383a --- /dev/null +++ b/umn/source/accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/web_uis_of_open_source_components.rst @@ -0,0 +1,158 @@ +:original_name: mrs_01_0362.html + +.. _mrs_01_0362: + +Web UIs of Open Source Components +================================= + +Scenario +-------- + +Web UIs of different components are created and hosted on the Master or Core nodes in the MRS cluster by default. You can view information about the components on these web UIs. + +Procedure for accessing the web UIs of open-source component: + +#. Select an access method. + + MRS provides the following methods for accessing the web UIs of open-source components: + + - :ref:`EIP-based Access `: This method is recommended because it is easy to bind an EIP to a cluster. Versions later than MRS 1.7.2 supports this method. + - :ref:`Access Using a Windows ECS `: Independent ECSs need to be created and configured. + - :ref:`Creating an SSH Channel for Connecting to an MRS Cluster and Configuring the Browser `: Use this method when the user and the MRS cluster are on different networks. + +#. Access the web UIs. For details, see :ref:`Table 1 `. + +.. _mrs_01_0362__sd893f53bb0b2400a8fe79f43dd2b7cf8: + +Web UIs +------- + +.. note:: + + For clusters with Kerberos authentication enabled, user **admin** does not have the management permission on each component. To access the web UI of each component, create a user who has the management permission on the corresponding component. + +.. _mrs_01_0362__td1aa324b6c0543c786768b6ea4e7d46d: + +.. 
table:: **Table 1** Web UI addresses of open-source components + + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Type | Web UI Type | Web UI Address | + +===========================+============================+=====================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | All Types | Manager | - Applicable to clusters of all versions | + | | | | + | | | **https://**\ *Floating IP address of Manager*\ **:28443/web** | + | | | | + | | | .. note:: | + | | | | + | | | #. Ensure that the local host can communicate with the MRS cluster. | + | | | #. Log in to the Master2 node remotely, and run the **ifconfig** command. In the command output, **eth0:wsom** indicates the floating IP address of MRS Manager. Record the value of **inet**. If the floating IP address of MRS Manager cannot be queried on the Master2 node, switch to the Master1 node to query and record the floating IP address. If there is only one Master node, query and record the cluster manager IP address of the Master node. | + | | | | + | | | - For MRS 1.8.0 or later to MRS 2.1.0 | + | | | | + | | | **https://:9022/mrsmanager?locale=en-us** | + | | | | + | | | For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. | + | | | | + | | | - For MRS 3.\ *x* or later, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. | + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Analysis cluster | HDFS NameNode | - MRS 1.6.3 or earlier | + | | | | + | | | - Normal cluster: **http://**\ *IP address of the active NameNode role instance*\ **:25002/dfshealth.html#tab-overview** | + | | | - Security cluster: On MRS Manager, choose **Services** > **HDFS** > **NameNode Web UI** > **NameNode (Active)**. | + | | | | + | | | - Versions later than MRS 1.6.3 and earlier than MRS 1.8.0 | + | | | | + | | | - Normal cluster: **http://IP address of the active NameNode role instance:9870/dfshealth.html#tab-overview** | + | | | - Security cluster: On the Manager homepage, choose **Services** > **HDFS** > **NameNode Web UI** > **NameNode (Active)**. | + | | | | + | | | - MRS 1.9.2 to versions earlier than MRS 3.x: On the cluster details page, choose **Components** > **HDFS** > **NameNode Web UI** > **NameNode (Active)**. 
| + | | | - MRS 3.x: On the Manager homepage, choose **Cluster** > **Services** > **HDFS** > **NameNode Web UI** > **NameNode (Active)**. | + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | HBase HMaster | - MRS 1.6.3 or earlier | + | | | | + | | | - Normal cluster: **https://IP address of the active HMaster role instance:21301/master-status** | + | | | - Security cluster: On MRS Manager, choose **Services** > **HBase** > **HMaster Web UI** > **HMaster (Active)**. | + | | | | + | | | - Versions later than MRS 1.6.3 and earlier than MRS 1.8.0 | + | | | | + | | | - Normal cluster: **https://IP address of the active HMaster role instance:16010/master-status** | + | | | - Security cluster: On Manager, choose **Services** > **HBase** > **HMaster Web UI** > **HMaster (Active)**. | + | | | | + | | | - MRS 1.9.2 to versions earlier than MRS 3.x: On the cluster details page, choose **Components** > **HBase** > **HMaster Web UI** > **HMaster (Active)**. | + | | | - MRS 3.x: On the Manager homepage, choose **Cluster** > **Services** > **HBase** > **HMaster Web UI** > **HMaster (Active)**. | + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | MapReduce JobHistoryServer | - MRS 1.6.3 or earlier | + | | | | + | | | - Normal cluster: **http://IP address of the JobHistoryServer role instance:26012/jobhistory** | + | | | - Security cluster: On the Manager homepage, choose **Services** > **MapReduce** > **JobHistoryServer Web UI** > **JobHistoryServer**. | + | | | | + | | | - Versions later than MRS 1.6.3 and earlier than MRS 1.8.0 | + | | | | + | | | - Normal cluster: **http://IP address of the JobHistoryServer role instance:19888/jobhistory** | + | | | - Security cluster: On the Manager homepage, choose **Services** > **MapReduce** > **JobHistoryServer Web UI** > **JobHistoryServer**. | + | | | | + | | | - MRS 1.9.2 to versions earlier than MRS 3.x: On the cluster details page, choose **Components** > **MapReduce** > **JobHistoryServer Web UI** > **JobHistoryServer**. | + | | | - MRS 3.x: On the Manager homepage, choose **Cluster** > **Services** > **MapReduce** > **JobHistoryServer Web UI** > **JobHistoryServer**. 
| + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | YARN ResourceManager | - MRS 1.6.3 or earlier | + | | | | + | | | - Normal cluster: **http://IP address of the active ResourceManager role instance:26000/cluster** | + | | | - Security cluster: On the Manager homepage, choose **Services** > **Yarn** > **ResourceManager Web UI** > **ResourceManager (Active)**. | + | | | | + | | | - Versions later than MRS 1.6.3 and earlier than MRS 1.8.0 | + | | | | + | | | - Normal cluster: **http://IP address of the active ResourceManager role instance:8088/cluster** | + | | | - Security cluster: On the Manager homepage, choose **Services** > **Yarn** > **ResourceManager Web UI** > **ResourceManager (Active)**. | + | | | | + | | | - MRS 1.9.2 to versions earlier than MRS 3.x: On the cluster details page, choose **Components** > **Yarn** > **ResourceManager Web UI** > **ResourceManager (Active)**. | + | | | - MRS 3.x: On the Manager homepage, choose **Cluster** > **Services** > **Yarn** > **ResourceManager Web UI** > **ResourceManager (Active)**. | + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Spark JobHistory | - MRS 1.6.3 or earlier | + | | | | + | | | - Normal cluster: **http://IP address of the JobHistory role instance:22500/** | + | | | - Security cluster: On the Manager homepage, choose **Services** > **Spark** > **Spark Web UI** > **JobHistory**. | + | | | | + | | | - Versions later than MRS 1.6.3 and earlier than MRS 1.8.0 | + | | | | + | | | - Normal cluster: **http://**\ *IP address of the JobHistory role instance*\ **:18080/** | + | | | - Security cluster: On the Manager homepage, choose **Services** > **Spark** > **Spark Web UI** > **JobHistory**. | + | | | | + | | | - MRS 1.9.2 to versions earlier than MRS 3.x: On the cluster details page, choose **Components** > **Spark** > **Spark Web UI** > **JobHistory**. | + | | | - MRS 3.x or later: On the Manager homepage, choose **Cluster** > **Service** > **Spark2x** > **Spark2x WebUI** > **JobHistory**. 
| + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Hue | - MRS 1.6.3 or earlier | + | | | | + | | | - Normal cluster: **https://**\ *Floating IP address of Hue*\ **:21200** | + | | | - Security cluster: On MRS Manager, choose **Services** > **Hue** > **Hue Web UI** > **Hue (Active)**. | + | | | | + | | | - Versions later than MRS 1.6.3 and earlier than MRS 1.8.0 | + | | | | + | | | - Normal cluster: **https://Floating IP address of Hue:8888** | + | | | - Security cluster: On the Manager homepage, choose **Services** > **Hue** > **Hue Web UI** > **Hue (Active)**. | + | | | | + | | | - MRS 1.9.2 to versions earlier than MRS 3.x: On the cluster details page, choose **Components** > **Hue** > **Hue Web UI** > **Hue (Active)**. | + | | | - On MRS 3.x: On the Manager homepage, choose **Cluster** > **Services** > **Hue** > **Hue WebUI** > **Hue** **(**\ *Node name*, **Active)**. | + | | | | + | | | Loader is a graphical data migration management tool based on the open-source Sqoop web UI, and its interface is hosted on the Hue web UI. | + | | | | + | | | .. note:: | + | | | | + | | | Log in to the Master2 node remotely, and run the **ifconfig** command. In the command output, **eth0:FI_HUE** indicates the floating IP address of Hue. Record the value of **inet**. If the floating IP address of Hue cannot be queried on the Master2 node, switch to the Master1 node to query and record the floating IP address. If there is only one Master node, query and record the floating IP address of the Master node. | + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Tez | Versions earlier than MRS 3.x: On the cluster details page, choose **Components** > **Tez** > **Tez WebUI** > **TezUI**. | + | | | | + | | | MRS 3.\ *x* or later: On the Manager homepage, choose **Cluster** > **Services** > **Tez** > **Tez WebUI** > **TezUI**. | + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Presto | - MRS 1.9.2 or earlier: On the Manager homepage, choose **Services** > **Presto** > **Presto Web UI** > **Coordinator (Active)**. 
| + | | | - MRS 1.9.2 to MRS 2.1.0: On the cluster details page, choose **Components** > **Presto** > **Presto Web UI** > **Coordinator (Active)**. | + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Stream processing cluster | Storm | - Versions earlier than MRS 1.8.0 | + | | | | + | | | - Normal cluster: **http://**\ *IP address of any UI role instance*\ **:29280/index.html** | + | | | - Security cluster: On MRS Manager, choose **Services** > **Storm** > **Storm WebUI** > **UI**. | + | | | | + | | | - MRS 1.9.2 to MRS 3.x: On the cluster details page, choose **Components** > **Storm** > **Storm Web UI** > **UI**. | + | | | - MRS 3.x: On the Manager homepage, choose **Cluster** > **Services** > **Storm** > **Storm WebUI** > **UI**. | + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | KafkaManager | On MRS Manager, choose **Services** > **KafkaManager** > **KafkaManager Web UI** > **KafkaManager**. | + +---------------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/appendix/index.rst b/umn/source/appendix/index.rst new file mode 100644 index 0000000..c1fe1ce --- /dev/null +++ b/umn/source/appendix/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_9002.html + +.. _mrs_01_9002: + +Appendix +======== + +- :ref:`Precautions for MRS 3.x ` +- :ref:`Installing the Flume Client ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + precautions_for_mrs_3.x + installing_the_flume_client/index diff --git a/umn/source/appendix/installing_the_flume_client/index.rst b/umn/source/appendix/installing_the_flume_client/index.rst new file mode 100644 index 0000000..0ed878c --- /dev/null +++ b/umn/source/appendix/installing_the_flume_client/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_0392.html + +.. _mrs_01_0392: + +Installing the Flume Client +=========================== + +- :ref:`Installing the Flume Client on Clusters of Versions Earlier Than MRS 3.x ` +- :ref:`Installing the Flume Client on MRS 3.x or Later Clusters ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + installing_the_flume_client_on_clusters_of_versions_earlier_than_mrs_3.x + installing_the_flume_client_on_mrs_3.x_or_later_clusters diff --git a/umn/source/appendix/installing_the_flume_client/installing_the_flume_client_on_clusters_of_versions_earlier_than_mrs_3.x.rst b/umn/source/appendix/installing_the_flume_client/installing_the_flume_client_on_clusters_of_versions_earlier_than_mrs_3.x.rst new file mode 100644 index 0000000..3a33a7a --- /dev/null +++ b/umn/source/appendix/installing_the_flume_client/installing_the_flume_client_on_clusters_of_versions_earlier_than_mrs_3.x.rst @@ -0,0 +1,142 @@ +:original_name: mrs_01_1594.html + +.. _mrs_01_1594: + +Installing the Flume Client on Clusters of Versions Earlier Than MRS 3.x +======================================================================== + +Scenario +-------- + +To use Flume to collect logs, you must install the Flume client on a log host. You can create an ECS and install the Flume client on it. + +This section applies to MRS 3.\ *x* or earlier clusters. + +Prerequisites +------------- + +- A streaming cluster with the Flume component has been created. +- The log host is in the same VPC and subnet with the MRS cluster. +- You have obtained the username and password for logging in to the log host. + +Procedure +--------- + +#. Create an ECS that meets the requirements. + +#. Go to the cluster details page. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager and choose **Services**. + - For MRS 1.9.2 or later, click the cluster name on the MRS console and choose **Components**. + +#. .. _mrs_01_1594__en-us_topic_0265836944_li1514145518420: + + Click **Download Client**. + + a. In **Client Type**, select **All client files**. + + b. In **Download to**, select **Remote host**. + + c. Set **Host IP Address** to the IP address of the ECS, **Host Port** to **22**, and **Save Path** to **/home/linux**. + + - If the default port **22** for logging in to an ECS through SSH has been changed, set **Host Port** to a new port. + - The value of **Save Path** contains a maximum of 256 characters. + + d. Set **Login User** to **root**. + + If another user is used, ensure that the user has permissions to read, write, and execute the save path. + + e. In **SSH Private Key**, select and upload the key file used for creating the cluster. + + f. Click **OK** to generate a client file. + + If the following information is displayed, the client package is saved. + + .. code-block:: text + + Client files downloaded to the remote host successfully. + + If the following information is displayed, check the username, password, and security group configurations of the remote host. Ensure that the username and password are correct and an inbound rule of the SSH (22) port has been added to the security group of the remote host. And then, go to :ref:`3 ` to download the client again. + + .. code-block:: text + + Failed to connect to the server. Please check the network connection or parameter settings. + +#. Choose **Flume** > **Instance**. Query the **Business IP Address** of any Flume instance and any two MonitorServer instances. + +#. Log in to the ECS using VNC. See section "Login Using VNC" in the *Elastic Cloud Service User Guide* (**Instances** > **Logging In to a Linux ECS** > **Login Using VNC**. + + Log in to the ECS using an SSH key by referring to `Login Using an SSH Key `__ and set the password. Then log in to the ECS using VNC. + +#. 
On the ECS, switch to user **root** and copy the installation package to the **/opt** directory. + + **sudo su - root** + + **cp /home/linux/MRS_Flume_Client.tar /opt** + +#. Run the following command in the **/opt** directory to decompress the package and obtain the verification file and the configuration package of the client: + + **tar -xvf MRS_Flume_Client.tar** + +#. Run the following command to verify the configuration package of the client: + + **sha256sum -c MRS_Flume_ClientConfig.tar.sha256** + + If the following information is displayed, the file package is successfully verified: + + .. code-block:: + + MRS_Flume_ClientConfig.tar: OK + +#. Run the following command to decompress **MRS_Flume_ClientConfig.tar**: + + **tar -xvf MRS_Flume_ClientConfig.tar** + +#. Run the following command to install the client running environment to a new directory, for example, **/opt/Flumeenv**. A directory is automatically generated during the client installation. + + **sh /opt/MRS_Flume_ClientConfig/install.sh /opt/Flumeenv** + + If the following information is displayed, the client running environment is successfully installed: + + .. code-block:: + + Components client installation is complete. + +#. Run the following command to configure environment variables: + + **source /opt/Flumeenv/bigdata_env** + +#. Run the following commands to decompress the Flume client package: + + **cd /opt/MRS_Flume_ClientConfig/Flume** + + **tar -xvf FusionInsight-Flume-1.6.0.tar.gz** + +#. Run the following command to check whether the password of the current user has expired: + + **chage -l root** + + If the value of **Password expires** is earlier than the current time, the password has expired. Run the **chage -M -1 root** command to validate the password. + +#. Run the following command to install the Flume client to a new directory, for example, **/opt/FlumeClient**. A directory is automatically generated during the client installation. + + **sh /opt/MRS_Flume_ClientConfig/Flume/install.sh -d /opt/FlumeClient -f** *service IP address of the MonitorServer instance* **-c** *path of the Flume configuration file* **-l /var/log/ -e** *service IP address of Flume* **-n** *name of the Flume client* + + The parameters are described as follows: + + - **-d**: indicates the installation path of the Flume client. + - (Optional) **-f**: indicates the service IP addresses of the two MonitorServer instances, separated by a comma (,). If the IP addresses are not configured, the Flume client will not send alarm information to MonitorServer, and the client information will not be displayed on MRS Manager. + - (Optional) **-c**: indicates the **properties.properties** configuration file that the Flume client loads after installation. If this parameter is not specified, the **fusioninsight-flume-1.6.0/conf/properties.properties** file in the client installation directory is used by default. The configuration file of the client is empty. You can modify the configuration file as required and the Flume client will load it automatically. + - (Optional) **-l**: indicates the log directory. The default value is **/var/log/Bigdata**. + - (Optional) **-e**: indicates the service IP address of the Flume instance. It is used to receive the monitoring indicators reported by the client. + - (Optional) **-n**: indicates the name of the Flume client. + - IBM JDK does not support **-Xloggc**. You must change **-Xloggc** to **-Xverbosegclog** in **flume/conf/flume-env.sh**. For 32-bit JDK, the value of **-Xmx** must not exceed 3.25 GB. 
+ - In **flume/conf/flume-env.sh**, the default value of **-Xmx** is 4 GB. If the client memory is too small, you can change it to 512 MB or even 1 GB. + + For example, run **sh install.sh -d /opt/FlumeClient**. + + If the following information is displayed, the client is successfully installed: + + .. code-block:: + + install flume client successfully. diff --git a/umn/source/appendix/installing_the_flume_client/installing_the_flume_client_on_mrs_3.x_or_later_clusters.rst b/umn/source/appendix/installing_the_flume_client/installing_the_flume_client_on_mrs_3.x_or_later_clusters.rst new file mode 100644 index 0000000..bae5b57 --- /dev/null +++ b/umn/source/appendix/installing_the_flume_client/installing_the_flume_client_on_mrs_3.x_or_later_clusters.rst @@ -0,0 +1,92 @@ +:original_name: mrs_01_1595.html + +.. _mrs_01_1595: + +Installing the Flume Client on MRS 3.\ *x* or Later Clusters +============================================================ + +Scenario +-------- + +To use Flume to collect logs, you must install the Flume client on a log host. You can create an ECS and install the Flume client on it. + +This section applies to MRS 3.\ *x* or later clusters. + +Prerequisites +------------- + +- A cluster with the Flume component has been created. +- The log host is in the same VPC and subnet with the MRS cluster. +- You have obtained the username and password for logging in to the log host. +- The installation directory is automatically created if it does not exist. If it exists, the directory must be left blank. The directory path cannot contain any space. + +Procedure +--------- + +#. Obtain the software package. + + Log in to the FusionInsight Manager. Choose **Cluster** > *Name of the target cluster* > **Services** > **Flume**. On the Flume service page that is displayed, choose **More** > **Download Client** in the upper right corner and set **Select Client Type** to **Complete Client** to download the Flume service client file. + + The file name of the client is **FusionInsight_Cluster\_**\ <*Cluster ID*>\ **\_Flume_Client.tar**. This section takes the client file **FusionInsight_Cluster_1_Flume_Client.tar** as an example. + +#. Upload the software package. + + Upload the software package to a directory, for example, **/opt/client** on the node where the Flume service client will be installed as user **user**. + + .. note:: + + **user** is the user who installs and runs the Flume client. + +#. Decompress the software package. + + Log in to the node where the Flume service client is to be installed as user **user**. Go to the directory where the installation package is installed, for example, **/opt/client**, and run the following command to decompress the installation package to the current directory: + + **cd /opt/client** + + **tar -xvf FusionInsight\_Cluster_1_Flume_Client.tar** + +#. Verify the software package. + + Run the **sha256sum -c** command to verify the decompressed file. If **OK** is returned, the verification is successful. Example: + + **sha256sum -c FusionInsight\_Cluster_1_Flume_ClientConfig.tar.sha256** + + .. code-block:: + + FusionInsight_Cluster_1_Flume_ClientConfig.tar: OK + +#. Decompress the package. + + **tar -xvf FusionInsight\_Cluster_1_Flume_ClientConfig.tar** + +#. Run the following command in the Flume client installation directory to install the client to a specified directory (for example, **opt/FlumeClient**): After the client is installed successfully, the installation is complete. 
+ + **cd /opt/client/FusionInsight\_Cluster_1_Flume_ClientConfig/Flume/FlumeClient** + + **./install.sh -d /**\ *opt/FlumeClient* **-f** *MonitorServerService IP address or host name of the role* **-c** *User service configuration filePath for storing properties.properties* **-s** *CPU threshold* **-l /var/log/Bigdata -e** *FlumeServer service IP address or host name* **-n** *Flume* + + .. note:: + + - **-d**: Flume client installation path + + - (Optional) **-f**: IP addresses or host names of two MonitorServer roles. The IP addresses or host names are separated by commas (,). If this parameter is not configured, the Flume client does not send alarm information to MonitorServer and information about the client cannot be viewed on the FusionInsight Manager GUI. + + - (Optional) **-c**: Service configuration file, which needs to be generated on the configuration tool page of the Flume server based on your service requirements. Upload the file to any directory on the node where the client is to be installed. If this parameter is not specified during the installation, you can upload the generated service configuration file **properties.properties** to the **/opt/FlumeClient/fusioninsight-flume-1.9.0/conf** directory after the installation. + + - (Optional) **-s**: cgroup threshold. The value is an integer ranging from 1 to 100 x *N*. *N* indicates the number of CPU cores. The default threshold is **-1**, indicating that the processes added to the cgroup are not restricted by the CPU usage. + + - (Optional) **-l**: Log path. The default value is **/var/log/Bigdata**. The user **user** must have the write permission on the directory. When the client is installed for the first time, a subdirectory named **flume-client** is generated. After the installation, subdirectories named **flume-client-**\ *n* will be generated in sequence. The letter *n* indicates a sequence number, which starts from 1 in ascending order. In the **/conf/** directory of the Flume client installation directory, modify the **ENV_VARS** file and search for the **FLUME_LOG_DIR** attribute to view the client log path. + + - (Optional) **-e**: Service IP address or host name of FlumeServer, which is used to receive statistics for the monitoring indicator reported by the client. + + - (Optional) **-n**: Name of the Flume client. You can choose **Cluster** > *Name of the desired cluster* > **Service** > **Flume** > **Flume Management** on FusionInsight Manager to view the client name on the corresponding node. + + - If the following error message is displayed, run the **export JAVA_HOME=\ JDK path** command. + + .. code-block:: + + JAVA_HOME is null in current user,please install the JDK and set the JAVA_HOME + + - IBM JDK does not support **-Xloggc**. You must change **-Xloggc** to **-Xverbosegclog** in **flume/conf/flume-env.sh**. For 32-bit JDK, the value of **-Xmx** must not exceed 3.25 GB. + + - When installing a cross-platform client in a cluster, go to the **/opt/client/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FusionInsight-Flume-1.9.0.tar.gz** directory to install the Flume client. diff --git a/umn/source/appendix/precautions_for_mrs_3.x.rst b/umn/source/appendix/precautions_for_mrs_3.x.rst new file mode 100644 index 0000000..7145b08 --- /dev/null +++ b/umn/source/appendix/precautions_for_mrs_3.x.rst @@ -0,0 +1,64 @@ +:original_name: mrs_01_0614.html + +.. 
_mrs_01_0614: + +Precautions for MRS 3.x +======================= + +Purpose +------- + +Clusters of versions earlier than MRS 3.x use MRS Manager to manage and monitor MRS clusters. On the Cluster Management page of the MRS management console, you can view cluster details, manage nodes, components, alarms, patches, files, jobs, tenants, and backup and restoration. In addition, you can configure Bootstrap actions and manage tags. + +MRS 3.x uses FusionInsight Manager to manage and monitor clusters. On the Cluster Management page of the MRS management console, you can view cluster details, manage nodes, components, alarms, files, jobs, Bootstrap actions, and tags. + +Some maintenance operations of the MRS 3.x cluster are different from those of earlier versions. For details, see :ref:`MRS Manager Operation Guide (Applicable to 2.x and Earlier Versions) ` and :ref:`FusionInsight Manager Operation Guide (Applicable to 3.x) `. + +Accessing MRS Manager +--------------------- + +- For details about how to access MRS Manager of versions earlier than MRS 3.x, see :ref:`Accessing MRS Manager (MRS 2.1.0 or Earlier) `. +- For details about how to access FusionInsight Manager of MRS 3.x, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. + +Modifying MRS Cluster Service Configuration Parameters +------------------------------------------------------ + +- For versions earlier than MRS 3.x, you can modify service configuration parameters on the cluster management page of the MRS management console. + + #. Log in to the MRS console. In the left navigation pane, choose **Clusters** > **Active Clusters**, and click a cluster name. + + #. Choose **Components** > *Name of the desired service* > **Service Configuration**. + + The **Basic Configurations** tab page is displayed by default. To modify more parameters, click the **All Configurations** tab. The navigation tree displays all configuration parameters of the service. The level-1 nodes in the navigation tree are service names or role names. The parameter category is displayed after the level-1 node is expanded. + + #. In the navigation tree, select the specified parameter category and change the parameter values on the right. + + If you are not sure about the location of a parameter, you can enter the parameter name in the search box in the upper right corner. The system searches for the parameter in real time and displays the result. + + #. Click **Save Configuration**. In the displayed dialog box, click **OK**. + + #. Wait until the message "Operation succeeded" is displayed. Click **Finish**. The configuration is modified. + + Check whether there is any service whose configuration has expired in the cluster. If yes, restart the corresponding service or role instance for the configuration to take effect. You can also select **Restart the affected services or instances** when saving the configuration. + +- In MRS 3.x, you need to log in to FusionInsight Manager to modify service configuration parameters. + + #. Log in to FusionInsight Manager. + + #. Choose **Cluster** > **Services**. + + #. Click the specified service name on the service management page. + + #. Click **Configurations**. + + The **Basic Configurations** tab page is displayed by default. To modify more parameters, click the **All Configurations** tab. The navigation tree displays all configuration parameters of the service. The level-1 nodes in the navigation tree are service names or role names. The parameter category is displayed after the level-1 node is expanded. + + #.
In the navigation tree, select the specified parameter category and change the parameter values on the right. + + If you are not sure about the location of a parameter, you can enter the parameter name in search box in the upper right corner. The Manager searches for the parameter in real time and displays the result. + + #. Click **Save**. In the confirmation dialog box, click **OK**. + + #. Wait until the message "Operation succeeded" is displayed. Click **Finish**. The configuration is modified. + + Check whether there is any service whose configuration has expired in the cluster. If yes, restart the corresponding service or role instance for the configuration to take effect. diff --git a/umn/source/backup_and_restoration/backing_up_metadata.rst b/umn/source/backup_and_restoration/backing_up_metadata.rst new file mode 100644 index 0000000..6f3fc91 --- /dev/null +++ b/umn/source/backup_and_restoration/backing_up_metadata.rst @@ -0,0 +1,66 @@ +:original_name: mrs_01_0319.html + +.. _mrs_01_0319: + +Backing Up Metadata +=================== + +Scenario +-------- + +To ensure metadata security or before and after a critical operation (such as scale-out/scale-in, patch installation, upgrade, or migration) on the metadata, you need to back up the metadata. The backup data can be used to recover the system in time if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impact on services. Metadata includes data of OMS, LdapServer, DBService, and NameNode. MRS Manager data to be backed up includes OMS data and LdapServer data. + +By default, metadata backup is supported by the **default** task. This section describes how to create a backup task and back up metadata on MRS. You can back up data both automatically or manually. + +Prerequisites +------------- + +- A standby cluster for backing up data has been created, and the network is normal. For the security group of each cluster, you need to add inbound rules of the security group of the peer cluster to allow access requests from all ECSs in the security group using all protocols and ports. +- The backup type, period, policy, and other specifications have been planned based on the service requirements and you have checked whether *Data storage path*\ **/LocalBackup/** has sufficient space on the active and standby management nodes. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. Create a backup task. + + a. On the cluster details page, click **Backups & Restorations**. + b. On the **Backups** tab page, click **Create Backup Task**. + +#. Configure a backup policy. + + a. Set **Task Name** to the name of the backup task. + + b. Set **Backup Mode** to the type of the backup task. **Periodic** indicates that the backup task is periodically executed and **Manual** indicates that the backup task is manually executed. + + To create a periodic backup task, set the following parameters: + + - **Started**: indicates the time when the task is started for the first time. + - **Period**: indicates the task execution interval. The options include **By hour** and **By day**. + - **Backup Policy**: indicates the volume of data to be backed up in each task execution. Supports **Full backup at the first time and incremental backup later**, **Full backup every time**, and **Full backup once every n times**. 
If you select **Full backup once every n times**, you need to specify the value of **n**. + +#. Select backup sources. + + In the **Configuration** area, select the metadata type, such as **OMS** and **LdapServer**. + +#. Set backup parameters. + + a. Set **Path Type** of **OMS** and **LdapServer** to the backup directory type. + + The following backup directory types are supported: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node and the standby management node automatically synchronizes the backup files. By default, the backup files are stored in *Data storage path*\ **/LocalBackup/**. If you select **LocalDir**, you need to set the maximum number of copies to specify the number of backup files that can be retained in the backup directory. + - **LocalHDFS**: indicates that the backup files are stored in the HDFS directory of the current cluster. If you select **LocalHDFS**, set the following parameters: + + - **Target Path**: indicates the HDFS directory for storing the backup files. The save path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory. + - **Max Number of Backup Copies**: indicates the number of backup files that can be retained in the backup directory. + - **Target Instance Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + b. Click **OK**. + +#. Execute a backup task. In the **Operation** column of the created task in the backup task list, perform the following operations: + + - If **Backup Type** is set to **Periodic**, click **Back Up Now**. + - If **Backup Type** is set to **Manual**, click **Start** to start the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The format of the subdirectory name is *Backup task name*\ **\_**\ *Task creation time*, and the subdirectory is used to save data source backup files. The format of the backup file name is *Version_Data source_Task execution time*\ **.tar.gz**. diff --git a/umn/source/backup_and_restoration/before_you_start.rst b/umn/source/backup_and_restoration/before_you_start.rst new file mode 100644 index 0000000..0262d3b --- /dev/null +++ b/umn/source/backup_and_restoration/before_you_start.rst @@ -0,0 +1,12 @@ +:original_name: mrs_01_0605.html + +.. _mrs_01_0605: + +Before You Start +================ + +This section describes how to back up and restore data on the MRS console. + +Backup and restoration operations on the console apply only to clusters of **MRS 3.x** or earlier. + +Backup and restore operations on Manager apply to all versions. For MRS 3.\ *x* or later, see :ref:`Introduction `. For versions earlier than MRS 3.\ *x*, see :ref:`Introduction `. diff --git a/umn/source/backup_and_restoration/index.rst b/umn/source/backup_and_restoration/index.rst new file mode 100644 index 0000000..e6222b3 --- /dev/null +++ b/umn/source/backup_and_restoration/index.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0316.html + +.. _mrs_01_0316: + +Backup and Restoration +====================== + +- :ref:`Before You Start ` +- :ref:`Introduction ` +- :ref:`Backing Up Metadata ` +- :ref:`Restoring Metadata ` +- :ref:`Modifying Backup Tasks ` +- :ref:`Viewing Backup and Restoration Tasks ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + before_you_start + introduction + backing_up_metadata + restoring_metadata + modifying_backup_tasks + viewing_backup_and_restoration_tasks diff --git a/umn/source/backup_and_restoration/introduction.rst b/umn/source/backup_and_restoration/introduction.rst new file mode 100644 index 0000000..02bd647 --- /dev/null +++ b/umn/source/backup_and_restoration/introduction.rst @@ -0,0 +1,85 @@ +:original_name: mrs_01_0317.html + +.. _mrs_01_0317: + +Introduction +============ + +Overview +-------- + +MRS provides backup and restoration for user data and system data. The backup function is provided based on components to back up Manager data (including OMS data and LdapServer data), Hive user data, component metadata saved in DBService, and HDFS metadata. + +Backup is used in the following scenarios: + +- Routine backup is performed to ensure the data security of the system and components. +- If the system is faulty, the data backup can be used to recover the system. +- If the active cluster is completely faulty, an image cluster identical to the active cluster needs to be created. You can use the backup data to restore the active cluster. + +.. table:: **Table 1** Backing up metadata + + +-------------+-----------------------------------------------------------------------------------------------------------+ + | Backup Type | Backup Content | + +=============+===========================================================================================================+ + | OMS | Database data (excluding alarm data) and configuration data in the cluster management system (by default) | + +-------------+-----------------------------------------------------------------------------------------------------------+ + | LdapServer | User information (about usernames, passwords, keys, password policies, and user groups) | + +-------------+-----------------------------------------------------------------------------------------------------------+ + | DBService | Metadata of the components (Hive) managed by DBService | + +-------------+-----------------------------------------------------------------------------------------------------------+ + | NameNode | HDFS metadata | + +-------------+-----------------------------------------------------------------------------------------------------------+ + +Principles +---------- + +**Task** + +Before backup or restoration, you need to create a backup or restoration task and set task parameters, such as the task name, backup data source, and type of backup file save path. Data backup and restoration can be performed by executing backup and restoration tasks. When MRS is used to recover the data of HDFS, HBase, Hive, and NameNode, no cluster can be accessed. + +Each backup task can back up data of different data sources and generates an independent backup file for each data source. All the backup files generated in each backup task form a backup file set, which can be used in restoration tasks. Backup data can be stored on Linux local disks, local cluster HDFS, and standby cluster HDFS. The backup task provides the full backup or incremental backup policies. HDFS and Hive backup tasks support the incremental backup policy, while OMS, LdapServer, DBService, and NameNode backup tasks support only the full backup policy. + +.. note:: + + Task execution rules: + + - If a task is being executed, the task cannot be executed repeatedly and other tasks cannot be started, either. 
+ - The interval at which a periodical task is automatically executed must be greater than 120s; otherwise, the task is postponed and executed in the next period. Manual tasks can be executed at any interval. + - When a period task is to be automatically executed, the current time cannot be 120s later than the task start time; otherwise, the task is postponed and executed in the next period. + - When a periodical task is locked, it cannot be automatically executed and needs to be manually unlocked. + - Before an OMS, LdapServer, DBService, or NameNode backup task starts, ensure that the LocalBackup partition on the active management node has more than 20 GB available space. Otherwise, the backup task cannot be started. + - When you are planning backup and restoration tasks, select the data to be backed up or restored strictly based on the service logic, data store structure, and database or table association. The system creates a default periodic backup task **default** whose execution interval is 24 hours to perform full backup of OMS, LdapServer, DBService, and NameNode data to the Linux local disk. + +Specifications +-------------- + +.. table:: **Table 2** Backup and restoration feature specifications + + ======================================================= ============== + Item Specifications + ======================================================= ============== + Maximum number of backup or restoration tasks 100 + Number of concurrent running tasks 1 + Maximum number of waiting tasks 199 + Maximum size of backup files on a Linux local disk (GB) 600 + ======================================================= ============== + +.. table:: **Table 3** Specifications of the **default** task + + +---------------------------------+---------------------------------------------------------------------------+------------+-----------+----------+ + | Item | OMS | LdapServer | DBService | NameNode | + +=================================+===========================================================================+============+===========+==========+ + | Backup period | 1 hour | | | | + +---------------------------------+---------------------------------------------------------------------------+------------+-----------+----------+ + | Maximum number of copies | 2 | | | | + +---------------------------------+---------------------------------------------------------------------------+------------+-----------+----------+ + | Maximum size of a backup file | 10 MB | 20 MB | 100 MB | 1.5 GB | + +---------------------------------+---------------------------------------------------------------------------+------------+-----------+----------+ + | Maximum size of disk space used | 20 MB | 40 MB | 200 MB | 3 GB | + +---------------------------------+---------------------------------------------------------------------------+------------+-----------+----------+ + | Save path of backup data | *Save path*\ **/LocalBackup/** of the active and standby management nodes | | | | + +---------------------------------+---------------------------------------------------------------------------+------------+-----------+----------+ + +.. note:: + + The backup data of the **default** task must be periodically transferred and saved outside the cluster based on the enterprise O&M requirements. 
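The note above names two operational requirements: keeping more than 20 GB of free space in the **LocalBackup** partition on the active management node, and periodically transferring the backup data of the **default** task off the cluster. The following shell commands are a minimal sketch of such a routine check, not part of the product itself; the **/srv/BigData** data storage path and the **backup-archive** destination host are assumptions used only for illustration and must be replaced with your actual *Data storage path* and archive target.

.. code-block:: bash

   #!/bin/bash
   # Hypothetical helper; adjust the path and destination to your environment.
   # <Data storage path>/LocalBackup/ on the active management node (example path shown).
   BACKUP_DIR="/srv/BigData/LocalBackup"

   # Metadata backup tasks (OMS, LdapServer, DBService, NameNode) need more than 20 GB free space here.
   avail_kb=$(df --output=avail "${BACKUP_DIR}" | tail -n 1)
   if [ "${avail_kb}" -le $((20 * 1024 * 1024)) ]; then
       echo "Less than 20 GB available in ${BACKUP_DIR}; backup tasks may fail to start." >&2
       exit 1
   fi

   # The default task stores its backup file sets in a subdirectory named <Task name>_<Task creation time>.
   ls -d "${BACKUP_DIR}"/default_* 2>/dev/null

   # Copy the default task's backup file sets to a host outside the cluster (example destination only).
   scp -r "${BACKUP_DIR}"/default_* backup-archive:/srv/mrs-backup-archive/

Whether such a check runs from cron on the active management node or is triggered manually is an operational choice; the documented requirement is only that the **default** task's backup data is transferred out of the cluster regularly.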
diff --git a/umn/source/backup_and_restoration/modifying_backup_tasks.rst b/umn/source/backup_and_restoration/modifying_backup_tasks.rst new file mode 100644 index 0000000..25d1538 --- /dev/null +++ b/umn/source/backup_and_restoration/modifying_backup_tasks.rst @@ -0,0 +1,49 @@ +:original_name: mrs_01_0324.html + +.. _mrs_01_0324: + +Modifying Backup Tasks +====================== + +Scenario +-------- + +You can modify the parameters of a created backup task on MRS to meet changing service requirements. The parameters of restoration tasks can only be viewed but cannot be modified. + +Impact on the System +-------------------- + +After a backup task is modified, the new parameters take effect when the task is executed next time. + +Prerequisites +------------- + +- A backup task has been created. +- A new backup task policy has been planned based on the actual situation. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the cluster details page, choose **Backups & Restorations** > **Backups**. +#. In the task list, locate a specified task, click **Modify** in the **Operation** column to go to the configuration modification page. +#. Modify task parameters on the page that is displayed. + + - The following parameters can be modified for manual backup: + + - Target Path + - Max Number of Backup Copies + + - The following parameters can be modified for periodic backup: + + - Started + - Period + - Target Path + - Max Number of Backup Copies + + .. note:: + + - When **Path Type** is set to **LocalHDFS**, **Target Path** is valid for modifying a backup task. + - After you change the value of **Target Path** for a backup task, full backup is performed by default when the task is executed for the first time. + +#. Click **OK**. diff --git a/umn/source/backup_and_restoration/restoring_metadata.rst b/umn/source/backup_and_restoration/restoring_metadata.rst new file mode 100644 index 0000000..d7e32f8 --- /dev/null +++ b/umn/source/backup_and_restoration/restoring_metadata.rst @@ -0,0 +1,120 @@ +:original_name: mrs_01_0321.html + +.. _mrs_01_0321: + +Restoring Metadata +================== + +Scenario +-------- + +Metadata needs to be recovered in the following scenarios: + +- Data is modified or deleted unexpectedly and needs to be restored. +- After a critical operation (such as an upgrade or critical data adjustment) is performed on metadata components, an exception occurs or the operation does not achieve the expected result. All modules are faulty and become unavailable. +- Data is migrated to a new cluster. + +You can create a metadata restoration task on MRS. The restoration tasks can be created manually only. + +.. important:: + + - Data restoration can be performed only when the system version during data backup is consistent with the current system version. + - To restore the data when services are normal, manually back up the latest management data first and then restore the data. Otherwise, the data that is generated after the data backup and before the data restoration will be lost. + - Use the OMS data and LdapServer data backed up at the same time to restore data. Otherwise, the service and operation may fail. + - By default, MRS clusters use DBService to store Hive metadata. + +Impact on the System +-------------------- + +- Data generated between the backup time and restoration time is lost after data restoration. 
+- After the data is restored, the configuration of the components that depend on DBService may expire and these components need to be restarted. + +Prerequisites +------------- + +- You have checked whether the data in the OMS and LdapServer backup files is backed up at the same time. +- You have checked whether the status of the OMS resource and the LdapServer instance is normal. If the status is abnormal, data restoration cannot be performed. +- You have checked whether the status of the cluster hosts and services is normal. If the status is abnormal, data restoration cannot be performed. +- You have checked whether the cluster host topologies during data restoration and data backup are the same. If they are different, data restoration cannot be performed and you need to back up data again. +- You have checked the services added to the cluster during data restoration and data backup are the same. If they are different, data restoration cannot be performed and you need to back up data again. +- You have checked whether the status of the active and standby DBService instances is normal. If the status is abnormal, data restoration cannot be performed. +- You have stopped the upper-layer applications depending on the cluster. +- On MRS console, you have stopped all the NameNode role instances whose data is to be recovered. Other HDFS role instances must be running properly. After data is recovered, the NameNode role instances need to be restarted. The NameNode role instances cannot be accessed before the restart. +- Check whether NameNode backup files are stored in *Data storage path*\ **/LocalBackup/** on the active management node. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. Check the location of backup data. + + a. On the cluster details page, choose **Backups & Restorations** > **Backups**. + b. In the row where the specified backup task resides, choose **More** > **View History** in the **Operation** column to display the historical execution records of the backup task. In the displayed window, locate a specified success backup record. In the **Operation** column, click **View Backup Path** to open the task execution logs. Find the following information and view the path: + + - **Backup Object**: indicates a backup data source. + - **Backup Path**: indicates the full path where the backup files are stored. + + c. Select the correct path, and manually copy the full path of backup files in **Backup Path**. + +#. Create a restoration task. + + a. On the cluster details page, choose **Backups & Restorations** > **Restorations**. + b. On the page that is displayed, click **Create Restoration Task**. + c. Set **Task Name** to the name of the restoration task. + +#. Select restoration sources. + + In the **Configuration** area, select the metadata component whose data is to be restored. + +#. Set the restoration parameters. + + a. Set **Path Type** to a backup directory type. + b. The settings vary according to backup directory types: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node. If you select **LocalDir**, you need to set **Source Path** to specify the full path of the backup file. 
For example, *Data storage path*\ **/LocalBackup/**\ *Backup task name*\ **\_**\ *Task creation time*\ **/**\ *Data source*\ **\_**\ *Task execution time*\ **/**\ *Version number*\ **\_**\ *Data source*\ **\_**\ *Task execution time*\ **.tar.gz**. + - **LocalHDFS**: indicates that the backup files are stored in the HDFS directory of the current cluster. If you select **LocalHDFS**, set the following parameters: + + - **Source Path**: indicates the full HDFS path of a backup file, for example, *Backup path/Backup task name_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Source Instance Name**: indicates the name of NameService corresponding to the backup directory when a restoration task is being executed. The default value is **hacluster**. + + c. Click **OK**. + +#. Execute the restoration task. + + In the restoration task list, locate the row where the created task resides, and click **Start** in the **Operation** column. + + - After the recovery is successful, the progress bar is in green. + - After the recovery is successful, the recovery task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and try to execute the task again by clicking **Start**. + +#. If the following metadata type is restored, perform the corresponding operations: + + - If the OMS and LdapServer metadata is restored, go to :ref:`7 `. + - If DBService data is restored, no further action is required. + - If NameNode data is restored, choose **Components** > **HDFS** > **More** > **Restart Service** on the MRS cluster details page. No further action is required. + +#. .. _mrs_01_0321__li3654235411916: + + Restart the service for the recovered data to take effect + + a. On the MRS cluster details page, click **Components**. + + b. Choose **LdapServer** > **More** > **Restart Service** and click **OK**. Wait until the LdapServer service is restarted successfully. + + c. Log in to the active management node. For details, see :ref:`Determining Active and Standby Management Nodes of Manager `. + + d. Run the following command to restart the OMS: + + **sh ${BIGDATA_HOME}/om-0.0.1/sbin/restart-oms.sh** + + The command has been executed successfully if the following information is displayed: + + .. code-block:: + + start HA successfully. + + e. On the cluster details page, click **Components**, choose **KrbServer** > **More** > **Synchronize Configuration**. Deselect **Restart the services and instances whose configurations have expired**. Click **Yes** and wait until the KrbServer service configuration is synchronized and restarted successfully. + + f. On the cluster details page, choose **Configuration** > **Synchronize Configuration** in the upper right corner, deselect **Restart the service or instance whose configurations have expired**, and click **Yes**. Wait until the cluster configuration is synchronized successfully. + + g. On the cluster details page, choose **Management Operations** > **Stop All Components** in the upper right corner. After the cluster is stopped, choose **Management Operations** > **Start All Components**, and wait for the cluster to start. diff --git a/umn/source/backup_and_restoration/viewing_backup_and_restoration_tasks.rst b/umn/source/backup_and_restoration/viewing_backup_and_restoration_tasks.rst new file mode 100644 index 0000000..04965ae --- /dev/null +++ b/umn/source/backup_and_restoration/viewing_backup_and_restoration_tasks.rst @@ -0,0 +1,56 @@ +:original_name: mrs_01_0325.html + +.. 
_mrs_01_0325: + +Viewing Backup and Restoration Tasks +==================================== + +Scenario +-------- + +You can view created backup and restoration tasks and check their running status on the MRS console. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the cluster details page, click **Backups & Restorations**. + +#. Click **Backups** or **Restorations**. + +#. In the task list, obtain the previous execution result in the **Task Progress** column. Green indicates that the task is executed successfully, and red indicates that the execution fails. + +#. In the **Operation** column of a specified task in the task list, choose **More** > **View History** to view the historical record of backup and restoration execution. + + In the displayed window, click **View Details** in the **Operation** column. The task execution logs and paths are displayed. + +Related Tasks +------------- + +- Modifying Backup Tasks + + For details, see :ref:`Modifying Backup Tasks `. + +- Viewing Restoration Tasks + + In the **Operation** column of the specified task in the task list, click **View Details** to view the restoration task. You can only view but cannot modify the parameters of a restoration task. + +- Executing Backup and Restoration Tasks + + In the task list, locate a specified task and click **Start** in the **Operation** column to start a backup or restoration task that is ready or fails to be executed. Executed restoration tasks cannot be repeatedly executed. + +- Stopping a Backup Task + + In the task list, locate a specified task and click **More** > **Stop** in the **Operation** column to stop a backup task that is running. + +- Deleting Backup and Restoration Tasks + + In the **Operation** column of the specified task in the task list, choose **More** > **Delete** to delete the backup and restoration tasks. After a task is deleted, the backup data is retained by default. + +- Suspending a Backup Task + + In the **Operation** column of the specified task in the task list, choose **More** > **Suspend** to suspend the backup task. Only periodic backup tasks can be suspended. Suspended backup tasks are no longer executed automatically. When you suspend a backup task that is being executed, the task execution stops. To cancel the suspension status of a task, click **More** > **Resume**. diff --git a/umn/source/change_history.rst b/umn/source/change_history.rst new file mode 100644 index 0000000..58ff638 --- /dev/null +++ b/umn/source/change_history.rst @@ -0,0 +1,158 @@ +:original_name: mrs_01_9003.html + +.. _mrs_01_9003: + +Change History +============== + ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| Release Date | What's New | ++===================================+================================================================================================================================+ +| 2022-11-01 | Modified the following content: | +| | | +| | - Added some FAQ. For details, see :ref:`FAQ `. | +| | - Updated the screenshots in some sections in :ref:`FusionInsight Manager Operation Guide (Applicable to 3.x) `. 
| ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2022-9-29 | Modified the following content: | +| | | +| | Added MRS 3.1.2-LTS.3. For details, see :ref:`Creating a Custom Cluster `. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2021-06-30 | Modified the following content: | +| | | +| | Added MRS 3.1.0-LTS.1. For details, see :ref:`Creating a Custom Cluster `. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2020-03-12 | Accepted for RM-1305 and RM-2779. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2020-03-09 | Modified the following content: | +| | | +| | Added MRS 1.9.2. For details, see :ref:`Creating a Custom Cluster `. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2020-02-22 | Modified the following content: | +| | | +| | - Added MRS 2.1.0. For details, see :ref:`Creating a Custom Cluster `. | +| | - Supported scale-out of nodes with new specifications. For details, see :ref:`Manually Scaling Out a Cluster `. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2019-07-03 | Modified the following content: | +| | | +| | :ref:`Creating a Custom Cluster ` | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2018-10-09 | Accepted in OTC 3.2. 
| ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2018-09-10 | Modified the following content: | +| | | +| | :ref:`Sample Scripts ` | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2018-08-30 | - Added the following content: | +| | | +| | - :ref:`Installing Third-Party Software Using Bootstrap Actions ` | +| | - :ref:`Introduction to Bootstrap Actions ` | +| | - :ref:`Preparing the Bootstrap Action Script ` | +| | - :ref:`View Execution Records ` | +| | - :ref:`Adding a Bootstrap Action ` | +| | - :ref:`Sample Scripts ` | +| | | +| | - Modified the following content: | +| | | +| | - :ref:`Creating a Custom Cluster ` | +| | - :ref:`Creating a Cluster ` | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2018-05-29 | - Modified the following content: | +| | | +| | - :ref:`Creating a Cluster ` | +| | - :ref:`Creating a Custom Cluster ` | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2018-03-16 | - Added the following content: | +| | | +| | - :ref:`Manually Scaling In a Cluster ` | +| | - :ref:`Configuring an Auto Scaling Rule ` | +| | - :ref:`Configuring Message Notification ` | +| | - :ref:`ALM-12014 Device Partition Lost ` | +| | - :ref:`ALM-12015 Device Partition File System Read-Only ` | +| | - :ref:`ALM-12043 DNS Parsing Duration Exceeds the Threshold ` | +| | - :ref:`ALM-12045 Read Packet Dropped Rate Exceeds the Threshold ` | +| | - :ref:`ALM-12046 Write Packet Dropped Rate Exceeds the Threshold ` | +| | - :ref:`ALM-12047 Read Packet Error Rate Exceeds the Threshold ` | +| | - :ref:`ALM-12048 Write Packet Error Rate Exceeds the Threshold ` | +| | - :ref:`ALM-12049 Read Throughput Rate Exceeds the Threshold ` | +| | - :ref:`ALM-12050 Write Throughput Rate Exceeds the Threshold ` | +| | - :ref:`ALM-12051 Disk Inode Usage Exceeds the Threshold ` | +| | - :ref:`ALM-12052 Usage of Temporary TCP Ports Exceeds the Threshold ` | +| | - :ref:`ALM-12053 File Handle Usage Exceeds the Threshold ` | +| | - :ref:`ALM-12054 The Certificate File Is Invalid ` | +| | - :ref:`ALM-12055 The Certificate File Is About to Expire ` | +| | - :ref:`ALM-18008 Heap Memory Usage of Yarn ResourceManager Exceeds the Threshold ` | +| | - :ref:`ALM-18009 Heap Memory Usage of MapReduce JobHistoryServer Exceeds the Threshold ` | +| | - :ref:`ALM-20002 Hue Service Unavailable ` | +| | - :ref:`ALM-43001 Spark Service Unavailable ` | +| | - :ref:`ALM-43006 Heap Memory Usage of the JobHistory Process Exceeds the Threshold ` | +| | - :ref:`ALM-43007 Non-Heap Memory Usage of the JobHistory Process Exceeds the Threshold ` | +| | - :ref:`ALM-43008 Direct Memory Usage of the JobHistory Process Exceeds the Threshold ` | +| | - :ref:`ALM-43009 JobHistory GC Time Exceeds the Threshold ` | +| | - :ref:`ALM-43010 Heap Memory Usage of the JDBCServer Process Exceeds the Threshold ` | +| | - :ref:`ALM-43011 Non-Heap Memory Usage of the JDBCServer Process Exceeds the Threshold ` | +| | - :ref:`ALM-43012 Direct Memory Usage of the JDBCServer Process Exceeds the Threshold ` | +| | - :ref:`ALM-43013 JDBCServer GC Time Exceeds the Threshold ` | +| | | +| | - 
Modified the following content: | +| | | +| | - :ref:`Creating a Cluster ` | +| | - :ref:`Uploading Data and Programs ` | +| | - :ref:`Creating a Job ` | +| | - :ref:`Cluster List ` | +| | - :ref:`Checking the Cluster Status ` | +| | - :ref:`Creating a Custom Cluster ` | +| | - :ref:`Viewing Basic Cluster Information ` | +| | - :ref:`Manually Scaling Out a Cluster ` | +| | - :ref:`Importing and Exporting Data ` | +| | - :ref:`Viewing Information of a Historical Cluster ` | +| | - :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) ` | +| | - :ref:`Changing the Password of an Operation User ` | +| | - :ref:`Initializing the Password of a System User ` | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2018-01-31 | Modified the following contents: | +| | | +| | - :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) ` | +| | - :ref:`Creating a Custom Cluster ` | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2017-11-08 | - Added the following content: | +| | | +| | - :ref:`Web UIs of Open Source Components ` | +| | | +| | - Modified the following contents: | +| | | +| | - :ref:`Creating a Cluster ` | +| | - :ref:`Creating a Custom Cluster ` | +| | - :ref:`Viewing Basic Cluster Information ` | +| | - :ref:`Manually Scaling Out a Cluster ` | +| | - :ref:`Viewing the Alarm List ` | +| | - :ref:`Viewing Information of a Historical Cluster ` | +| | - :ref:`Viewing Job Configuration and Logs ` | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2017-06-09 | - Added the following content: | +| | | +| | - :ref:`Viewing Information of a Historical Cluster ` | +| | - :ref:`Configuring Cross-Cluster Mutual Trust Relationships ` | +| | - :ref:`Configuring Users to Access Resources of a Trusted Cluster ` | +| | | +| | - Modified the following contents: | +| | | +| | - :ref:`Uploading Data and Programs ` | +| | - :ref:`Creating a Job ` | +| | - :ref:`Creating a Custom Cluster ` | +| | - :ref:`Installing a Client (Version 3.x or Later) ` | +| | - :ref:`Installing a Client (Versions Earlier Than 3.x) ` | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2017-04-06 | - Added the following content: | +| | | +| | - :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) ` | +| | - :ref:`MRS Multi-User Permission Management ` | +| | | +| | - Modified the following contents: | +| | | +| | - :ref:`Creating a Custom Cluster ` | +| | - :ref:`Manually Scaling Out a Cluster ` | +| | - :ref:`Viewing Basic Cluster Information ` | +| | - :ref:`Viewing and Manually Clearing an Alarm ` | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ +| 2017-02-20 | This issue is the first official release. 
| ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/configuring_a_cluster/adding_a_tag_to_a_cluster.rst b/umn/source/configuring_a_cluster/adding_a_tag_to_a_cluster.rst new file mode 100644 index 0000000..82d6aab --- /dev/null +++ b/umn/source/configuring_a_cluster/adding_a_tag_to_a_cluster.rst @@ -0,0 +1,107 @@ +:original_name: mrs_01_0048.html + +.. _mrs_01_0048: + +Adding a Tag to a Cluster +========================= + +Tags are used to identify clusters. Adding tags to clusters can help you identify and manage your cluster resources. + +You can add a maximum of 10 tags to a cluster when creating the cluster or add them on the details page of the created cluster. + +A tag consists of a tag key and a tag value. :ref:`Table 1 ` provides tag key and value requirements. + +.. _mrs_01_0048__t7d9a642e3af04b229bf4e8f93954f3ad: + +.. table:: **Table 1** Tag key and value requirements + + +-----------------------+-----------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Requirement | Example | + +=======================+=============================================================================================================================+=======================+ + | Key | A tag key cannot be left blank. | Organization | + | | | | + | | A tag key must be unique in a cluster. | | + | | | | + | | A tag key contains a maximum of 36 characters. | | + | | | | + | | A tag value cannot contain special characters ``(=*<>\,|/)`` or start or end with spaces. | | + +-----------------------+-----------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Value | A tag value contains a maximum of 43 characters. | Apache | + | | | | + | | A tag value cannot contain special characters ``(=*<>\,|/)`` or start or end with spaces. This parameter can be left blank. | | + +-----------------------+-----------------------------------------------------------------------------------------------------------------------------+-----------------------+ + +Adding Tags to a Cluster +------------------------ + +You can perform the following operations to add tags to a cluster when creating the cluster. + +#. Log in to the MRS console. + +#. Click **Create** **Cluster**. The corresponding page is displayed. + +#. Click the **Custom Config** tab. + +#. Configure the cluster software and hardware by referring to :ref:`Creating a Custom Cluster `. + +#. On the **Set Advanced Options** tab page, add a tag. + + Enter the key and value of a tag to be added. + + You can add a maximum of 10 tags to a cluster and use intersections of tags to search for the target cluster. + + .. note:: + + You can also add tags to existing clusters. For details, see :ref:`Managing Tags `. + +Searching for the Target Cluster +-------------------------------- + +On the **Active Clusters** page, search for the target cluster by tag key or tag value. + +#. Log in to the MRS console. + +#. In the upper right corner of the **Active Clusters** page, click **Search by Tag** to access the search page. + +#. Enter the tag of the cluster to be searched. + + You can select a tag key or tag value from their drop-down lists. 
When the tag key or tag value is exactly matched, the system can automatically locate the target cluster. If you enter multiple tags, their intersections are used to search for the cluster. + +#. Click **Search**. + + The system searches for the target cluster by tag key or value. + +.. _mrs_01_0048__section188067265123: + +Managing Tags +------------- + +You can view, add, modify, and delete tags on the **Tags** tab page of the cluster. + +#. Log in to the MRS console. + +#. On the **Active Clusters** page, click the name of a cluster for which you want to manage tags. + + The cluster details page is displayed. + +#. Click the **Tags** tab and view, add, modify, and delete tags on the tab page. + + - View + + On the **Tags** tab page, you can view details about tags of the cluster, including the number of tags and the key and value of each tag. + + - Add + + Click **Add Tag** in the upper left corner. In the displayed **Add Tag** dialog box, enter the key and value of the tag to be added, and click **OK**. + + - Modify + + In the **Operation** column of the tag, click **Edit**. In the displayed **Edit Tag** page, enter new tag key and value and click **OK**. + + - Delete + + In the **Operation** column of the tag, click **Delete**. After confirmation, click **OK** in the displayed page for deleting a tag. + + .. note:: + + MRS cluster tag updates will be synchronized to every ECS in the cluster. You are advised not to modify ECS tags on the ECS console to prevent inconsistency between ECS tags and MRS cluster tags. If the number of tags of an ECS in the MRS cluster reaches the upper limit, you cannot create any tag for the MRS cluster. diff --git a/umn/source/configuring_a_cluster/communication_security_authorization.rst b/umn/source/configuring_a_cluster/communication_security_authorization.rst new file mode 100644 index 0000000..1623eda --- /dev/null +++ b/umn/source/configuring_a_cluster/communication_security_authorization.rst @@ -0,0 +1,90 @@ +:original_name: mrs_01_0786.html + +.. _mrs_01_0786: + +Communication Security Authorization +==================================== + +MRS clusters provision, manage, and use big data components through the management console. Big data components are deployed in a user's VPC. If the MRS management console needs to directly access big data components deployed in the user's VPC, you need to enable the corresponding security group rules after you have obtained user authorization. This authorization process is called secure communications. + +If the secure communications function is not enabled, MRS clusters cannot be created. If you disable the communication after a cluster is created, the cluster status will be **Network channel is not authorized** and the following functions will be affected: + +- Functions, such as big data component installation, cluster scale-out/scale-in, and Master node specification upgrade, are unavailable. +- The cluster running status, alarms, and events cannot be monitored. +- The node management, component management, alarm management, file management, job management, patch management, and tenant management functions on the cluster details page are unavailable. +- The Manager page and the website of each component cannot be accessed. + +After the secure communications function is enabled again, the cluster status is restored to **Running**, and the preceding functions become available. For details, see :ref:`Enabling Secure Communications for Clusters with This Function Disabled `. 
+ +If the security group rules authorized in the cluster are insufficient for you to provision, manage, and use big data components, |image1| is displayed on the right of **Secure Communications**. In this case, click **Update** to update the security group rules. For details, see :ref:`Update `. + +Enabling Secure Communications During Cluster Creation +------------------------------------------------------ + +#. Log in to the MRS console. + +#. Click Create **Cluster**. The corresponding page is displayed. + +#. Click **Quick Config** or **Custom Config**. + +#. Configure cluster information by referring to :ref:`Creating a Custom Cluster `. + +#. In the **Secure Communications** area of the **Advanced Settings** tab page, select **Enable**. + +#. Click **Create** **Now**. + + If Kerberos authentication is enabled for a cluster, check whether Kerberos authentication is required. If yes, click **Continue**. If no, click **Back** to disable Kerberos authentication and then create a cluster. + +Disabling Secure Communications After a Cluster Is Created +---------------------------------------------------------- + +#. Log in to the MRS console. + +#. In the active cluster list, click the name of the cluster for which you want to disable secure communications. + + The cluster details page is displayed. + +#. Click the switch on the right of **Secure Communications** to disable authorization. In the dialog box that is displayed, click **OK**. + + After the authorization is disabled, the cluster status changes to **Network channel unauthorized**, and some functions of the cluster are unavailable. Exercise caution when performing this operation. + +.. _mrs_01_0786__section177319347305: + +Enabling Secure Communications for Clusters with This Function Disabled +----------------------------------------------------------------------- + +#. Log in to the MRS console. + +#. In the active cluster list, click the name of the cluster for which you want to enable secure communications. + + The cluster details page is displayed. + +#. Click the switch on the right of **Secure Communications** to enable the function. + + After the function is enabled, the cluster status changes to **Running**. + +.. _mrs_01_0786__section171375139619: + +Update +------ + +If the security group rules authorized in the cluster are insufficient for you to provision, manage, and use big data components, |image2| is displayed on the right of **Secure Communications**. In this case, click **Update** to update the security group rules. For details, see :ref:`Update `. + +#. Log in to the MRS console. + +#. In the active cluster list, click the name of the cluster for which you want to update secure communications. + + The cluster details page is displayed. + +#. Click **Update** on the right of **Secure Communications**. + + + .. figure:: /_static/images/en-us_image_0000001349257145.png + :alt: **Figure 1** Update + + **Figure 1** Update + +#. Click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0000001349137565.png +.. |image2| image:: /_static/images/en-us_image_0000001296057856.png diff --git a/umn/source/configuring_a_cluster/configuring_an_auto_scaling_rule.rst b/umn/source/configuring_a_cluster/configuring_an_auto_scaling_rule.rst new file mode 100644 index 0000000..4b526ac --- /dev/null +++ b/umn/source/configuring_a_cluster/configuring_an_auto_scaling_rule.rst @@ -0,0 +1,514 @@ +:original_name: mrs_01_0061.html + +.. 
_mrs_01_0061: + +Configuring an Auto Scaling Rule +================================ + +Background +---------- + +In big data application scenarios, especially real-time data analysis and processing, the number of cluster nodes needs to be dynamically adjusted according to data volume changes to provide the required number of resources. The auto scaling function of MRS enables the task nodes of a cluster to be automatically scaled to match cluster loads. If the data volume changes periodically, you can configure an auto scaling rule so that the number of task nodes can be automatically adjusted in a fixed period of time before the data volume changes. + +- Auto scaling rules: You can increase or decrease task nodes based on real-time cluster loads. Auto scaling will be triggered with a certain delay when the data volume changes. +- Resource plans: Set the number of task nodes based on the time range. If the data volume changes periodically, you can create resource plans to resize the cluster before the data volume changes, thereby avoiding delays in increasing or decreasing resources. + +You can configure either auto scaling rules or resource plans or both to trigger auto scaling. Configuring both resource plans and auto scaling rules improves the cluster node scalability to cope with occasionally unexpected data volume peaks. + +In some service scenarios, resources need to be reallocated or service logic needs to be modified after cluster scale-out or scale-in. If you manually scale out or scale in a cluster, you can log in to cluster nodes to reallocate resources or modify service logic. If you use auto scaling, MRS enables you to customize automation scripts for resource reallocation and service logic modification. Automation scripts can be executed before and after auto scaling and automatically adapt to service load changes, eliminating manual operations. In addition, automation scripts can be fully customized and executed at various moments, meeting your personalized requirements and improving auto scaling flexibility. + +- Auto scaling rules: + + - You can set a maximum of five rules for scaling out or in a cluster, respectively. + - The system evaluates scale-out rules and then scale-in rules based on your configuration sequence. Important policies take precedence over other policies to prevent repeated triggering when the expected effect cannot be achieved after a scale-out or scale-in. + - Comparison operators include greater than, greater than or equal to, less than, and less than or equal to. + - Cluster scale-out or scale-in can be triggered only after the configured metric threshold is reached for 5\ *n* consecutive minutes (the default value of *n* is 1). + - After each scale-out or scale-in, there is a cooldown period, which must be greater than 0 and is 20 minutes by default. + - In each cluster scale-out or scale-in, at least one node and at most 100 nodes can be added or removed. + +- Resource plans (setting the number of Task nodes by time range): + + - You can specify a Task node range (minimum number to maximum number) in a time range. If the number of Task nodes is beyond the Task node range in a resource plan, the system triggers cluster scale-out or scale-in. + - You can set a maximum of five resource plans for a cluster. + - A resource plan repeats daily. The start time and end time can be set to any time point between 00:00 and 23:59. The start time must be at least 30 minutes earlier than the end time.
Time ranges configured for different resource plans cannot overlap. + - After a resource plan triggers cluster scale-out or scale-in, there is a 10-minute cooldown period. Auto scaling will not be triggered again within the cooldown period. + - When a resource plan is enabled, the number of Task nodes in the cluster is limited to the default node range you configured outside the time range specified in the resource plan. + - If the resource plan is not enabled, the number of Task nodes is not limited to the default node range. + +- Automation scripts: + + - You can set an automation script so that it can automatically run on cluster nodes when auto scaling is triggered. + - You can set a maximum of 10 automation scripts for a cluster. + - You can specify an automation script to be executed on one or more types of nodes. + - Automation scripts can be executed before or after scale-out or scale-in. + - Before using automation scripts, upload them to a cluster VM or OBS file system in the same region as the cluster. The automation scripts uploaded to the cluster VM can be executed only on the existing nodes. If you want the automation scripts to run on newly added nodes, upload them to the OBS file system. + +Accessing the Auto Scaling Configuration Page +--------------------------------------------- + +You can configure an auto scaling rule on the **Set Advanced Options** page during cluster creation or on the **Nodes** page after the cluster is created. + +**Configuring an auto scaling rule when creating a cluster** + +#. Log in to the MRS console. + +#. When you create a cluster containing task nodes, configure the cluster software and hardware information by referring to :ref:`Creating a Custom Cluster `. Then, on the **Set Advanced Options** page, enable **Analysis Task** and configure or modify auto scaling rules and resource plans. + + You can configure the auto scaling rules by referring to the following scenarios: + + - :ref:`Scenario 1: Using Auto Scaling Rules Alone ` + - :ref:`Scenario 2: Using Resource Plans Alone ` + - :ref:`Scenario 3: Using Both Auto Scaling Rules and Resource Plans ` + +**Configuring an auto scaling rule for an existing cluster** + +#. Log in to the MRS console. + +#. In the navigation pane on the left, choose **Clusters** > **Active Clusters** and click the name of a running cluster to go to the cluster details page. + +#. Click the **Nodes** tab and then **Auto Scaling** in the **Operation** column of the task node group. The **Auto Scaling** page is displayed. + + .. note:: + + - If no task node exists in the cluster, click **Configure Task Node** to add one and then configure the auto scaling rules. + - For MRS 3.\ *x* or later, **Configure Task Node** is available only for analysis clusters, streaming clusters, and hybrid clusters. For details about how to add a task node for a custom cluster of MRS 3.\ *x* or later, see :ref:`Adding a Task Node `. + +#. Enable **Auto Scaling** and configure or modify auto scaling rules and resource plans. + + You can configure the auto scaling rules by referring to the following scenarios: + + - :ref:`Scenario 1: Using Auto Scaling Rules Alone ` + - :ref:`Scenario 2: Using Resource Plans Alone ` + - :ref:`Scenario 3: Using Both Auto Scaling Rules and Resource Plans ` + +.. 
_mrs_01_0061__section15610431184420: + +Scenario 1: Using Auto Scaling Rules Alone +------------------------------------------ + +The following is an example scenario: + +The number of nodes needs to be dynamically adjusted based on the Yarn resource usage. When the memory available for Yarn is less than 20% of the total memory, five nodes need to be added. When the memory available for Yarn is greater than 70% of the total memory, five nodes need to be removed. The number of nodes in a task node group ranges from 1 to 10. + +#. Go to the **Auto Scaling** page to configure auto scaling rules. + + - Configure the **Default Range** parameter. + + Enter a task node range, in which auto scaling is performed. This constraint applies to all scale-in and scale-out rules. The maximum value range allowed is 0 to 500. + + The value range in this example is 1 to 10. + + - Configure an auto scaling rule. + + To enable **Auto Scaling**, you must configure a scale-out or scale-in rule. + + a. Select **Scale-Out** or **Scale-In**. + + b. Click **Add Rule**. + + c. Configure the **Rule Name**, **If**, **Last for**, **Add**, and **Cooldown Period** parameters. + + d. Click **OK**. + + You can view, edit, or delete the rules you configured in the **Scale-out** or **Scale-in** area on the **Auto Scaling** page. You can click **Add Rule** to configure multiple rules. + +#. (Optional) Configure automation scripts. + + Set **Advanced Settings** to **Configure** and click **Created**, or click **Add Automation Script** to go to the **Automation Script** page. + + MRS 3.\ *x* does not support this operation. + + a. Set the following parameters: **Name**, **Script Path**, **Execution Node**, **Parameter**, **Executed**, and **Action upon Failure**. For details about the parameters, see :ref:`Table 4 `. + b. Click **OK** to save the automation script configurations. + +#. Click **OK**. + + .. note:: + + If you want to configure an auto scaling rule for an existing cluster, select **I agree to authorize MRS to scale out or in nodes based on the above rule**. + +.. _mrs_01_0061__section1127519214291: + +Scenario 2: Using Resource Plans Alone +-------------------------------------- + +If the data volume changes regularly every day and you want to scale out or in a cluster before the data volume changes, you can create resource plans to adjust the number of Task nodes as planned in the specified time range. + +Example: + +A real-time processing service sees a sharp increase in data volume from 7:00 to 13:00 every day. Assume that an MRS streaming cluster is used to process the service data. Five task nodes are required from 7:00 to 13:00, while only two are required at other time. + +#. Go to the **Auto Scaling** page to configure a resource plan. + + a. For example, the **Default Range** is set to **2-2**, indicating that the number of Task nodes is fixed to 2 except the time range specified in the resource plan. + + b. Click **Configure Node Range for Specific Time Range** under **Default Range** or **Add Resource Plan**. + + c. Configure **Time Range** and **Node Range**. + + For example, set **Time Range** to **07:00-13:00**, and **Node Range** to **5-5**. This indicates that the number of task nodes is fixed at 5 from 07:00 to 13:00. + + For details about parameter configurations, see :ref:`Table 3 `. You can click **Configure Node Range for Specific Time Range** to configure multiple resource plans. + + .. note:: + + - If you do not set **Node Range**, its default value will be used. 
+ - If you set both **Node Range** and **Time Range**, the node range you set will be used during the time range you set, and the default node range will be used outside that time range. + +#. (Optional) Configure automation scripts. + + Set **Advanced Settings** to **Configure** and click **Created**, or click **Add Automation Script** to go to the **Automation Script** page. + + MRS 3.\ *x* does not support this operation. + + a. Set the following parameters: **Name**, **Script Path**, **Execution Node**, **Parameter**, **Executed**, and **Action upon Failure**. For details about the parameters, see :ref:`Table 4 `. + b. Click **OK** to save the automation script configurations. + +#. Click **OK**. + + .. note:: + + If you want to configure an auto scaling rule for an existing cluster, select **I agree to authorize MRS to scale out or in nodes based on the above rule**. + +.. _mrs_01_0061__section10800203113299: + +Scenario 3: Using Both Auto Scaling Rules and Resource Plans +------------------------------------------------------------ + +If the data volume is unstable and unexpected fluctuations may occur, a fixed Task node range cannot guarantee that the requirements in some service scenarios are met. In this case, it is necessary to adjust the number of Task nodes based on the real-time loads and resource plans. + +The following is an example scenario: + +A real-time processing service sees an unstable increase in data volume from 7:00 to 13:00 every day. For example, 5 to 8 task nodes are required from 7:00 to 13:00, and 2 to 4 are required beyond this period. Therefore, you can set an auto scaling rule based on a resource plan. When the data volume exceeds the expected value, the number of Task nodes can be adjusted if resource loads change, without exceeding the node range specified in the resource plan. When a resource plan is triggered, the number of nodes is adjusted within the specified node range with minimum impact. That is, the number of nodes is increased no higher than the upper limit and decreased no lower than the lower limit of the plan. + +#. Go to the **Auto Scaling** page to configure auto scaling rules. + + - **Default Range** + + Enter a task node range, in which auto scaling is performed. This constraint applies to all scale-in and scale-out rules. + + For example, this parameter is set to **2-4** in this scenario. + + - **Auto Scaling** + + To enable **Auto Scaling**, you must configure a scale-out or scale-in rule. + + a. Select **Scale-Out** or **Scale-In**. + + b. Click **Add Rule**. The **Add Rule** page is displayed. + + c. Configure the **Rule Name**, **If**, **Last for**, **Add**, and **Cooldown Period** parameters. + + d. Click **OK**. + + You can view, edit, or delete the rules you configured in the **Scale-out** or **Scale-in** area on the **Auto Scaling** page. + +#. Configure a resource plan. + + a. Click **Configure Node Range for Specific Time Range** under **Default Range** or **Add Resource Plan**. + + b. Configure **Time Range** and **Node Range**. + + For example, **Time Range** is set to **07:00-13:00** and **Node Range** to **5-8**. + + For details about parameter configurations, see :ref:`Table 3 `. You can click **Configure Node Range for Specific Time Range** or **Add Resource Plan** to configure multiple resource plans. + + .. note:: + + - If you do not set **Node Range**, its default value will be used.
+ - If you set both **Node Range** and **Time Rang**\ e, the node range you set will be used during the time range you set, and the default node range will be used beyond the time range you set. If the time is not within the configured time range, the default range is used. + +#. (Optional) Configure automation scripts. + + Set **Advanced Settings** to **Configure** and click **Created**, or click **Add Automation Script** to go to the **Automation Script** page. + + MRS 3.\ *x* does not support this operation. + + a. Set the following parameters: **Name**, **Script Path**, **Execution Node**, **Parameter**, **Executed**, and **Action upon Failure**. For details about the parameters, see :ref:`Table 4 `. + b. Click **OK** to save the automation script configurations. + +#. Click **OK**. + + .. note:: + + If you want to configure an auto scaling rule for an existing cluster, select **I agree to authorize MRS to scale out or in nodes based on the above rule**. + +Related Information +------------------- + +When adding a rule, you can refer to :ref:`Table 1 ` to configure the corresponding metrics. + +.. _mrs_01_0061__table15133845184415: + +.. table:: **Table 1** Auto scaling metrics + + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | Cluster Type | Metric | Value Type | Description | + +===================+==========================================+=================+==============================================================================================================+ + | Streaming cluster | StormSlotAvailable | Integer | Number of available Storm slots | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSlotAvailablePercentage | Percentage | Percentage of available Storm slots, that is, the proportion of the available slots to total slots | + | | | | | + | | | | Value range: 0 to 100 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSlotUsed | Integer | Number of the used Storm slots | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSlotUsedPercentage | Percentage | Percentage of the used Storm slots, that is, the proportion of the used slots to total slots | + | | | | | + | | | | Value range: 0 to 100 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSupervisorMemAverageUsage | Integer | Average memory usage of the Supervisor process of Storm | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSupervisorMemAverageUsagePercentage | Percentage | Average percentage of the used memory of the Supervisor 
process of Storm to the total memory of the system | + | | | | | + | | | | Value range: 0 to 100 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSupervisorCPUAverageUsagePercentage | Percentage | Average percentage of the used CPUs of the Supervisor process of Storm to the total CPUs | + | | | | | + | | | | Value range: 0 to 6000 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | Analysis cluster | YARNAppPending | Integer | Number of pending tasks on YARN | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNAppPendingRatio | Ratio | Ratio of pending tasks on Yarn, that is, the ratio of pending tasks to running tasks on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNAppRunning | Integer | Number of running tasks on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNContainerAllocated | Integer | Number of containers allocated to Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNContainerPending | Integer | Number of pending containers on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNContainerPendingRatio | Ratio | Ratio of pending containers on Yarn, that is, the ratio of pending containers to running containers on Yarn. 
| + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNCPUAllocated | Integer | Number of virtual CPUs (vCPUs) allocated to Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNCPUAvailable | Integer | Number of available vCPUs on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNCPUAvailablePercentage | Percentage | Percentage of available vCPUs on Yarn, that is, the proportion of available vCPUs to total vCPUs | + | | | | | + | | | | Value range: 0 to 100 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNCPUPending | Integer | Number of pending vCPUs on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNMemoryAllocated | Integer | Memory allocated to Yarn. The unit is MB. | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNMemoryAvailable | Integer | Available memory on Yarn. The unit is MB. | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNMemoryAvailablePercentage | Percentage | Percentage of available memory on Yarn, that is, the proportion of available memory to total memory on Yarn | + | | | | | + | | | | Value range: 0 to 100 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNMemoryPending | Integer | Pending memory on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-------------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + +.. _mrs_01_0061__table12336184610200: + +.. 
table:: **Table 2** Auto scaling metrics + + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | Cluster Type | Metric | Value Type | Description | + +=================+==========================================+=================+==============================================================================================================+ + | Custom | StormSlotAvailable | Integer | Number of available Storm slots | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSlotAvailablePercentage | Percentage | Percentage of available Storm slots, that is, the proportion of the available slots to total slots | + | | | | | + | | | | Value range: 0 to 100 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSlotUsed | Integer | Number of the used Storm slots | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSlotUsedPercentage | Percentage | Percentage of the used Storm slots, that is, the proportion of the used slots to total slots | + | | | | | + | | | | Value range: 0 to 100 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSupervisorMemAverageUsage | Integer | Average memory usage of the Supervisor process of Storm | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSupervisorMemAverageUsagePercentage | Percentage | Average percentage of the used memory of the Supervisor process of Storm to the total memory of the system | + | | | | | + | | | | Value range: 0 to 100 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | StormSupervisorCPUAverageUsagePercentage | Percentage | Average percentage of the used CPUs of the Supervisor process of Storm to the total CPUs | + | | | | | + | | | | Value range: 0 to 6000. 
| + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNAppPending | Integer | Number of pending tasks on YARN | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNAppPendingRatio | Ratio | Ratio of pending tasks on Yarn, that is, the ratio of pending tasks to running tasks on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNAppRunning | Integer | Number of running tasks on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNContainerAllocated | Integer | Number of containers allocated to Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNContainerPending | Integer | Number of pending containers on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNContainerPendingRatio | Ratio | Ratio of pending containers on Yarn, that is, the ratio of pending containers to running containers on Yarn. 
| + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNCPUAllocated | Integer | Number of virtual CPUs (vCPUs) allocated to Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNCPUAvailable | Integer | Number of available vCPUs on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNCPUAvailablePercentage | Percentage | Percentage of available vCPUs on Yarn, that is, the proportion of available vCPUs to total vCPUs | + | | | | | + | | | | Value range: 0 to 100 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNCPUPending | Integer | Number of pending vCPUs on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNMemoryAllocated | Integer | Memory allocated to Yarn. The unit is MB. | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNMemoryAvailable | Integer | Available memory on Yarn. The unit is MB. | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNMemoryAvailablePercentage | Percentage | Percentage of available memory on Yarn, that is, the proportion of available memory to total memory on Yarn | + | | | | | + | | | | Value range: 0 to 100 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + | | YARNMemoryPending | Integer | Pending memory on Yarn | + | | | | | + | | | | Value range: 0 to 2147483646 | + +-----------------+------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + +.. note:: + + - When the value type is percentage or ratio in :ref:`Table 1 `, the valid value can be accurate to percentile. The percentage metric value is a decimal value with a percent sign (%) removed. For example, 16.80 represents 16.80%. + - Hybrid clusters support all metrics of analysis and streaming clusters. + + When the value type is percentage or ratio in :ref:`Table 2 `, the valid value can be accurate to percentile. The percentage metric value is a decimal value with a percent sign (%) removed. For example, 16.80 represents 16.80%. 
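In practice, a scale-out or scale-in rule compares one of the metrics above against a threshold (**If**) for a number of consecutive periods (**Last for**), adjusts the node count (**Add**), respects the **Cooldown Period**, and is always clamped to the default node range. The following Python sketch only illustrates these semantics; the names and structure are hypothetical and do not represent the MRS implementation.

.. code-block:: python

   # Illustrative sketch only: hypothetical names, not the MRS implementation.
   import operator
   from dataclasses import dataclass

   # Comparison operators that a rule's "If" condition can use.
   OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

   @dataclass
   class Rule:
       metric: str        # for example "YARNMemoryAvailablePercentage"
       comparison: str    # one of >, >=, <, <=
       threshold: float   # percentage values are written without the % sign, e.g. 20 means 20%
       periods: int       # "Last for": number of consecutive 5-minute periods
       node_change: int   # nodes to add (scale-out) or remove (scale-in, negative value)
       cooldown_min: int  # cooldown period in minutes

   def apply_rule(rule: Rule, samples: list[float], current_nodes: int,
                  node_range: tuple[int, int], minutes_since_last_action: float) -> int:
       """Return the new Task node count after evaluating one rule, clamped to the default range."""
       if minutes_since_last_action < rule.cooldown_min:
           return current_nodes                      # still cooling down, do nothing
       recent = samples[-rule.periods:]
       if len(recent) < rule.periods:
           return current_nodes                      # not enough consecutive samples yet
       if all(OPS[rule.comparison](value, rule.threshold) for value in recent):
           lower, upper = node_range
           return max(lower, min(upper, current_nodes + rule.node_change))
       return current_nodes

   # Scale out by 5 nodes when the available Yarn memory stays below 20% for one 5-minute period.
   rule = Rule("YARNMemoryAvailablePercentage", "<", 20, periods=1, node_change=5, cooldown_min=20)
   print(apply_rule(rule, samples=[16.80], current_nodes=3,
                    node_range=(1, 10), minutes_since_last_action=25))   # prints 8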
+ +When adding a resource plan, you can set parameters by referring to :ref:`Table 3 `. + +.. _mrs_01_0061__table1846575414619: + +.. table:: **Table 3** Configuration items of a resource plan + + +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration Item | Description | + +====================+=====================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Time Range | Start time and End time of a resource plan are accurate to minutes, with the value ranging from **00:00** to **23:59**. For example, if a resource plan starts at 8:00 and ends at 10:00, set this parameter to 8:00-10:00. The end time must be at least 30 minutes later than the start time. | + +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Node Range | The number of nodes in a resource plan ranges from **0** to **500**. In the time range specified in the resource plan, if the number of Task nodes is less than the specified minimum number of nodes, it will be increased to the specified minimum value of the node range at a time. If the number of Task nodes is greater than the maximum number of nodes specified in the resource plan, the auto scaling function reduces the number of Task nodes to the maximum value of the node range at a time. The minimum number of nodes must be less than or equal to the maximum number of nodes. 
| + +--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. note:: + + - When a resource plan is enabled, the **Default Range** value on the auto scaling page forcibly takes effect beyond the time range specified in the resource plan. For example, if **Default Range** is set to **1-2**, **Time Range** is between **08:00-10:00**, and **Node Range** is **4-5** in a resource plan, the number of Task nodes in other periods (0:00-8:00 and 10:00-23:59) of a day is forcibly limited to the default node range (1 to 2). If the number of nodes is greater than 2, auto scale-in is triggered; if the number of nodes is less than 1, auto scale-out is triggered. + - When a resource plan is not enabled, the **Default Range** takes effect in all time ranges. If the number of nodes is not within the default node range, the number of Task nodes is automatically increased or decreased to the default node range. + - Time ranges of resource plans cannot be overlapped. The overlapped time range indicates that two effective resource plans exist at a time point. For example, if resource plan 1 takes effect from **08:00** to **10:00** and resource plan 2 takes effect from **09:00** to **11:00**, the time range between **09:00** to **10:00** is overlapped. + - The time range of a resource plan must be on the same day. For example, if you want to configure a resource plan from **23:00** to **01:00** in the next day, configure two resource plans whose time ranges are **23:00-00:00** and **00:00-01:00**, respectively. + +When adding an automation script, you can set related parameters by referring to :ref:`Table 4 `. + +.. _mrs_01_0061__table15644113520578: + +.. table:: **Table 4** Configuration items of an automation script + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration Item | Description | + +===================================+===========================================================================================================================================================================================================+ + | Name | Automation script name. | + | | | + | | The value can contain only digits, letters, spaces, hyphens (-), and underscores (_) and must not start with a space. | + | | | + | | The value can contain 1 to 64 characters. | + | | | + | | .. note:: | + | | | + | | A name must be unique in the same cluster. You can set the same name for different clusters. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Script Path | Script path. The value can be an OBS file system path or a local VM path. 
| + | | | + | | - An OBS file system path must start with **s3a://** and end with **.sh**, for example, **s3a://mrs-samples/**\ *xxx*\ **.sh**. | + | | - A local VM path must start with a slash (/) and end with **.sh**. For example, the path of the example script for installing the Zepelin is **/opt/bootstrap/zepelin/zepelin_install.sh**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Execution Node | Select a type of the node where an automation script is executed. | + | | | + | | .. note:: | + | | | + | | - If you select **Master** nodes, you can choose whether to run the script only on the active Master nodes by enabling or disabling the **Active Master** switch. | + | | - If you enable it, the script runs only on the active Master nodes. If you disable it, the script runs on all Master nodes. This switch is disabled by default. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Automation script parameter. The following predefined variables can be imported to obtain auto scaling information: | + | | | + | | - **${mrs_scale_node_num}**: Number of auto scaling nodes. The value is always positive. | + | | - **${mrs_scale_type}**: Scale-out/in type. The value can be **scale_out** or **scale_in**. | + | | - **${mrs_scale_node_hostnames}**: Host names of the auto scaling nodes. Use commas (,) to separate multiple host names. | + | | - **${mrs_scale_node_ips}**: IP address of the auto scaling nodes. Use commas (,) to separate multiple IP addresses. | + | | - **${mrs_scale_rule_name}**: Name of the triggered auto scaling rule. For a resource plan, this parameter is set to **resource_plan**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Executed | Time for executing an automation script. The following four options are supported: **Before scale-out**, **After scale-out**, **Before scale-in**, and **After scale-in**. | + | | | + | | .. note:: | + | | | + | | Assume that the execution nodes include Task nodes. | + | | | + | | - The automation script executed before scale-out cannot run on the Task nodes to be added. | + | | - The automation script executed after scale-out can run on the added Task nodes. | + | | - The automation script executed before scale-in can run on Task nodes to be deleted. | + | | - The automation script executed after scale-in cannot run on the deleted Task nodes. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Action upon Failure | Whether to continue to execute subsequent scripts and scale-out/in after the script fails to be executed. | + | | | + | | .. 
note:: | + | | | + | | - You are advised to set this parameter to **Continue** in the commissioning phase so that the cluster can continue the scale-out/in operation no matter whether the script is executed successfully. | + | | - If the script fails to be executed, view the log in **/var/log/Bootstrap** on the cluster VM. | + | | - The scale-in operation cannot be rolled back. Therefore, the **Action upon Failure** can only be set to **Continue** after scale-in. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. note:: + + The automation script is triggered only during auto scaling. It is not triggered when the cluster node is manually scaled out or in. diff --git a/umn/source/configuring_a_cluster/creating_a_custom_cluster.rst b/umn/source/configuring_a_cluster/creating_a_custom_cluster.rst new file mode 100644 index 0000000..f4d1232 --- /dev/null +++ b/umn/source/configuring_a_cluster/creating_a_custom_cluster.rst @@ -0,0 +1,391 @@ +:original_name: mrs_01_0513.html + +.. _mrs_01_0513: + +Creating a Custom Cluster +========================= + +The first step of using MRS is to create a cluster. This section describes how to create a cluster on the **Custom Config** tab of the MRS management console. + +You can create an IAM user or user group on the IAM management console and grant it specific operation permissions, to perform refined resource management after registering an account. For details, see :ref:`Creating an MRS User `. + +#. Log in to the MRS console. + +#. Click **Create Cluster**. The page for creating a cluster is displayed. + + .. note:: + + When creating a cluster, pay attention to quota notification. If a resource quota is insufficient, increase the resource quota as prompted and create a cluster. + +#. Click the **Custom Config** tab. + +#. Configure cluster information by referring to :ref:`Software Configurations ` and click **Next**. + +#. Configure cluster information by referring to :ref:`Hardware Configurations ` and click **Next**. + +#. Set advanced options by referring to :ref:`(Optional) Advanced Configuration ` and click **Apply Now**. + + If Kerberos authentication is enabled for a cluster, check whether Kerberos authentication is required. If yes, click **Continue**. If no, click **Back** to disable Kerberos authentication and then create a cluster. + + .. note:: + + For any doubt about the pricing, click **Pricing details** in the lower left corner. + +#. Click **Back to Cluster List** to view the cluster status. + + For details about cluster status during creation, see the description of the status parameters in :ref:`Table 1 `. + + It takes some time to create a cluster. The initial status of the cluster is **Starting**. After the cluster has been created successfully, the cluster status becomes **Running**. + + On the MRS management console, a maximum of 10 clusters can be concurrently created, and a maximum of 100 clusters can be managed. + +.. _mrs_01_0513__section48591411155214: + +Software Configurations +----------------------- + +.. 
table:: **Table 1** MRS cluster software configuration + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+======================================================================================================================================================================================================================================================================================+ + | Region | Select a region. | + | | | + | | Cloud service products in different regions cannot communicate with each other over an intranet. For low network latency and quick access, select the nearest region. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Name | The cluster name must be unique. | + | | | + | | A cluster name can contain 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | The default name is **mrs**\ \_\ *xxxx*. *xxxx* is a random collection of letters and digits. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Version | Currently, MRS 1.6.3, 1.7.2, 1.9.2, 2.1.0, 3.1.0-LTS.1, and 3.1.2-LTS.3 are supported. The latest version of MRS is used by default. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Type | The cluster types are as follows: | + | | | + | | - **Analysis cluster**: is used for offline data analysis and provides Hadoop components. | + | | - **Streaming cluster**: is used for streaming tasks and provides stream processing components. | + | | - **Hybrid cluster**: is used for both offline data analysis and streaming processing and provides Hadoop components and streaming processing components. You are advised to use a hybrid cluster to perform offline data analysis and streaming processing tasks at the same time. | + | | - **Custom**: You can adjust the cluster service deployment mode based on service requirements. For details, see :ref:`Creating a Custom Topology Cluster `. (This type is currently available only in MRS 3.\ *x*.) | + | | | + | | .. note:: | + | | | + | | - MRS streaming clusters do not support job and file management functions. | + | | - To install all components in a cluster, select **Custom**. 
| + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Components | MRS components are as follows.. | + | | | + | | **Components of an analysis cluster:** | + | | | + | | - Presto: open source and distributed SQL query engine | + | | | + | | - Hadoop: distributed system architecture | + | | | + | | - Spark: in-memory distributed computing framework (not supported in MRS 3.\ *x*) | + | | | + | | - Spark2x: A fast general-purpose engine for large-scale data processing. It is developed based on the open-source Spark2.x version. (supported only in MRS 3.\ *x*) | + | | | + | | - Hive: data warehouse framework built on Hadoop | + | | | + | | - OpenTSDB: a distributed, scalable time series database that can store and serve massive amounts of time series data without losing granularity (not supported in MRS 3.\ *x*) | + | | | + | | - HBase: distributed column-oriented database | + | | | + | | - Tez: an application framework which allows for a complex directed-acyclic-graph of tasks for processing data | + | | | + | | - Hue: provides the Hadoop UI capability, which enables users to analyze and process Hadoop cluster data on browsers | + | | | + | | - Loader: a tool based on source Sqoop 1.99.7, designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases (not supported in MRS 3.\ *x*) | + | | | + | | Hadoop is mandatory, and Spark and Hive must be used together. Select components based on service requirements. | + | | | + | | - Flink: a distributed big data processing engine that can perform stateful computations over both finite and infinite data streams | + | | | + | | - Oozie: a Hadoop job scheduling system (supported only in MRS 3.\ *x*) | + | | | + | | - HetuEngine: a distributed SQL query engine for heterogeneous big data sets (supported only in MRS 3.1.\ *x*-LTS) | + | | | + | | - Alluxio: a memory speed virtual distributed storage system | + | | | + | | - Ranger: a framework to enable, monitor, and manage data security across the Hadoop platform | + | | | + | | - Impala: an SQL query engine for processing huge volumes of data | + | | | + | | - ClickHouse: A column database management system (DBMS) for on-line analytical processing (OLAP). The ClickHouse cluster table engine that uses Kunpeng as the CPU architecture does not support HDFS and Kafka. 
| + | | | + | | - Kudu: a column-oriented data store | + | | | + | | **Components of a streaming cluster:** | + | | | + | | - Kafka: distributed messaging system | + | | - KafkaManager: Kafka cluster monitoring management tool (not supported in MRS 3.\ *x*) | + | | - Flume: distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data | + | | - ZooKeeper: a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services (supported only in MRS 3.\ *x*) | + | | - Ranger: a framework to enable, monitor, and manage data security across the Hadoop platform (supported only in MRS 3.\ *x*) | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Component Port | Use the default **Open source**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _mrs_01_0513__section1055918551: + +Hardware Configurations +----------------------- + +.. table:: **Table 2** MRS cluster hardware configuration + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | AZ | Select the AZ associated with the region of the cluster. | + | | | + | | An AZ is a physical area that uses independent power and network resources. AZs are physically isolated but interconnected through the internal network. This improves the availability of applications. You are advised to create clusters in different AZs. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | VPC | A VPC is a secure, isolated, and logical network environment. | + | | | + | | Select the VPC for which you want to create a cluster and click **View VPC** to view the name and ID of the VPC. If no VPC is available, create one. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Subnet | A subnet provides dedicated network resources that are isolated from other networks, improving network security. | + | | | + | | Select the subnet for which you want to create a cluster. Click **View Subnet** to view details about the selected subnet. If no subnet is created in the VPC, go to the VPC console and choose **Subnets** > **Create Subnet** to create one. For details about how to configure network ACL outbound rules, see :ref:`How Do I Configure a Network ACL Outbound Rule? ` | + | | | + | | .. note:: | + | | | + | | In MRS, IP addresses are automatically assigned to clusters during cluster creation basically based on the following formula: Quantity of IP addresses = Number of cluster nodes + 2 (one for Manager; one for the DB). In addition, if the Hadoop, Hue, Sqoop, and Presto or Loader and Presto components are selected during cluster deployment, one IP address is added for each component. To create a ClickHouse cluster independently, the number of IP addresses required is calculated as follows: Number of IP addresses = Number of cluster nodes + 1 (for Manager). | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Security Group | A security group is a set of ECS access rules. It provides access policies for ECSs that have the same security protection requirements and are mutually trusted in a VPC. | + | | | + | | When you create a cluster, you can select **Auto create** from the drop-down list of **Security Group** to create a security group or select an existing security group. 
| + | | | + | | .. note:: | + | | | + | | When you select a security group created by yourself, ensure that the inbound rule contains a rule in which **Protocol** is set to **All**, **Port** is set to **All**, and **Source** is set to a trusted accessible IP address range. Do not use **0.0.0.0/0** as a source address. Otherwise, security risks may occur. If you do not know the trusted accessible IP address range, select **Auto create**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | EIP | After binding an EIP to an MRS cluster, you can use the EIP to access the Manager web UI of the cluster. | + | | | + | | When creating a cluster, you can select an available EIP from the drop-down list and bind it. If no EIP is available in the drop-down list, click **Manage EIP** to access the **EIPs** service page to create one. | + | | | + | | .. note:: | + | | | + | | This parameter is valid only in MRS 1.8.0 or later. | + | | | + | | The EIP must be in the same region as the cluster. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Enterprise Project | Select the enterprise project to which the cluster belongs. To use an enterprise project, create one on the **Enterprise** > **Project Management** page. | + | | | + | | The **Enterprise Management** console of the enterprise project is designed for resource management. It helps enterprises manage cloud-based personnel, resources, permissions, and finance in a hierarchical manner, such as management of companies, departments, and projects. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. 
table:: **Table 3** Cluster node information + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+============================================================================================================================================================================================================================================================================================================================================================================================================+ + | Common Node Configurations | This parameter is valid only when **Cluster Type** is set to **Custom**. For details, see :ref:`Custom Cluster Template Description `. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Node Type | MRS provides three types of nodes: | + | | | + | | - Master: A Master node in an MRS cluster manages the cluster, assigns executable cluster files to Core nodes, traces the execution status of each job, and monitors the DataNode running status. | + | | | + | | - Core: A Core node in a cluster processes data and stores process data in HDFS. Analysis Core nodes are created in an analysis cluster. Streaming Core nodes are created in a streaming cluster. Both analysis and streaming Core nodes are created in a hybrid cluster. | + | | | + | | - Task: A Task node in a cluster is used for computing and does not store persistent data. Yarn and Storm are mainly installed on Task nodes. Task nodes are optional, and the number of Task nodes can be zero. Analysis Task nodes are created in an analysis cluster. Streaming Task nodes are created in a streaming cluster. Both analysis and streaming Task nodes are created in a hybrid cluster. | + | | | + | | When the data volume change is small in a cluster but the cluster's service processing capabilities need to be remarkably and temporarily improved, add Task nodes to address the following situations: | + | | | + | | - Service volumes temporarily increase, for example, report processing at the end of the year. | + | | - Long-term tasks must be completed in a short time, for example, some urgent analysis tasks. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Instance Specifications | Instance specifications of Master or Core nodes. MRS supports host specifications determined by CPU, memory, and disk space. 
Click |image1| to configure the instance specifications, system disk, and data disk parameters of the cluster node. | + | | | + | | .. note:: | + | | | + | | - More advanced instance specifications provide better data processing. | + | | - If you select non-HDD disks for Core nodes, the disk types of Master and Core nodes are determined by **Data Disk**. | + | | - For MRS 3.\ *x* or later, the memory of the master node must be greater than 64 GB. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | System Disk | Storage type and storage space of the system disk on a node. | + | | | + | | Storage type can be any of the following: | + | | | + | | - SATA: common I/O | + | | - SAS: high I/O | + | | - SSD: ultra-high I/O | + | | - GPSSD: general-purpose SSD | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Disk | Data disk storage space of a node. To increase data storage capacity, you can add disks at the same time when creating a cluster. The following two application scenarios are involved. | + | | | + | | - Data storage and computing are separated. Data is stored in OBS, which features low cost and unlimited storage capacity. The clusters can be terminated at any time in OBS. The computing performance is determined by OBS access performance and is lower than that of HDFS. This configuration is recommended if data computing is infrequent. | + | | - Data storage and computing are not separated. Data is stored in HDFS, which features high cost, high computing performance, and limited storage capacity. Before terminating clusters, you must export and store the data. This configuration is recommended if data computing is frequent. | + | | | + | | The storage type can be any of the following: | + | | | + | | - SATA: common I/O | + | | - SAS: high I/O | + | | - SSD: ultra-high I/O | + | | - GPSSD: general-purpose SSD | + | | | + | | .. note:: | + | | | + | | More nodes in a cluster require higher disk capacity of Master nodes. To ensure stable cluster running, set the disk capacity of the Master node to over 600 GB if the number of nodes is 300 and increase it to over 1 TB if the number of nodes reaches 500. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Instance Count | Number of Master and Core nodes. | + | | | + | | For Master nodes: | + | | | + | | - If **Cluster HA** is enabled, the number of Master nodes is fixed to **2**. 
| + | | - If **Cluster HA** is disabled, the number of Master nodes is fixed to **1**. | + | | | + | | At least one Core node must exist and the total number of Core and Task nodes cannot exceed 500. | + | | | + | | Task: Click |image2| to add a Task node. Click |image3| to modify the instance specifications and disk configuration of a Task node. Click |image4| to delete the added Task node. | + | | | + | | .. note:: | + | | | + | | - A maximum of 500 Core nodes are supported by default. If more than 500 Core nodes are required, contact technical support. | + | | - A small number of nodes may cause clusters to run slowly while a large number of nodes may be unnecessarily costly. Set an appropriate value based on data to be processed. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Topology Adjustment | If the deployment mode in the **Common Node** does not meet the requirements, set **Topology Adjustment** to **Enable** and adjust the instance deployment mode based on service requirements. For details, see :ref:`Topology Adjustment for a Custom Cluster `. This parameter is valid only when **Cluster Type** is set to **Custom**. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _mrs_01_0513__section15766698552: + +(Optional) Advanced Configuration +--------------------------------- + +.. table:: **Table 4** MRS cluster advanced configuration topology + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+====================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Tag | For details, see :ref:`Adding a Tag to a Cluster `. 
| +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Hostname Prefix                   | Enter a prefix for the computer hostnames of the ECSs in the cluster.                                                                                                                                                                                                                               | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Auto Scaling                      | Auto scaling can be configured only after you specify Task node specifications in the **Configure Hardware** step. For details about how to configure Task node specifications, see :ref:`Configuring an Auto Scaling Rule `.                                                                       | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Agency                            | By binding an agency, you allow ECSs or BMSs to manage some of your resources. Determine whether to configure an agency based on the actual service scenario.                                                                                                                                       | + |                                   |                                                                                                                                                                                                                                                                                                      | + |                                   | For example, you can configure an agency of the ECS type to automatically obtain the AK/SK to access OBS. For details, see :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) `.                                                                                                        | + |                                   |                                                                                                                                                                                                                                                                                                      | + |                                   | The **MRS_ECS_DEFAULT_AGENCY** agency has the OBSOperateAccess permission of OBS and the CESFullAccess (for users who have enabled fine-grained policies), CES Administrator, and KMS Administrator permissions in the region where the cluster is located.                                         | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Alarm                             | If the alarm function is enabled, the cluster maintenance personnel can be notified in a timely manner to locate faults when the cluster runs abnormally or the system is faulty.
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Rule Name | Name of the rule for sending alarm messages. The value can contain only digits, letters, hyphens (-), and underscores (_). | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Topic Name | Select an existing topic or click **Create Topic** to create a topic. To deliver messages published to a topic, you need to add a subscriber to the topic. For details, see :ref:`Adding Subscriptions to a Topic `. | + | | | + | | A topic serves as a message sending channel, where publishers and subscribers can interact with each other. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Logging | Whether to collect logs when cluster creation fails. | + | | | + | | After the logging function is enabled, system logs and component run logs are automatically collected and saved to the OBS file system in scenarios such as cluster creation failures and scale-out or scale-in failures for O&M personnel to quickly locate faults. The log information is retained for a maximum of seven days. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Kerberos Authentication | Whether to enable Kerberos authentication when logging in to Manager. | + | | | + | | - |image5|: If **Kerberos Authentication** is disabled, common users can use all functions of an MRS cluster. You are advised to disable Kerberos authentication in single-user scenarios. | + | | - |image6|: If **Kerberos Authentication** is enabled, common users cannot use the file and job management functions of an MRS cluster and cannot view cluster resource usage or the job records for Hadoop and Spark. 
To use more cluster functions, the users must contact the Manager administrator to assign more permissions. You are advised to enable Kerberos authentication in multi-user scenarios. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Username | Name of the administrator of Manager. **admin** is used by default. | + | | | + | | For versions earlier than MRS 1.7.2, this parameter needs to be configured only when **Kerberos Authentication** is enabled: |image7| | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Password | Password of the Manager administrator | + | | | + | | The following requirements must be met: | + | | | + | | - Must contain 8 to 26 characters. | + | | - Must contain at least four of the following: | + | | | + | | - Lowercase letters | + | | - Uppercase letters | + | | - Digits | + | | - Have at least one of the following special characters: !?,.: -_{} [ ]@ $% ^ + = / | + | | | + | | - Cannot be the same as the username or the username spelled backwards. | + | | | + | | Password Strength: The colorbar in red, orange, and green indicates weak, medium, and strong password, respectively. | + | | | + | | For versions earlier than MRS 1.7.2, this parameter needs to be configured only when **Kerberos Authentication** is enabled: |image8| | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Confirm Password | Enter the password of the Manager administrator again. 
| + | | | + | | For versions earlier than MRS 1.7.2, this parameter needs to be configured only when **Kerberos Authentication** is enabled: |image9| | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Login Mode | - Password | + | | | + | | You can log in to ECS nodes using a password. | + | | | + | | A password must meet the following requirements: | + | | | + | | #. Must be a string and 8 to 26 characters long. | + | | #. The password must contain at least four types of the following characters: uppercase letters, lowercase letters, digits, and special characters (``! ?,.: -_{} [ ]@ $% ^ + = /``). | + | | #. The password cannot be the username or the reverse username. | + | | | + | | - Key Pair | + | | | + | | Key pairs are used to log in to ECS nodes of the cluster. Select a key pair from the drop-down list. Select "I acknowledge that I have obtained private key file *SSHkey-xxx* and that without this file I will not be able to log in to my ECS." If you have never created a key pair, click **View Key Pair** to create or import a key pair. And then, obtain a private key file. | + | | | + | | A key pair, also called an SSH key, consists of a public key and a private key. You can create an SSH key and download the private key for authenticating remote login. For security, a private key can only be downloaded once. Keep it secure. | + | | | + | | Use an SSH key in either of the following two methods: | + | | | + | | #. Creating an SSH key: After you create an SSH key, a public key and a private key are generated. The public key is stored in the system, and the private key is stored in the local ECS. When you log in to an ECS, the public and private keys are used for authentication. | + | | #. Importing an SSH key: If you have obtained the public and private keys, import the public key into the system. When you log in to an ECS, the public and private keys are used for authentication. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Secure Communications | MRS clusters provision, manage, and use big data components through the management console. Big data components are deployed in a user's VPC. If the MRS management console needs to directly access big data components deployed in the user's VPC, you need to enable the corresponding security group rules after you have obtained user authorization. This authorization process is called secure communications. For details, see :ref:`Communication Security Authorization `. | + | | | + | | If the secure communications function is not enabled, MRS clusters cannot be created. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Failed to Create a Cluster +-------------------------- + +If a cluster fails to be created, the failed task will be managed on the **Manage Failed Tasks** page. Choose **Clusters** > **Active Clusters**. Click |image10| shown in :ref:`Figure 1 ` to go to the **Manage Failed Tasks** page. In the **Status** column, hover the cursor over |image11| to view the failure cause. You can delete failed tasks by referring to :ref:`Viewing Failed MRS Tasks `. + +.. _mrs_01_0513__f4c81759110fa400ea01c1805b7817d30: + +.. figure:: /_static/images/en-us_image_0000001296217700.png + :alt: **Figure 1** Failed task management + + **Figure 1** Failed task management + +:ref:`Table 5 ` lists the error codes of MRS cluster creation failures. + +.. _mrs_01_0513__ta32348b05460406dbdc7db739e0fbb38: + +.. table:: **Table 5** Error codes + + +------------+------------------------------------------------------------------------------------------------+ + | Error Code | Description | + +============+================================================================================================+ + | MRS.101 | Insufficient quota to meet your request. Contact customer service to increase the quota. | + +------------+------------------------------------------------------------------------------------------------+ + | MRS.102 | The token cannot be null or invalid. Try again later or contact customer service. | + +------------+------------------------------------------------------------------------------------------------+ + | MRS.103 | Invalid request. Try again later or contact customer service. | + +------------+------------------------------------------------------------------------------------------------+ + | MRS.104 | Insufficient resources. Try again later or contact customer service. | + +------------+------------------------------------------------------------------------------------------------+ + | MRS.105 | Insufficient IP addresses in the existing subnet. Try again later or contact customer service. | + +------------+------------------------------------------------------------------------------------------------+ + | MRS.201 | Failed due to an ECS error. Try again later or contact customer service. | + +------------+------------------------------------------------------------------------------------------------+ + | MRS.202 | Failed due to an IAM error. Try again later or contact customer service. | + +------------+------------------------------------------------------------------------------------------------+ + | MRS.203 | Failed due to a VPC error. Try again later or contact customer service. | + +------------+------------------------------------------------------------------------------------------------+ + | MRS.400 | MRS system error. Try again later or contact customer service. | + +------------+------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001296057872.png +.. 
|image2| image:: /_static/images/en-us_image_0000001349137577.png +.. |image3| image:: /_static/images/en-us_image_0000001296057872.png +.. |image4| image:: /_static/images/en-us_image_0000001296058072.png +.. |image5| image:: /_static/images/en-us_image_0000001295898232.png +.. |image6| image:: /_static/images/en-us_image_0000001349057905.png +.. |image7| image:: /_static/images/en-us_image_0000001349057889.png +.. |image8| image:: /_static/images/en-us_image_0000001349137781.png +.. |image9| image:: /_static/images/en-us_image_0000001349257369.png +.. |image10| image:: /_static/images/en-us_image_0000001349057753.jpg +.. |image11| image:: /_static/images/en-us_image_0000001349057753.jpg diff --git a/umn/source/configuring_a_cluster/creating_a_custom_topology_cluster.rst b/umn/source/configuring_a_cluster/creating_a_custom_topology_cluster.rst new file mode 100644 index 0000000..5211109 --- /dev/null +++ b/umn/source/configuring_a_cluster/creating_a_custom_topology_cluster.rst @@ -0,0 +1,294 @@ +:original_name: mrs_01_0121.html + +.. _mrs_01_0121: + +Creating a Custom Topology Cluster +================================== + +The analysis cluster, streaming cluster, and hybrid cluster provided by MRS use fixed templates to deploy cluster processes. Therefore, you cannot customize service processes on management nodes and control nodes. + +A custom cluster provides the following functions: + +- Separated deployment of the management and control roles: The management role and control role are deployed on different Master nodes. +- Co-deployment of the management and control roles: The management and control roles are co-deployed on the Master node. +- ZooKeeper is deployed on an independent node to improve reliability. +- Components are deployed separately to avoid resource contention. + +Roles in an MRS cluster: + +- Management Node (MN): the node where Manager, the management system of the MRS cluster, is installed. Manager provides a unified access entry and centrally manages the nodes and services deployed in the cluster. +- Control Node (CN): controls and monitors how data nodes store data, receive data, and send process status, and provides other public functions. Control nodes of MRS include HMaster, HiveServer, ResourceManager, NameNode, JournalNode, and SlapdServer. +- Data Node (DN): executes the instructions sent by the management node, reports task status, stores data, and provides other public functions. Data nodes of MRS include DataNode, RegionServer, and NodeManager. + +Customizing a Cluster +--------------------- + +#. Log in to the MRS console. + +#. Click **Create Cluster**. The page for creating a cluster is displayed. + +#. Click the **Custom Config** tab. + +#. Configure basic cluster information. For details about the parameters, see :ref:`Software Configurations `. + + - **Region**: Retain the default value. + - **Cluster Name**: You can use the default name. However, you are advised to include a project name abbreviation or date so that the name is easy to remember and distinguish, for example, **mrs_20180321**. + - **Cluster Version**: Currently, only MRS 3.x is supported. + +#. Click **Next**. Configure hardware information. + + - **AZ**: Retain the default value. + - **VPC**: Retain the default value. If there is no available VPC, click **View VPC** to access the VPC console and create a new VPC. + - **Subnet**: Retain the default value. + - **Security Group**: Select **Auto create**. + - **EIP**: Select **Bind later**.
+ - **Enterprise Project**: Retain the default value. + - **Common Node**: For details, see :ref:`Custom Cluster Template Description `. + - **Instance Specifications**: Click |image1| to configure the instance specifications, system disk and data disk storage types, and storage space. + - **Instance Count**: Adjust the number of cluster instances based on the service volume. For details, see :ref:`Table 2 `. + - **Topology Adjustment**: If the deployment mode in the **Common Node** does not meet the requirements, or you need to manually install some instances that are not deployed by default, set **Topology Adjustment** to **Enable** and adjust the instance deployment mode based on service requirements. For details, see :ref:`Topology Adjustment for a Custom Cluster `. + +#. Click **Next** and set advanced options. + + For details about the parameters, see :ref:`(Optional) Advanced Configuration `. + +#. Click **Create Now**. + + If Kerberos authentication is enabled for a cluster, check whether Kerberos authentication is required. If yes, click **Continue**. If no, click **Back** to disable Kerberos authentication and then create a cluster. + +#. Click **Back to Cluster List** to view the cluster status. + + It takes some time to create a cluster. The initial status of the cluster is **Starting**. After the cluster has been created successfully, the cluster status becomes **Running**. + +.. _mrs_01_0121__section126281336123311: + +Custom Cluster Template Description +----------------------------------- + +.. table:: **Table 1** Common templates for custom clusters + + +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Common Node           | Description                                                                                                                                                                                                                                                                                             | Node Range                                                                                                                                                 | + +=======================+=========================================================================================================================================================================================================================================================================================================+==========================================================================================================================================================+ + | Compact               | The management role and control role are deployed on the Master node, and data instances are deployed in the same node group. This deployment mode applies to scenarios where the number of control nodes is less than 100, reducing costs.                                                            | - The number of Master nodes is greater than or equal to 3 and less than or equal to 11.                                                                  | + |                       | - The total number of node groups is less than or equal to 10, and the total number of nodes in non-Master node groups is less than or equal to 10,000.
| + +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | OMS-separate | The management role and control role are deployed on different Master nodes, and data instances are deployed in the same node group. This deployment mode is applicable to a cluster with 100 to 500 nodes and delivers better performance in high-concurrency load scenarios. | - The number of Master nodes is greater than or equal to 5 and less than or equal to 11. | + | | | - The total number of node groups is less than or equal to 10, and the total number of nodes in non-Master node groups is less than or equal to 10,000. | + +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Full-size | The management role and control role are deployed on different Master nodes, and data instances are deployed in different node groups. This deployment mode is applicable to a cluster with more than 500 nodes. Components can be deployed separately, which can be used for a larger cluster scale. | - The number of Master nodes is greater than or equal to 9 and less than or equal to 11. | + | | | - The total number of node groups is less than or equal to 10, and the total number of nodes in non-Master node groups is less than or equal to 10,000. | + +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _mrs_01_0121__net: + +.. 
table:: **Table 2** Node deployment scheme of a customized MRS cluster + + +-----------------------------------------------------------------------------------------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Node Deployment Principle | | Applicable Scenario | Networking Rule | + +===========================================================================================================+==========================+==============================================================================================================================================================================================================================================================================================================================================+=========================================================================================================================================================================================================================================================================================================+ + | Management nodes, control nodes, and data nodes are deployed separately. | MN x 2 + CN x 9 + DN x n | (Recommended) This scheme is used when the number of data nodes is 500-2000. | - If the number of nodes in a cluster exceeds 200, the nodes are distributed to different subnets and the subnets are interconnected with each other in Layer 3 using core switches. Each subnet can contain a maximum of 200 nodes and the allocation of nodes to different subnets must be balanced. | + | | | | - If the number of nodes is less than 200, the nodes in the cluster are deployed in the same subnet and the nodes are interconnected with each other in Layer 2 using aggregation switches. | + | (This scheme requires at least eight nodes.) | | | | + +-----------------------------------------------------------------------------------------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | MN x 2 + CN x 5 + DN x n | (Recommended) This scheme is used when the number of data nodes is 100-500. 
| | + +-----------------------------------------------------------------------------------------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | MN x 2 + CN x 3 + DN x n | (Recommended) This scheme is used when the number of data nodes is 30-100. | | + +-----------------------------------------------------------------------------------------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | The management nodes and control nodes are deployed together, and the data nodes are deployed separately. | (MN+CN) x 3 + DN x n | (Recommended) This scheme is used when the number of data nodes is 3-30. | Nodes in the cluster are deployed in the same subnet and are interconnected with each other at Layer 2 through aggregation switches. | + +-----------------------------------------------------------------------------------------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | The management nodes, control nodes, and data nodes are deployed together. | | - This scheme is applicable to a cluster having fewer than 6 nodes. | Nodes in the cluster are deployed in the same subnet and are interconnected with each other at Layer 2 through aggregation switches. | + | | | - This scheme requires at least three nodes. | | + | | | | | + | | | .. note:: | | + | | | | | + | | | This template is not recommended in the production environment or commercial environment. | | + | | | | | + | | | - If management, control, and data nodes are co-deployed, cluster performance and reliability are greatly affected. | | + | | | - If the number of nodes meet the requirements, deploy data nodes separately. 
| | + | | | - If the number of nodes is insufficient to support separately deployed data nodes, use the dual-plane networking mode for this scenario. The traffic of the management network is isolated from that of the service network to prevent excessive data volumes on the service plane, ensuring correct delivery of management operations. | | + +-----------------------------------------------------------------------------------------------------------+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _mrs_01_0121__section1948791193417: + +Topology Adjustment for a Custom Cluster +---------------------------------------- + +.. table:: **Table 3** Topology adjustment + + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Dependency | Role | Role Deployment Suggestions | Description | + +=============+==========================+==========================+===============================================================================+=========================================================================================================================================+ + | OMSServer | ``-`` | OMSServer | This role can be deployed it on the Master node and cannot be modified. | ``-`` | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | ClickHouse | Depends on ZooKeeper. | CHS (ClickHouseServer) | This role can be deployed on all nodes. | A non-Master node group with this role assigned is considered as a Core node. | + | | | | | | + | | | | Number of role instances to be deployed: an even number ranging from 2 to 256 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | CLB (ClickHouseBalancer) | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 2 to 256 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | ZooKeeper | ``-`` | QP(quorumpeer) | This role can be deployed on the Master node only. 
| ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 3 to 9, with the step size of 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Hadoop | Depends on ZooKeeper. | NN(NameNode) | This role can be deployed on the Master node only. | The NameNode and ZKFC processes are deployed on the same server for cluster HA. | + | | | | | | + | | | | Number of role instances to be deployed: 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | HFS (HttpFS) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 0 to 10 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | JN(JournalNode) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 3 to 60, with the step size of 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | DN(DataNode) | This role can be deployed on all nodes. | A non-Master node group with this role assigned is considered as a Core node. | + | | | | | | + | | | | Number of role instances to be deployed: 3 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | RM(ResourceManager) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | NM(NodeManager) | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 3 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | JHS(JobHistoryServer) | This role can be deployed on the Master node only. 
| ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 1 to 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | TLS(TimelineServer) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 0 to 1 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Presto | Depends on Hive. | PCD(Coordinator) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | PWK(Worker) | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 1 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Spark2x | - Depends on Hadoop. | JS2X(JDBCServer2x) | This role can be deployed on the Master node only. | ``-`` | + | | - Depends on Hive. | | | | + | | - Depends on ZooKeeper. | | Number of role instances to be deployed: 2 to 10 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | JH2X(JobHistory2x) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | SR2X(SparkResource2x) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 2 to 50 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | IS2X(IndexServer2x) | (Optional) This role can be deployed on the Master node only. 
| ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 0 to 2, with the step size of 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | HBase | Depends on Hadoop. | HM(HMaster) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | TS(ThriftServer) | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 0 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | RT(RESTServer) | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 0 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | RS(RegionServer) | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 3 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | TS1(Thrift1Server) | This role can be deployed on all nodes. | If the Hue service is installed in a cluster and HBase needs to be used on the Hue web UI, install this instance for the HBase service. | + | | | | | | + | | | | Number of role instances to be deployed: 0 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Hive | - Depends on Hadoop. | MS(MetaStore) | This role can be deployed on the Master node only. | ``-`` | + | | - Depends on DBService. | | | | + | | | | Number of role instances to be deployed: 2 to 10 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | WH (WebHCat) | This role can be deployed on the Master node only. 
| ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 1 to 10 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | HS(HiveServer) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 2 to 80 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Hue | Depends on DBService. | H(Hue) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Sqoop | Depends on Hadoop. | SC(SqoopClient) | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 1 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Kafka | Depends on ZooKeeper. | B(Broker) | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 3 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Flume | ``-`` | MS(MonitorServer) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 1 to 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | F(Flume) | This role can be deployed on all nodes. | A non-Master node group with this role assigned is considered as a Core node. | + | | | | | | + | | | | Number of role instances to be deployed: 1 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Tez | - Depends on Hadoop. | TUI(TezUI) | This role can be deployed on the Master node only. | ``-`` | + | | - Depends on DBService. | | | | + | | - Depends on ZooKeeper. 
| | Number of role instances to be deployed: 1 to 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Flink | - Depends on ZooKeeper. | FR(FlinkResource) | This role can be deployed on all nodes. | ``-`` | + | | - Depends on Hadoop. | | | | + | | | | Number of role instances to be deployed: 1 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | FS(FlinkServer) | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 0 to 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Oozie | - Depends on Hadoop. | O(oozie) | This role can be deployed on the Master node only. | ``-`` | + | | - Depends on DBService. | | | | + | | - Depends on ZooKeeper. | | Number of role instances to be deployed: 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Impala | - Depends on Hadoop. | StateStore | This role can be deployed on the Master node only. | ``-`` | + | | - Depends on Hive. | | | | + | | - Depends on DBService. | | Number of role instances to be deployed: 1 | | + | | - Depends on ZooKeeper. | | | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | Catalog | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 1 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | Impalad | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 1 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Kudu | ``-`` | KuduMaster | This role can be deployed on the Master node only. 
| ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 3 or 5 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | KuduTserver | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 3 to 10,000 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Ranger | Depends on DBService. | RA(RangerAdmin) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 1 to 2 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | USC(UserSync) | This role can be deployed on the Master node only. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 1 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | | | TSC (TagSync) | This role can be deployed on all nodes. | ``-`` | + | | | | | | + | | | | Number of role instances to be deployed: 0 to 1 | | + +-------------+--------------------------+--------------------------+-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001296217772.png diff --git a/umn/source/configuring_a_cluster/index.rst b/umn/source/configuring_a_cluster/index.rst new file mode 100644 index 0000000..4bda504 --- /dev/null +++ b/umn/source/configuring_a_cluster/index.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_0030.html + +.. _mrs_01_0030: + +Configuring a Cluster +===================== + +- :ref:`Methods of Creating MRS Clusters ` +- :ref:`Quick Creation of a Cluster ` +- :ref:`Creating a Custom Cluster ` +- :ref:`Creating a Custom Topology Cluster ` +- :ref:`Adding a Tag to a Cluster ` +- :ref:`Communication Security Authorization ` +- :ref:`Configuring an Auto Scaling Rule ` +- :ref:`Managing Data Connections ` +- :ref:`Installing Third-Party Software Using Bootstrap Actions ` +- :ref:`Viewing Failed MRS Tasks ` +- :ref:`Viewing Information of a Historical Cluster ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + methods_of_creating_mrs_clusters + quick_creation_of_a_cluster/index + creating_a_custom_cluster + creating_a_custom_topology_cluster + adding_a_tag_to_a_cluster + communication_security_authorization + configuring_an_auto_scaling_rule + managing_data_connections/index + installing_third-party_software_using_bootstrap_actions + viewing_failed_mrs_tasks + viewing_information_of_a_historical_cluster diff --git a/umn/source/configuring_a_cluster/installing_third-party_software_using_bootstrap_actions.rst b/umn/source/configuring_a_cluster/installing_third-party_software_using_bootstrap_actions.rst new file mode 100644 index 0000000..d9c2c71 --- /dev/null +++ b/umn/source/configuring_a_cluster/installing_third-party_software_using_bootstrap_actions.rst @@ -0,0 +1,113 @@ +:original_name: mrs_01_0413.html + +.. _mrs_01_0413: + +Installing Third-Party Software Using Bootstrap Actions +======================================================= + +This operation applies to MRS 3.\ *x* or earlier clusters. + +In MRS 3.\ *x*, bootstrap actions cannot be added during cluster creation. + +Prerequisites +------------- + +The bootstrap action script has been prepared by referring to :ref:`Preparing the Bootstrap Action Script `. + +Adding a Bootstrap Action When Creating a Cluster +------------------------------------------------- + +#. Log in to the MRS management console. + +#. Click **Create Cluster**. The page for creating a cluster is displayed. + +#. Click the **Custom Config** tab. + +#. Configure the cluster software and hardware by referring to :ref:`Creating a Custom Cluster `. + +#. On the **Set Advanced Options** tab page, click **Add** in the **Bootstrap Action** area. + + .. table:: **Table 1** Parameters + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+====================================================================================================================================================================================================+ + | Name | Name of a bootstrap action script | + | | | + | | The value can contain only digits, letters, spaces, hyphens (-), and underscores (_) and must not start with a space. | + | | | + | | The value can contain 1 to 64 characters. | + | | | + | | .. note:: | + | | | + | | A name must be unique in the same cluster. You can set the same name for different clusters. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Script Path | Script path. The value can be an OBS file system path or a local VM path. | + | | | + | | - An OBS file system path must start with **s3a://** and end with **.sh**, for example, **s3a://mrs-samples/**\ *xxx*\ **.sh**. | + | | - A local VM path must start with a slash (/) and end with **.sh**. | + | | | + | | .. note:: | + | | | + | | A path must be unique in the same cluster, but can be the same for different clusters. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Bootstrap action script parameters | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Execution Node | Select a type of the node where the bootstrap action script is executed. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Executed | Select the time when the bootstrap action script is executed. | + | | | + | | - Before initial component start | + | | - After initial component start | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Action upon Failure | Whether to continue to execute subsequent scripts and create a cluster after the script fails to be executed. | + | | | + | | .. note:: | + | | | + | | You are advised to set this parameter to **Continue** in the debugging phase so that the cluster can continue to be installed and started no matter whether the bootstrap action is successful. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**. + + After the bootstrap action is added, you can edit, clone, or delete it in the **Operation** column. + +Adding an Automation Script on the Auto Scaling Page +---------------------------------------------------- + +#. Log in to the MRS management console. + +#. Choose **Clusters** > **Active Clusters**, select a running cluster, and click its name to go to its details page. + +#. Click the **Nodes** tab. On this tab page, click **Auto Scaling** in the **Operation** column of the task node group. The **Auto Scaling** page is displayed. + + If no task nodes are available, click **Configure Task Node** to add a task node and then perform this step. + + .. note:: + + **Configure Task Node** is available only for MRS 3.\ *x* or later analysis, streaming, and hybrid clusters. + +#. Configure a resource plan. + + Configuration procedure: + + a. On the **Auto Scaling** page, enable **Auto Scaling**. + + b. For example, the **Default Range** of node quantity is set to **2-2**, indicating that the number of task nodes is fixed to 2 except the time range specified in the resource plan. + + c. Click **Configure Node Range for Specific Time Range** under **Default Range**. + + d. Configure **Time Range** and **Node Range**. For example, set **Time Range** to **07:00-13:00**, and **Node Range** to **5-5**. This indicates that the number of task nodes is fixed to 5 in the time range specified in the resource plan. For details about the parameters, see :ref:`Table 3 `. + + You can click **Configure Node Range for Specific Time Range** to configure multiple resource plans. + +#. (Optional) Configure automation scripts. + + a. 
Set **Advanced Settings** to **Configure**. + b. Click **Create**. The **Automation Script** page is displayed. + + c. Set **Name**, **Script Path**, **Execution Node**, **Parameter**, **Executed**, and **Action upon Failure**. For details about the parameters, see :ref:`Table 4 `. + d. Click **OK** to save the automation script configurations. + +#. Select **I agree to authorize MRS to scale out or scale in nodes based on the above rule**. + +#. Click **OK**. diff --git a/umn/source/configuring_a_cluster/managing_data_connections/configuring_a_hive_data_connection.rst b/umn/source/configuring_a_cluster/managing_data_connections/configuring_a_hive_data_connection.rst new file mode 100644 index 0000000..7d0b333 --- /dev/null +++ b/umn/source/configuring_a_cluster/managing_data_connections/configuring_a_hive_data_connection.rst @@ -0,0 +1,56 @@ +:original_name: mrs_01_24487.html + +.. _mrs_01_24487: + +Configuring a Hive Data Connection +================================== + +This section describes how to switch the Hive metadata of an active cluster to the metadata stored in a local database or RDS database after you create a cluster. This operation enables multiple MRS clusters to share the same metadata, and the metadata will not be deleted when the clusters are deleted. In this way, Hive metadata migration is not required during cluster migration. + +.. note:: + + - When Hive metadata is switched between different clusters, MRS synchronizes only the permissions in the metadata database of the Hive component. The permission model on MRS is maintained on MRS Manager. Therefore, when Hive metadata is switched between clusters, the permissions of users or user groups cannot be automatically synchronized to MRS Manager of another cluster. + - For clusters whose version is earlier than MRS 3.\ *x*, if the selected data connection is **RDS MySQL database**, ensure that the database user is **root**. If the user is not **root**, create a user and grant permissions to the user by referring to :ref:`Performing Operations Before Data Connection `. + - For clusters whose version is MRS 3.\ *x* or later, if the selected data connection is **RDS MySQL database**, the database user cannot be user **root**. In this case, create a user and grant permissions to the user by following the instructions provided in :ref:`Performing Operations Before Data Connection `. + + +Configuring a Hive Data Connection +---------------------------------- + +This function is not supported in MRS 3.0.5. + +#. Log in to the MRS console. In the navigation pane on the left, choose **Clusters** > **Active Clusters**. +#. Click the name of a cluster to go to the cluster details page. +#. On the **Dashboard** tab page, click **Manage** next to **Data Connection**. +#. On the **Data Connection** dialog box, the data connections associated with the cluster are displayed. You can click **Edit** or **Delete** to edit or delete the data connections. +#. If there is no associated data connection on the **Data Connection** dialog box, click **Configure Data Connection** to add a connection. + + .. note:: + + Only one data connection can be configured for a module type. For example, after a data connection is configured for Hive metadata, no other data connection can be configured for it. If no module type is available, the **Configure Data Connection** button is unavailable. + + .. 
table:: **Table 1** Configuring a Hive data connection + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+======================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Component | Hive | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module Type | Hive metadata | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Connection Type | - RDS MySQL database | + | | - Local database | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Instance | This parameter is valid only when **Data Connection Type** is set to **RDS PostgreSQL database** or **RDS MySQL database**. Select the name of the connection between the MRS cluster and the RDS database. This instance must be created before being referenced here. You can click **Create Data Connection** to create a data connection. For details, see :ref:`Creating a Data Connection `. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **Test** to test connectivity of the data connection. +#. After the data connection is successful, click **OK**. + + .. 
note:: + + - After Hive metadata is configured, restart Hive. Hive will create necessary database tables in the specified database. (If tables already exist, they will not be created.) + - Before restarting the Hive service, ensure that the driver package has been installed on all nodes where Metastore instances are located. + + - Postgres: Use the open source Postgres driver package to replace the existing one of the cluster. Upload the Postgres driver package **postgresql-42.2.5.jar** to the *${BIGDATA_HOME}*\ **/third_lib/Hive** directory on all MetaStore instance nodes. To download the open-source driver package, visit https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.5/. + - MySQL: Go to the MySQL official website (https://www.mysql.com/). Choose **DOWNLOADS** and click **MySQL Community (GPL) Downloads**. On the displayed page, click **Connector/J** to download the driver package of the corresponding version and upload the driver package to the **/opt/Bigdata/FusionInsight_HD_*/install/FusionInsight-Hive-*/hive-*/lib/** directory on all RDSMetastore nodes. diff --git a/umn/source/configuring_a_cluster/managing_data_connections/configuring_data_connections.rst b/umn/source/configuring_a_cluster/managing_data_connections/configuring_data_connections.rst new file mode 100644 index 0000000..314f739 --- /dev/null +++ b/umn/source/configuring_a_cluster/managing_data_connections/configuring_data_connections.rst @@ -0,0 +1,162 @@ +:original_name: mrs_01_0633.html + +.. _mrs_01_0633: + +Configuring Data Connections +============================ + +MRS data connections are used to manage external source connections used by components in a cluster. For example, if Hive metadata uses an external relational database, a data connection can be used to associate the external relational database with the Hive component. + +- **Local**: Metadata is stored in the local GaussDB of a cluster. When the cluster is deleted, the metadata is also deleted. To retain the metadata, manually back up the metadata in the database in advance. +- **Data Connection**: Metadata is stored in the associated PostgreSQL or MySQL database of the RDS service in the same VPC and subnet as the current cluster. When the cluster is terminated, the metadata is not deleted. Multiple MRS clusters can share the metadata. + +.. note:: + + When Hive metadata is switched between different clusters, MRS synchronizes only the permissions in the metadata database of the Hive component. The permission model on MRS is maintained on MRS Manager. Therefore, when Hive metadata is switched between clusters, the permissions of users or user groups cannot be automatically synchronized to MRS Manager of another cluster. + +.. _mrs_01_0633__section311713549458: + +Performing Operations Before Data Connection +-------------------------------------------- + +#. Log in to the RDS console. + +#. Click the **Instance Management** tab and click the name of the RDS DB instance used by the MRS data connection. + +#. Click **Log In** in the upper right corner to log in to the instance as user **root**. + +#. On the home page of the instance, click **Create Database** to create a database. + +#. .. _mrs_01_0633__li21521634318: + + On the top of the page, choose **Account Management > User Management**. + + .. note:: + + If the selected data connection is **RDS MySQL database**, ensure that the database user is user **root**. If the user is not **root**, perform :ref:`5 ` to :ref:`7 `. + +#. Click **Create User** to create a non-root user. + +#. 
.. _mrs_01_0633__li18495111377: + + On the top of the page, choose **SQL Operations > SQL Query**, switch to the target database by database name, and run the following SQL statements to grant permissions to the database user. In the following statements, *${db_name}* and *${db_user}* indicate the name of the database to be connected to MRS and the name of the new user, respectively. + + .. code-block:: + + grant SELECT, INSERT on mysql.* to '${db_user}'@'%' with grant option; + grant all privileges on ${db_name}.* to '${db_user}'@'%' with grant option; + grant reload on *.* to '${db_user}'@'%' with grant option; + flush privileges; + +#. Create a data connection by referring to :ref:`Creating a Data Connection `. + +.. _mrs_01_0633__section813712431913: + +Creating a Data Connection +-------------------------- + +#. Log in to the MRS management console, and choose **Data Connections** in the left navigation pane. + +#. Click **Create Data Connection**. + +#. Set parameters according to :ref:`Table 1 `. + + .. _mrs_01_0633__table1146019253265: + + .. table:: **Table 1** Data connection parameters + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+======================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Type | Type of an external source connection. | + | | | + | | - RDS for MySQL database. Clusters of that supports Hive or Ranger can connect to this type of database. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | Name of a data connection. 
| + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | RDS Instance | RDS database instance. This instance must be created in RDS before being referenced here, and the database must have been created. For details, see :ref:`Performing Operations Before Data Connection `. Click **View RDS Instance** to view the created instances. | + | | | + | | .. note:: | + | | | + | | - To ensure network communications between the cluster and the PostgreSQL database, you are advised to create the instance in the same VPC and subnet as the cluster. | + | | - The inbound rule of the security group of the RDS instance must allow access of the instance to port 3306. To configure that, click the instance name on the RDS console to go to the instance management page. In **Connection Information** area, click the name of **Security Group**. On the page that is displayed, click the **Inbound Rules** tab, and click **Add Rule**. On the displayed dialog box, in **Protocol & Port** area, select **TCP** and enter port number **3306**. In **Source** area, enter the IP address of all nodes where the MetaStore instance of Hive resides. | + | | - Currently, MRS supports **PostgreSQL9.5/PostgreSQL9.6** on RDS. | + | | - Currently, MRS supports only **MySQL 5.7.**\ *x* on RDS. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Database | Name of the database to be connected to. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Username | Username for logging in to the database to be connected. 
| + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Password | Password for logging in to the database to be connected. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + If the selected data connection is an **RDS MySQL** database, ensure that the database user is a **root** user. If the user is not **root**, perform operations by referring to :ref:`Performing Operations Before Data Connection `. + +#. Click **OK**. + +Editing a Data Connection +------------------------- + +#. Log in to the MRS management console, and choose **Data Connections** in the left navigation pane. + +#. In the **Operation** column of the data connection list, click **Edit** in the row where the data connection to be edited is located. + +#. Modify parameters according to :ref:`Table 1 `. + + If the selected data connection has been associated with a cluster, the configuration changes will be synchronized to the cluster. + +Deleting a Data Connection +-------------------------- + +#. Log in to the MRS management console, and choose **Data Connections** in the left navigation pane. + +#. In the **Operation** column of the data connection list, click **Delete** in the row where the data connection to be deleted is located. + + If the selected data connection has been associated with a cluster, the deletion does not affect the cluster. + +Configuring a data connection during cluster creation +----------------------------------------------------- + +#. Log in to the MRS console. + +#. Click **Create Cluster**. The **Create Cluster** page is displayed. + +#. Click the **Custom Config** tab. + +#. In the software configuration area, set **Metadata** by referring to :ref:`Table 2 `. For other parameters, see :ref:`Creating a Custom Cluster ` for configuration and cluster creation. + + .. _mrs_01_0633__table151701643153311: + + .. 
table:: **Table 2** Data connection parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Metadata | Whether to use external data sources to store metadata. | + | | | + | | - **Local**: Metadata is stored in the local cluster. | + | | - **Data connections**: Metadata of external data sources is used. If the cluster is abnormal or deleted, metadata is not affected. This mode applies to scenarios where storage and compute are decoupled. | + | | | + | | Clusters that support the Hive or Ranger component support this function. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | This parameter is available only when **Data connections** is selected for **Metadata**. It indicates the name of the component for which an external data source can be configured. | + | | | + | | - Hive | + | | - Ranger | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Connection Type | This parameter is available only when **Data connections** is selected for **Metadata**. It indicates the type of an external data source. 
| + | | | + | | - Hive supports the following data connection types: | + | | | + | | - RDS MySQL database | + | | - Local database | + | | | + | | - Ranger supports the following data connection types: | + | | | + | | - RDS MySQL database | + | | - Local database | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Connection Instance | This parameter is valid only when **Data Connection Type** is set to **RDS PostgreSQL database** or **RDS MySQL database**. This parameter indicates the name of the connection between the MRS cluster and the RDS database. This instance must be created before being referenced here. You can click **Create Data Connection** to create a data connection. For details, see :ref:`Performing Operations Before Data Connection ` and :ref:`Creating a Data Connection `. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/configuring_a_cluster/managing_data_connections/configuring_ranger_data_connections.rst b/umn/source/configuring_a_cluster/managing_data_connections/configuring_ranger_data_connections.rst new file mode 100644 index 0000000..830b221 --- /dev/null +++ b/umn/source/configuring_a_cluster/managing_data_connections/configuring_ranger_data_connections.rst @@ -0,0 +1,145 @@ +:original_name: mrs_01_24051.html + +.. _mrs_01_24051: + +Configuring Ranger Data Connections +=================================== + +Switch the Ranger metadata of the existing cluster to the metadata stored in the RDS database. This operation enables multiple MRS clusters to share the same metadata, and the metadata will not be deleted when the clusters are deleted. In this way, Ranger metadata migration is not required during cluster migration. + +Prerequisites +------------- + +You have created an RDS MySQL database instance. For details, see :ref:`Creating a Data Connection `. + +.. note:: + + - For versions earlier than MRS 3.x, if the selected data connection is an **RDS MySQL database**, ensure that the database user is a **root** user. If the user is not **root**, create a user and grant permissions to the user by referring to :ref:`Performing Operations Before Data Connection `. + - In MRS 3.x or later, if the selected data connection is **RDS MySQL database**, the database user cannot be user **root**. In this case, create a user and grant permissions to the user by following the instructions provided in :ref:`Performing Operations Before Data Connection `. 
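For reference, the statements below sketch the grants described in :ref:`Performing Operations Before Data Connection ` as they would be applied to the Ranger metadata database. They mirror the Hive example earlier in this document; *${db_name}* and *${db_user}* are placeholders for the Ranger metadata database and the non-root user created on the RDS console, and the exact privilege set should be verified against your MRS version.

.. code-block::

   -- Placeholders: ${db_name} is the Ranger metadata database created on the RDS console;
   -- ${db_user} is the non-root database user created for the data connection.
   grant SELECT, INSERT on mysql.* to '${db_user}'@'%' with grant option;
   grant all privileges on ${db_name}.* to '${db_user}'@'%' with grant option;
   grant reload on *.* to '${db_user}'@'%' with grant option;
   flush privileges;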
+ +Preparing for MySQL Database Ranger Metadata Configuration +---------------------------------------------------------- + +This operation is required only for **MRS 3.1.0 or later**. + +#. Log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **Clusters** > **Services** > *Service name*. + + Currently, the following components in an MRS 3.1.\ *x* cluster support Ranger authentication: HDFS, HBase, Hive, Spark, Impala, Storm, and Kafka. + +#. In the upper right corner of the **Dashboard** page, click **More** and select **Disable Ranger**. If **Disable Ranger** is dimmed, Ranger authentication is disabled, as shown in :ref:`Figure 1 `. + + .. _mrs_01_24051__fig14437127109: + + .. figure:: /_static/images/en-us_image_0000001296217820.png + :alt: **Figure 1** Disabling Ranger authentication + + **Figure 1** Disabling Ranger authentication + +#. (Optional) To use an existing authentication policy, perform this step to export the authentication policy on the Ranger web page. After the Ranger metadata is switched, you can import the existing authentication policy again. The following uses Hive as an example. After the export, a policy file in JSON format is generated in a local directory. + + a. Log in to FusionInsight Manager. + + b. Choose **Cluster** > **Services** > **Ranger** to go to the Ranger service overview page. + + c. Click **RangerAdmin** in the **Basic Information** area to go to the Ranger web UI. + + The **admin** user in Ranger belongs to the **User** type. To view all management pages, click the username in the upper right corner and select **Log Out** to log out of the system. + + d. Log in to the system as user **rangeradmin** (default password: **Rangeradmin@123**) or another user who has the Ranger administrator permissions. + + e. Click the export button |image1| in the row where the Hive component is located to export the authentication policy. + + + .. figure:: /_static/images/en-us_image_0000001440726389.png + :alt: **Figure 2** Exporting authentication policies + + **Figure 2** Exporting authentication policies + + f. .. _mrs_01_24051__li1947954718720: + + Click **Export**. After the export is complete, a policy file in JSON format is generated in a local directory. + + + .. figure:: /_static/images/en-us_image_0000001348738221.png + :alt: **Figure 3** Exporting Hive authentication policies + + **Figure 3** Exporting Hive authentication policies + +Configuring a Data Connection for an MRS Cluster +------------------------------------------------ + +#. Log in to the MRS console. + +#. Click the name of the cluster to view its details. + +#. Click **Manage** on the right of **Data Connection** to go to the data connection configuration page. + +#. Click **Configure Data Connection** and set related parameters. + + - **Component Name**: Ranger + - **Module Type**: Ranger metadata + - **Connection Type**: RDS MySQL database + - **Connection Instance**: Select a created RDS MySQL DB instance. To create a new data connection, see :ref:`Creating a Data Connection `. + +#. Select **I understand the consequences of performing the scale-in operation** and click **Test**. + +#. After the test is successful, click **OK** to complete the data connection configuration. + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > **Services** > **Ranger** to go to the Ranger service overview page. + +#. Choose **More** > **Restart Service** or **More** > **Service Rolling Restart**. 
+ + If you choose **Restart Service**, services will be interrupted during the restart. If you select **Service Rolling Restart**, a rolling restart minimizes or avoids the impact on running services. + + Restarting Ranger will affect the permissions of all components controlled by Ranger and may affect the normal running of services. Therefore, restart Ranger when the cluster is idle or during off-peak hours. Until the Ranger component is restarted, its existing policies remain in effect. + + + .. figure:: /_static/images/en-us_image_0000001296058188.png + :alt: **Figure 4** Restarting a service + + **Figure 4** Restarting a service + +#. Enable Ranger authentication for the component to be authenticated. The Hive component is used as an example. + + Currently, the following components in an MRS 3.1.\ *x* cluster support Ranger authentication: HDFS, HBase, Hive, Spark, Impala, Storm, and Kafka. + + a. Log in to FusionInsight Manager and choose **Cluster** > **Services** > *Service Name*. + + b. In the upper right corner of the **Dashboard** page, click **More** and select **Enable Ranger**. + + + .. figure:: /_static/images/en-us_image_0000001295738404.png + :alt: **Figure 5** Enabling Ranger authentication + + **Figure 5** Enabling Ranger authentication + +#. Log in to the Ranger web UI and click the import button |image2| in the row of the Hive component. + + |image3| + +#. Import parameters. + + - Click **Select file** and select the authentication policy file downloaded in :ref:`3.f `. + - Select **Merge If Exist Policy**. + + + .. figure:: /_static/images/en-us_image_0000001296217824.png + :alt: **Figure 6** Importing authentication policies + + **Figure 6** Importing authentication policies + +#. Restart the component for which Ranger authentication is enabled. + + a. Log in to FusionInsight Manager. + + b. Choose **Cluster** > **Services** > **Hive** to go to the Hive service overview page. + + c. Choose **More** > **Restart Service** or **More** > **Service Rolling Restart**. + + If you choose **Restart Service**, services will be interrupted during the restart. If you select **Service Rolling Restart**, a rolling restart minimizes or avoids the impact on running services. + +.. |image1| image:: /_static/images/en-us_image_0000001296217832.png +.. |image2| image:: /_static/images/en-us_image_0000001348738213.png +.. |image3| image:: /_static/images/en-us_image_0000001440367085.png diff --git a/umn/source/configuring_a_cluster/managing_data_connections/index.rst b/umn/source/configuring_a_cluster/managing_data_connections/index.rst new file mode 100644 index 0000000..22cb289 --- /dev/null +++ b/umn/source/configuring_a_cluster/managing_data_connections/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_24050.html + +.. _mrs_01_24050: + +Managing Data Connections +========================= + +- :ref:`Configuring Data Connections ` +- :ref:`Configuring Ranger Data Connections ` +- :ref:`Configuring a Hive Data Connection ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + configuring_data_connections + configuring_ranger_data_connections + configuring_a_hive_data_connection diff --git a/umn/source/configuring_a_cluster/methods_of_creating_mrs_clusters.rst b/umn/source/configuring_a_cluster/methods_of_creating_mrs_clusters.rst new file mode 100644 index 0000000..6fe2f79 --- /dev/null +++ b/umn/source/configuring_a_cluster/methods_of_creating_mrs_clusters.rst @@ -0,0 +1,15 @@ +:original_name: mrs_01_0025.html + +..
_mrs_01_0025: + +Methods of Creating MRS Clusters +================================ + +This section describes how to create MRS clusters. + +- :ref:`Quick Creation of a Hadoop Analysis Cluster `: On the **Quick Config** tab page, you can quickly configure parameters to create Hadoop analysis clusters within a few minutes, facilitating analysis and queries of vast amounts of data. +- :ref:`Quick Creation of an HBase Analysis Cluster `: On the **Quick Config** tab page, you can quickly configure parameters to create HBase query clusters within a few minutes, facilitating storage and distributed computing of vast amounts of data. +- :ref:`Quick Creation of a Kafka Streaming Cluster `: On the **Quick Config** tab page, you can quickly configure parameters to create Kafka streaming clusters within a few minutes, facilitating streaming data ingestion as well as real-time data processing and storage. +- :ref:`Quick Creation of a ClickHouse Cluster `: You can quickly create a ClickHouse cluster. ClickHouse is a columnar database management system used for online analysis. It features the ultimate compression rate and fast query performance. +- :ref:`Quick Creation of a Real-time Analysis Cluster `: You can create a real-time analysis cluster within a few minutes to quickly collect, analyze, and query a large amount of data. +- :ref:`Creating a Custom Cluster `: On the **Custom Config** tab page, you can flexibly configure parameters to create clusters based on application scenarios, such as ECS specifications to better suit your service requirements. diff --git a/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/index.rst b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/index.rst new file mode 100644 index 0000000..90d7377 --- /dev/null +++ b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/index.rst @@ -0,0 +1,22 @@ +:original_name: mrs_01_24297.html + +.. _mrs_01_24297: + +Quick Creation of a Cluster +=========================== + +- :ref:`Quick Creation of a Hadoop Analysis Cluster ` +- :ref:`Quick Creation of an HBase Analysis Cluster ` +- :ref:`Quick Creation of a Kafka Streaming Cluster ` +- :ref:`Quick Creation of a ClickHouse Cluster ` +- :ref:`Quick Creation of a Real-time Analysis Cluster ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + quick_creation_of_a_hadoop_analysis_cluster + quick_creation_of_an_hbase_analysis_cluster + quick_creation_of_a_kafka_streaming_cluster + quick_creation_of_a_clickhouse_cluster + quick_creation_of_a_real-time_analysis_cluster diff --git a/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_clickhouse_cluster.rst b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_clickhouse_cluster.rst new file mode 100644 index 0000000..9f136fe --- /dev/null +++ b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_clickhouse_cluster.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_2354.html + +.. _mrs_01_2354: + +Quick Creation of a ClickHouse Cluster +====================================== + +This section describes how to quickly create a ClickHouse cluster. ClickHouse is a columnar database management system used for online analysis. It features the ultimate compression rate and fast query performance. It is widely used in Internet advertisement, app and web traffic analysis, telecom, finance, and IoT fields. + +The ClickHouse cluster table engine that uses Kunpeng as the CPU architecture does not support HDFS and Kafka. 
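To illustrate the kind of online analysis a ClickHouse cluster serves, the sketch below creates a minimal MergeTree table and runs a typical aggregation query after the cluster is available. The **demo** database, the **page_views** table, and its columns are hypothetical examples and are not part of the cluster creation procedure.

.. code-block::

   -- Hypothetical example: a minimal MergeTree table and a typical analytical query.
   CREATE DATABASE IF NOT EXISTS demo;

   CREATE TABLE IF NOT EXISTS demo.page_views
   (
       event_date Date,
       url        String,
       user_id    UInt64
   )
   ENGINE = MergeTree()
   ORDER BY (event_date, url);

   -- Top pages over the last seven days, with exact distinct-visitor counts.
   SELECT url, count() AS views, uniqExact(user_id) AS visitors
   FROM demo.page_views
   WHERE event_date >= today() - 7
   GROUP BY url
   ORDER BY views DESC
   LIMIT 10;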
+ + +Quick Creation of a ClickHouse Cluster +-------------------------------------- + +#. Log in to the MRS console. + +#. Click **Create Cluster**. The page for creating a cluster is displayed. + +#. Click the **Quick Config** tab. + +#. Configure basic cluster information. For details about the parameters, see :ref:`Creating a Custom Cluster `. + + - **Region**: Use the default value. + - **Cluster Name**: You can use the default name. However, you are advised to include a project name abbreviation or a date so that the name is easy to remember and distinguish, for example, **mrs_20201121**. + - **Cluster Version**: Select the latest version, which is the default value. (The components provided by a cluster vary according to the cluster version. Select a cluster version based on site requirements.) + - **Component**: Select **ClickHouse cluster**. + - **AZ**: Use the default value. + - **VPC**: Use the default value. If there is no available VPC, click **View VPC** to access the VPC console and create a new VPC. + - **Subnet**: Use the default value. + - **Enterprise Project**: Use the default value. + - **Cluster Node**: Select the number of cluster nodes and node specifications based on site requirements. For MRS 3.\ *x* or later, the memory of the master node must be greater than 64 GB. + - **Kerberos Authentication**: Select whether to enable Kerberos authentication. + - **Username**: The default value is **root/admin**. User **root** is used to remotely log in to ECSs, and user **admin** is used to access the cluster management page. + +#. Select **Enable** to enable secure communications. For details, see :ref:`Communication Security Authorization `. + +#. Click **Apply Now**. + + If Kerberos authentication is enabled for a cluster, check whether Kerberos authentication is required. If yes, click **Continue**. If no, click **Back** to disable Kerberos authentication and then create a cluster. + +#. Click **Back to Cluster List** to view the cluster status. Click **Access Cluster** to view cluster details. + + For details about cluster status during creation, see the description of the status parameters in :ref:`Table 1 `. + + It takes some time to create a cluster. The initial status of the cluster is **Starting**. After the cluster has been created successfully, the cluster status becomes **Running**. + + On the MRS management console, a maximum of 10 clusters can be concurrently created, and a maximum of 100 clusters can be managed. diff --git a/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_hadoop_analysis_cluster.rst b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_hadoop_analysis_cluster.rst new file mode 100644 index 0000000..46a4997 --- /dev/null +++ b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_hadoop_analysis_cluster.rst @@ -0,0 +1,58 @@ +:original_name: mrs_01_0512.html + +.. _mrs_01_0512: + +Quick Creation of a Hadoop Analysis Cluster +=========================================== + +This section describes how to quickly create a Hadoop analysis cluster for analyzing and querying vast amounts of data. In the open-source Hadoop ecosystem, Hadoop uses Yarn to manage cluster resources, Hive and Spark to provide offline storage and computing of large-scale distributed data, Spark Streaming and Flink to offer streaming data computing, Presto to enable interactive queries, and Tez to provide a distributed computing framework based on directed acyclic graphs (DAGs).
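As a sketch of the offline analysis such a cluster is built for, the HiveQL below defines a small ORC table and runs an aggregation query once the cluster is running. The **demo_sales** table and its columns are hypothetical and only illustrate a typical workload.

.. code-block::

   -- Hypothetical example: a simple offline aggregation in HiveQL.
   CREATE TABLE IF NOT EXISTS demo_sales (
       sale_date STRING,
       region    STRING,
       amount    DOUBLE
   )
   STORED AS ORC;

   -- Total sales per region, largest first.
   SELECT region, SUM(amount) AS total_amount
   FROM demo_sales
   GROUP BY region
   ORDER BY total_amount DESC;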
+ + +Quick Creation of a Hadoop Analysis Cluster +------------------------------------------- + +#. Log in to the MRS console. + +#. Click **Create Cluster**. The page for creating a cluster is displayed. + +#. Click the **Quick Config** tab. + +#. Configure basic cluster information. For details about the parameters, see :ref:`Creating a Custom Cluster `. + + - **Region**: Use the default value. + - **Cluster Name**: You can use the default name. However, you are advised to include a project name abbreviation or date for consolidated memory and easy distinguishing, for example, **mrs_20180321**. + - **Cluster Version**: Select the latest version, which is the default value. (The components provided by a cluster vary according to the cluster version. Select a cluster version based on site requirements.) + - **Component**: Select **Hadoop analysis cluster**. + - **AZ**: Use the default value. + - **VPC**: Use the default value. If there is no available VPC, click **View VPC** to access the VPC console and create a new VPC. + - **Subnet**: Use the default value. + - **Enterprise Project**: Use the default value. + - **Cluster Node**: Select the number of cluster nodes and node specifications based on site requirements. For MRS 3.\ *x* or later, the memory of the master node must be greater than 64 GB. + - **Cluster HA**: Use the default value. This parameter is not available in MRS 3.\ *x*. + - **Kerberos Authentication**: Select whether to enable Kerberos authentication. + - MRS 3.1.2-LTS.3 + + - **Username**: The default value is **root/admin**. User **root** is used to remotely log in to ECSs, and user **admin** is used to access the cluster management page. + - **Password**: Set a password for user **root**/**admin**. + - **Confirm Password**: Enter the password of user **root**/**admin** again. + + - Versions earlier than MRS 3.1.2-LTS.3 + + - **Username**: The default username is **admin**, which is used to log in to MRS Manager. + - **Password**: Set a password for user **admin**. + - **Confirm Password**: Enter the password of user **admin** again. + - **Key Pair**: Select a key pair from the drop-down list to log in to an ECS. Select **"I acknowledge that I have obtained private key file** *SSHkey-xxx* **and that without this file I will not be able to log in to my ECS.**" If you have never created a key pair, click **View Key Pair** to create or import a key pair. And then, obtain a private key file. + +#. Select **Enable** to enable secure communications. For details, see :ref:`Communication Security Authorization `. + +#. **Click Create Now.** + + If Kerberos authentication is enabled for a cluster, check whether Kerberos authentication is required. If yes, click **Continue**. If no, click **Back** to disable Kerberos authentication and then create a cluster. + +#. Click **Back to Cluster List** to view the cluster status. Click **Access Cluster** to view cluster details. + + For details about cluster status during creation, see the description of the status parameters in :ref:`Table 1 `. + + It takes some time to create a cluster. The initial status of the cluster is **Starting**. After the cluster has been created successfully, the cluster status becomes **Running**. + + On the MRS management console, a maximum of 10 clusters can be concurrently created, and a maximum of 100 clusters can be managed. 
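+
+Once the cluster is in the **Running** state, you can optionally run a small smoke test from a Master node. This is a hedged sketch, not part of the standard procedure: the client path follows the MRS 3.x convention used elsewhere in this document, **words.txt** is any small local text file, and the examples JAR location is an assumption that may differ in your cluster.
+
+.. code-block::
+
+   source /opt/Bigdata/client/bigdata_env
+   # kinit <MRS_cluster_user>    # only for clusters with Kerberos authentication enabled
+   hdfs dfs -mkdir -p /tmp/wc_input
+   hdfs dfs -put words.txt /tmp/wc_input/
+   # Run the bundled WordCount example on Yarn (JAR path is an assumption).
+   yarn jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
+       wordcount /tmp/wc_input /tmp/wc_output
+   hdfs dfs -cat /tmp/wc_output/part-r-00000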
diff --git a/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_kafka_streaming_cluster.rst b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_kafka_streaming_cluster.rst new file mode 100644 index 0000000..4ed6d1f --- /dev/null +++ b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_kafka_streaming_cluster.rst @@ -0,0 +1,50 @@ +:original_name: mrs_01_0497.html + +.. _mrs_01_0497: + +Quick Creation of a Kafka Streaming Cluster +=========================================== + +This section describes how to quickly create a Kafka streaming cluster. The Kafka cluster uses the Kafka and Storm components to provide an open-source messaging system with high throughput and scalability. It is widely used in scenarios such as log collection and monitoring data aggregation to implement efficient streaming data collection and real-time data processing and storage. + + +Quick Creation of a Kafka Streaming Cluster +------------------------------------------- + +#. Log in to the MRS console. + +#. Click **Create Cluster**. The page for creating a cluster is displayed. + +#. Click the **Quick Config** tab. + +#. Configure basic cluster information. For details about the parameters, see :ref:`Creating a Custom Cluster `. + + - **Region**: Use the default value. + - **Cluster Name**: You can use the default name. However, you are advised to include a project name abbreviation or date for consolidated memory and easy distinguishing, for example, **mrs_20200321**. + - **Cluster Version**: The components provided by a cluster vary according to the cluster version. Select a cluster version based on site requirements. + - **Component**: Select **Kafka streaming cluster**. + - **AZ**: Use the default value. + - **VPC**: Use the default value. If there is no available VPC, click **View VPC** to access the VPC console and create a new VPC. + - **Subnet**: Use the default value. + - **Enterprise Project**: Use the default value. + - **Cluster Node**: Select the number of cluster nodes and node specifications based on site requirements. For MRS 3.\ *x* or later, the memory of the master node must be greater than 64 GB. + - **Cluster HA**: Use the default value. This parameter is not available in MRS 3.\ *x*. + - **Kerberos Authentication**: Select whether to enable Kerberos authentication. + - **Username**: The default username is **admin**, which is used to log in to MRS Manager. + - **Password**: Set a password for user **admin**. + - **Confirm Password**: Enter the password of user **admin** again. + - **Key Pair**: Select a key pair from the drop-down list to log in to an ECS. Select **"I acknowledge that I have obtained private key file** *SSHkey-xxx* **and that without this file I will not be able to log in to my ECS.**" If you have never created a key pair, click **View Key Pair** to create or import a key pair. And then, obtain a private key file. + +#. Select **Enable** to enable secure communications. For details, see :ref:`Communication Security Authorization `. + +#. **Click Apply Now.** + + If Kerberos authentication is enabled for a cluster, check whether Kerberos authentication is required. If yes, click **Continue**. If no, click **Back** to disable Kerberos authentication and then create a cluster. + +#. Click **Back to Cluster List** to view the cluster status. Click **Access Cluster** to view cluster details. 
+ + For details about cluster status during creation, see the description of the status parameters in :ref:`Table 1 `. + + It takes some time to create a cluster. The initial status of the cluster is **Starting**. After the cluster has been created successfully, the cluster status becomes **Running**. + + On the MRS management console, a maximum of 10 clusters can be concurrently created, and a maximum of 100 clusters can be managed. diff --git a/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_real-time_analysis_cluster.rst b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_real-time_analysis_cluster.rst new file mode 100644 index 0000000..16aa3b9 --- /dev/null +++ b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_a_real-time_analysis_cluster.rst @@ -0,0 +1,61 @@ +:original_name: mrs_01_2355.html + +.. _mrs_01_2355: + +Quick Creation of a Real-time Analysis Cluster +============================================== + +This section describes how to quickly create a real-time analysis cluster. The real-time analysis cluster uses Hadoop, Kafka, Flink, and ClickHouse to collect, analyze, and query a large amount of data in real time. + +The real-time analysis cluster consists of the following components: + +- MRS 3.1.0: Hadoop 3.1.1, Kafka 2.11-2.4.0, Flink 1.12.0, ClickHouse 21.3.4.25, ZooKeeper 3.5.6, and Ranger 2.0.0. + + +Quick Creation of a Real-time Analysis Cluster +---------------------------------------------- + +#. Log in to the MRS console. + +#. Click **Create Cluster**. The page for creating a cluster is displayed. + +#. Click the **Quick Config** tab. + +#. Configure basic cluster information. For details about the parameters, see :ref:`Creating a Custom Cluster `. + + - **Region**: Use the default value. + - **Cluster Name**: You can use the default name. However, you are advised to include a project name abbreviation or date for consolidated memory and easy distinguishing, Example: **mrs_20201130**. + - **Cluster Version**: Select the latest version, which is the default value. (The components provided by a cluster vary according to the cluster version. Select a cluster version based on site requirements.) + - **Component**: Select **Real-time Analysis Cluster**. + - **AZ**: Use the default value. + - **VPC**: Use the default value. If there is no available VPC, click **View VPC** to access the VPC console and create a new VPC. + - **Subnet**: Use the default value. + - **Enterprise Project**: Use the default value. + - **Cluster Node**: Select the number of cluster nodes and node specifications based on site requirements. For MRS 3.\ *x* or later, the memory of the master node must be greater than 64 GB. + - **Kerberos Authentication**: Select whether to enable Kerberos authentication. + - MRS 3.1.2-LTS.3 + + - **Username**: The default value is **root/admin**. User **root** is used to remotely log in to ECSs, and user **admin** is used to access the cluster management page. + - **Password**: Set a password for user **root**/**admin**. + - **Confirm Password**: Enter the password of user **root**/**admin** again. + + - Versions earlier than MRS 3.1.2-LTS.3 + + - **Username**: The default username is **admin**, which is used to log in to MRS Manager. + - **Password**: Set a password for user **admin**. + - **Confirm Password**: Enter the password of user **admin** again. + - **Key Pair**: Select a key pair from the drop-down list to log in to an ECS. 
Select **"I acknowledge that I have obtained private key file** *SSHkey-xxx* **and that without this file I will not be able to log in to my ECS.**" If you have never created a key pair, click **View Key Pair** to create or import a key pair. And then, obtain a private key file. + +#. Select **Enable** to enable secure communications. For details, see :ref:`Communication Security Authorization `. + +#. **Click Apply Now.** + + If Kerberos authentication is enabled for a cluster, check whether Kerberos authentication is required. If yes, click **Continue**. If no, click **Back** to disable Kerberos authentication and then create a cluster. + +#. Click **Back to Cluster List** to view the cluster status. Click **Access Cluster** to view cluster details. + + For details about cluster status during creation, see the description of the status parameters in :ref:`Table 1 `. + + It takes some time to create a cluster. The initial status of the cluster is **Starting**. After the cluster has been created successfully, the cluster status becomes **Running**. + + On the MRS management console, a maximum of 10 clusters can be concurrently created, and a maximum of 100 clusters can be managed. diff --git a/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_an_hbase_analysis_cluster.rst b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_an_hbase_analysis_cluster.rst new file mode 100644 index 0000000..be5fd4d --- /dev/null +++ b/umn/source/configuring_a_cluster/quick_creation_of_a_cluster/quick_creation_of_an_hbase_analysis_cluster.rst @@ -0,0 +1,58 @@ +:original_name: mrs_01_0496.html + +.. _mrs_01_0496: + +Quick Creation of an HBase Analysis Cluster +=========================================== + +This section describes how to quickly create an HBase query cluster. The HBase cluster uses Hadoop and HBase components to provide a column-oriented distributed cloud storage system featuring enhanced reliability, excellent performance, and elastic scalability. It applies to the storage and distributed computing of massive amounts of data. You can use HBase to build a storage system capable of storing TB- or even PB-level data. With HBase, you can filter and analyze data with ease and get responses in milliseconds, rapidly mining data value. + + +Quick Creation of an HBase Analysis Cluster +------------------------------------------- + +#. Log in to the MRS console. + +#. Click **Create Cluster**. The page for creating a cluster is displayed. + +#. Click the **Quick Config** tab. + +#. Configure basic cluster information. For details about the parameters, see :ref:`Creating a Custom Cluster `. + + - **Region**: Use the default value. + - **Cluster Name**: You can use the default name. However, you are advised to include a project name abbreviation or date for consolidated memory and easy distinguishing, for example, **mrs_20180321**. + - **Cluster Version**: Select the latest version, which is the default value. (The components provided by a cluster vary according to the cluster version. Select a cluster version based on site requirements.) + - **Component**: Select **HBase Query Cluster**. + - **AZ**: Use the default value. + - **VPC**: Use the default value. If there is no available VPC, click **View VPC** to access the VPC console and create a new VPC. + - **Subnet**: Use the default value. + - **Enterprise Project**: Use the default value. + - **Cluster Node**: Select the number of cluster nodes and node specifications based on site requirements. 
For MRS 3.\ *x* or later, the memory of the master node must be greater than 64 GB. + - **Cluster HA**: Use the default value. This parameter is not available in MRS 3.\ *x*. + - **Kerberos Authentication**: Select whether to enable Kerberos authentication. + - MRS 3.1.2-LTS.3 + + - **Username**: The default value is **root/admin**. User **root** is used to remotely log in to ECSs, and user **admin** is used to access the cluster management page. + - **Password**: Set a password for user **root**/**admin**. + - **Confirm Password**: Enter the password of user **root**/**admin** again. + + - Versions earlier than MRS 3.1.2-LTS.3 + + - **Username**: The default username is **admin**, which is used to log in to MRS Manager. + - **Password**: Set a password for user **admin**. + - **Confirm Password**: Enter the password of user **admin** again. + - **Key Pair**: Select a key pair from the drop-down list to log in to an ECS. Select **"I acknowledge that I have obtained private key file** *SSHkey-xxx* **and that without this file I will not be able to log in to my ECS.**" If you have never created a key pair, click **View Key Pair** to create or import a key pair. Then obtain the private key file. + +#. Select **Enable** to enable secure communications. For details, see :ref:`Communication Security Authorization `. + +#. Click **Create Now**. + + If Kerberos authentication is enabled for a cluster, check whether Kerberos authentication is required. If yes, click **Continue**. If no, click **Back** to disable Kerberos authentication and then create a cluster. + +#. Click **Back to Cluster List** to view the cluster status. Click **Access Cluster** to view cluster details. + + For details about cluster status during creation, see the description of the status parameters in :ref:`Table 1 `. + + It takes some time to create a cluster. The initial status of the cluster is **Starting**. After the cluster has been created successfully, the cluster status becomes **Running**. + + On the MRS management console, a maximum of 10 clusters can be concurrently created, and a maximum of 100 clusters can be managed. diff --git a/umn/source/configuring_a_cluster/viewing_failed_mrs_tasks.rst b/umn/source/configuring_a_cluster/viewing_failed_mrs_tasks.rst new file mode 100644 index 0000000..0b1d312 --- /dev/null +++ b/umn/source/configuring_a_cluster/viewing_failed_mrs_tasks.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_0043.html + +.. _mrs_01_0043: + +Viewing Failed MRS Tasks +======================== + +This section describes how to view and delete a failed MRS task. + +Background +---------- + +If a cluster fails to be created, terminated, scaled out, or scaled in, the **Manage Failed Tasks** page is displayed. Only the tasks that fail to be deleted are displayed on the **Cluster History** page. You can delete a failed task that is not required. + +Procedure +--------- + +#. Log in to the MRS console. + +#. Click |image1| in the upper-left corner on the management console and select a region and project. + +#. In the left navigation pane, choose **Clusters** > **Active Clusters**. + +#. Click |image2| or the number on the right of **Failed Tasks**. The **Manage Failed Tasks** page is displayed. + +#. In the **Operation** column of the failed task that you want to delete, click **Delete**. + + Only one task can be deleted at a time in this step. + +#. You can click **Delete All** in the upper left corner of the task list to delete all failed tasks. + +.. |image1| image:: /_static/images/en-us_image_0000001349057865.png +.. 
|image2| image:: /_static/images/en-us_image_0000001296058044.jpg diff --git a/umn/source/configuring_a_cluster/viewing_information_of_a_historical_cluster.rst b/umn/source/configuring_a_cluster/viewing_information_of_a_historical_cluster.rst new file mode 100644 index 0000000..18fb714 --- /dev/null +++ b/umn/source/configuring_a_cluster/viewing_information_of_a_historical_cluster.rst @@ -0,0 +1,98 @@ +:original_name: en-us_topic_0057514383.html + +.. _en-us_topic_0057514383: + +Viewing Information of a Historical Cluster +=========================================== + +Choose **Clusters > Cluster History** and click the name of a target cluster. You can view the cluster configuration and deployed node information. + +The following table describes the parameters for the historical cluster information. + +.. table:: **Table 1** Basic cluster information + + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=============================================================================================================================================================================================================================================================+ + | Cluster Name | Name of a cluster. The cluster name is set when the cluster is created. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Status | Status of a cluster. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Version | Cluster version | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Type | Type of the cluster to be created. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Obtaining a cluster ID | Unique identifier of a cluster, which is automatically assigned when a cluster is created | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Created | Time when a cluster is created. 
| + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | AZ | Availability zone (AZ) in the region of a cluster, which is set when a cluster is created. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Subnet | Subnet selected during cluster creation. | + | | | + | | A subnet provides dedicated network resources that are isolated from other networks, improving network security. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | VPC | VPC selected during cluster creation. | + | | | + | | A VPC is a secure, isolated, and logical network environment. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | OBS Permission Control | Click **Manage** and modify the mapping between MRS users and OBS permissions. For details, see :ref:`Configuring Fine-Grained Permissions for MRS Multi-User Access to OBS `. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Creating a data connection | Click **Manage** to view the data connection type associated with the cluster. For details, see :ref:`Configuring Data Connections `. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Agency | Click **Manage Agency** to bind or modify an agency for the cluster. | + | | | + | | An agency allows ECS or BMS to manage MRS resources. You can configure an agency of the ECS type to automatically obtain the AK/SK to access OBS. For details, see :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) `. | + | | | + | | The **MRS_ECS_DEFAULT_AGENCY** agency has the OBSOperateAccess permission of OBS and the CESFullAccess (for users who have enabled fine-grained policies), CES Administrator, and KMS Administrator permissions in the region where the cluster is located. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Manager IP Address | Floating IP address for accessing Manager. | + | | | + | | .. 
note:: | + | | | + | | - The cluster manager IP address is displayed on the **Basic Information** page of the cluster with Kerberos authentication enabled instead of the cluster with Kerberos authentication disabled. | + | | - This parameter is valid only in versions earlier than MRS 1.9.2. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Key Pair | Name of a key pair. Set this parameter when creating a cluster. | + | | | + | | If the login mode is set to password during cluster creation, this parameter is not displayed. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Kerberos Authentication | Whether to enable Kerberos authentication when logging in to Manager. | + | | | + | | .. note:: | + | | | + | | Kerberos authentication cannot be manually enabled or disabled after the cluster is created. Set this parameter with caution when creating a cluster. If you need to change the authentication status, you are advised to create a new cluster. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Security Group | Security group name of the cluster. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Disk Key Name | Name of the key used to encrypt data disks. To manage the used keys, log in to the key management console. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Disk Key ID | ID of the key used to encrypt data disks. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Component Version | Version of each component installed in the cluster. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | License Version | License version of the cluster. 
| + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Agency | Delegates ECSs or BMSs to manage some of your resources. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Go back to the historical clusters page. You can use the following buttons to perform operations. For details about the buttons, see the following table. + +.. table:: **Table 2** Icon description + + +----------+------------------------------------------------------------------------------------+ + | Icon | Description | + +==========+====================================================================================+ + | |image5| | Click |image6| to manually refresh the node information. | + +----------+------------------------------------------------------------------------------------+ + | |image7| | Enter a cluster name in the search bar and click |image8| to search for a cluster. | + +----------+------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001349057929.png +.. |image2| image:: /_static/images/en-us_image_0000001349057929.png +.. |image3| image:: /_static/images/en-us_image_0000001348738129.png +.. |image4| image:: /_static/images/en-us_image_0000001295738316.png +.. |image5| image:: /_static/images/en-us_image_0000001349057929.png +.. |image6| image:: /_static/images/en-us_image_0000001349057929.png +.. |image7| image:: /_static/images/en-us_image_0000001348738129.png +.. |image8| image:: /_static/images/en-us_image_0000001295738316.png diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/configuring_a_storage-compute_decoupled_cluster_agency.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/configuring_a_storage-compute_decoupled_cluster_agency.rst new file mode 100644 index 0000000..b732605 --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/configuring_a_storage-compute_decoupled_cluster_agency.rst @@ -0,0 +1,302 @@ +:original_name: mrs_01_0768.html + +.. _mrs_01_0768: + +Configuring a Storage-Compute Decoupled Cluster (Agency) +======================================================== + +MRS allows you to store data in OBS and use an MRS cluster for data computing only. In this way, storage and compute are separated. You can create an IAM agency, which enables ECS to automatically obtain the temporary AK/SK to access OBS. This prevents the AK/SK from being exposed in the configuration file. + +By binding an agency, ECSs or BMSs can manage some of your resources. Determine whether to configure an agency based on the actual service scenario. + +MRS provides the following configuration modes for accessing OBS. You can select one of them. The agency mode is recommended. + +- Bind an agency of the ECS type to an MRS cluster to access OBS, preventing the AK/SK from being exposed in the configuration file. For details, see the following part in this section. +- Configure the AK/SK in an MRS cluster. 
The AK/SK will be exposed in the configuration file in plaintext. Exercise caution when performing this operation. For details, see :ref:`Configuring a Storage-Compute Decoupled Cluster (AK/SK) `. + +This function is available for components Hadoop, Hive, Spark, Presto, and Flink in clusters of . + +.. _mrs_01_0768__section092413322482: + +(Optional) Step 1: Create an ECS Agency with OBS Access Permissions +------------------------------------------------------------------- + +.. note:: + + - MRS presets **MRS_ECS_DEFAULT_AGENCY** in the agency list of IAM so that you can select this agency when creating a cluster. This agency has the **OBSOperateAccess** permission and the **CESFullAccess** (only available for users who have enabled fine-grained policies), **CES Administrator**, and **KMS Administrator** permissions in the region where the cluster is located. Do not modify **MRS_ECS_DEFAULT_AGENCY** on IAM. + - If you want to use the preset agency, skip the step for creating an agency. If you want to use a custom agency, perform the following steps to create an agency. (To create or modify an agency, you must have the Security Administrator permission.) + +#. Log in to the management console. +#. Choose **Service List** > **Management & Governance** > **Identity and Access Management**. +#. Choose **Agencies**. On the displayed page, click **Create Agency**. +#. Enter an agency name, for example, **mrs_ecs_obs**. +#. Set **Agency Type** to **Cloud service** and select **ECS BMS** to authorize ECS or BMS to invoke OBS. +#. Set **Validity Period** to **Unlimited** and click **Next**. +#. On the displayed page, search for the **OBS OperateAccess** and select it. +#. Click **Next**. On the displayed page, select the desired scope for permissions you selected. By default, **All resources** is selected. Click **Show More** and select **Global resources**. +#. In the dialog box that is displayed, click **OK** to start authorization. After the message "**Authorization successful.**" is displayed, click **Finish**. The agency is successfully created. + +Step 2: Create a Cluster with Storage and Compute Separated +----------------------------------------------------------- + +You can configure an agency when creating a cluster or bind an agency to an existing cluster to separate storage and compute. This section uses a cluster with Kerberos authentication enabled as an example. + +**Configuring an agency when creating a cluster**: + +#. Log in to the MRS management console. + +#. Click **Create Cluster**. The page for creating a cluster is displayed. + +#. Click the **Custom Config** tab. + +#. On the **Custom Config** tab page, set software parameters. + + - **Region**: Select a region as required. + - **Cluster Name**: You can use the default name. However, you are advised to include a project name abbreviation or date for consolidated memory and easy distinguishing. + - Cluster Version: Select a cluster version. + - **Metadata**: Select **Local**. + +#. Click **Next** and set hardware parameters. + + - **AZ**: Use the default value. + - **VPC**: Use the default value. + - **Subnet**: Use the default value. + - **Security Group**: Use the default value. + - **EIP**: Use the default value. + - **Cluster Node**: Select the number of cluster nodes and node specifications based on site requirements. + +#. Click **Next** and set related parameters. + + - **Kerberos Authentication**: This function is enabled by default. You can enable or disable it. 
+ - **Username**: The default username is **admin**, which is used to log in to MRS Manager. + - **Password**: Set a password for user **admin**. + - **Confirm Password**: Enter the password of user **admin** again. + - **Key Pair**: Select a key pair from the drop-down list to log in to an ECS. Select **"I acknowledge that I have obtained private key file** *SSHkey-xxx* **and that without this file I will not be able to log in to my ECS.**" If you have never created a key pair, click **View Key Pair** to create or import a key pair. And then, obtain a private key file. + +#. In this example, configure an agency and leave other parameters blank. For details about how to configure other parameters, see :ref:`(Optional) Advanced Configuration `. + + **Agency**: Select the agency created in :ref:`(Optional) Step 1: Create an ECS Agency with OBS Access Permissions ` or **MRS_ECS_DEFAULT_AGENCY** preset in IAM. + +#. To enable secure communications, select **Enable**. For details, see :ref:`Communication Security Authorization `. + +#. Click **Apply Now** and wait until the cluster is created. + + If Kerberos authentication is enabled for a cluster, check whether Kerberos authentication is required. If yes, click **Continue**. If no, click **Back** to disable Kerberos authentication and then create a cluster. + +**Configuring an agency for an existing cluster**: + +#. Log in to the MRS management console. In the left navigation pane, choose **Clusters** > **Active Clusters**. +#. Click the name of the cluster to enter its details page. +#. On the **Dashboard** page, click **Synchronize** on the right of **IAM User Sync** to synchronize IAM users. +#. On the **Dashboard** tab page, click **Manage Agency** on the right side of **Agency** to select an agency and click **OK** to bind it. Alternatively, click **Create Agency** to go to the IAM console to create an agency and select it. + +Step 3: Create an OBS File System for Storing Data +-------------------------------------------------- + +.. note:: + + In the big data decoupled storage-compute scenario, the OBS parallel file system must be used to configure a cluster. Using common object buckets will greatly affect the cluster performance. + +#. Log in to OBS Console. + +#. Choose **Parallel File System** > **Create Parallel File System**. + +#. Enter the file system name, for example, **mrs-word001**. + + Set other parameters as required. + +#. Click **Create Now**. + +#. In the parallel file system list on the OBS console, click the file system name to go to the details page. + +#. In the navigation pane, choose **Files** and create the **program** and **input** folders. + + - **program**: Upload the program package to this folder. + - **input**: Upload the input data to this folder. + +Step 4: Accessing the OBS File System +------------------------------------- + +#. Log in to a Master node as user **root**. For details, see :ref:`Logging In to an ECS `. + +#. Run the following command to set the environment variables: + + For versions earlier than MRS 3.x, run the **source /opt/client/bigdata_env** command. + + For MRS 3.x or later, run the **source /opt/Bigdata/client/bigdata_env** command. + +#. Verify that Hadoop can access OBS. + + a. View the list of files in the file system **mrs-word001**. + + **hadoop fs -ls obs://mrs-word001/** + + b. Check whether the file list is returned. If it is returned, OBS access is successful. + + + .. 
figure:: /_static/images/en-us_image_0000001296217708.png + :alt: **Figure 1** Returned file list + + **Figure 1** Returned file list + +#. Verify that Hive can access OBS. + + a. If Kerberos authentication has been enabled for the cluster, run the following command to authenticate the current user. The current user must have a permission to create Hive tables. For details about how to configure a role with a permission to create Hive tables, see :ref:`Creating a Role `. For details about how to create a user and bind a role to the user, see :ref:`Creating a User `. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** **MRS cluster user** + + Example: **kinit hiveuser** + + b. Run the client command of the Hive component. + + **beeline** + + c. Access the OBS directory in the beeline. For example, run the following command to create a Hive table and specify that data is stored in the **test_obs** directory of the file system **mrs-word001**: + + **create table test_obs(a int, b string) row format delimited fields terminated by "," stored as textfile location "obs://mrs-word001/test_obs";** + + d. Run the following command to query all tables. If table **test_obs** is displayed in the command output, OBS access is successful. + + **show tables;** + + + .. figure:: /_static/images/en-us_image_0000001348738105.png + :alt: **Figure 2** Returned table name + + **Figure 2** Returned table name + + e. Press **Ctrl+C** to exit the Hive beeline. + +#. Verify that Spark can access OBS. + + a. Run the client command of the Spark component. + + **spark-beeline** + + b. Access OBS in spark-beeline. For example, create table **test** in the **obs://mrs-word001/table/** directory. + + **create table test(id int) location 'obs://mrs-word001/table/';** + + c. Run the following command to query all tables. If table **test** is displayed in the command output, OBS access is successful. + + **show tables;** + + + .. figure:: /_static/images/en-us_image_0000001349057897.png + :alt: **Figure 3** Returned table name + + **Figure 3** Returned table name + + d. Press **Ctrl+C** to exit the Spark beeline. + +#. Verify that Presto can access OBS. + + - For normal clusters with Kerberos authentication disabled + + a. Run the following command to connect to the client: + + **presto_cli.sh** + + b. On the Presto client, run the following statement to create a schema and set **location** to an OBS path: + + **CREATE SCHEMA hive.demo01 WITH (location = 'obs://mrs-word001/presto-demo002/');** + + c. Create a table in the schema. The table data is stored in the OBS file system. The following is an example. + + **CREATE TABLE hive.demo.demo_table WITH (format = 'ORC') AS SELECT \* FROM tpch.sf1.customer;** + + + .. figure:: /_static/images/en-us_image_0000001349257377.png + :alt: **Figure 4** Return result + + **Figure 4** Return result + + d. Run **exit** to exit the client. + + - For security clusters with Kerberos authentication enabled + + a. .. _mrs_01_0768__li251015403210: + + Log in to MRS Manager and create a role with the Hive Admin Privilege permissions, for example, **prestorole**. For details about how to create a role, see :ref:`Creating a Role `. + + b. .. _mrs_01_0768__li55542531841: + + Create a user that belongs to the Presto and Hive groups and bind the role created in :ref:`6.a ` to the user, for example, **presto001**. For details about how to create a user, see :ref:`Creating a User `. + + c. Authenticate the current user. + + **kinit presto001** + + d. 
Download the user credential. + + #. For MRS 3.x earlier, on MRS Manager, choose **System** > **Manage User**. In the row of the new user, choose **More** > **Download Authentication Credential**. + + + .. figure:: /_static/images/en-us_image_0000001349057901.png + :alt: **Figure 5** Downloading the Presto user authentication credential + + **Figure 5** Downloading the Presto user authentication credential + + #. On FusionInsight Manager for MRS 3.x or later,, choose **System > Permission > User**. In the row that contains the newly added user, click **More > Download Authentication Credential**. + + + .. figure:: /_static/images/en-us_image_0000001296058088.png + :alt: **Figure 6** Downloading the Presto user authentication credential + + **Figure 6** Downloading the Presto user authentication credential + + e. .. _mrs_01_0768__li65281811161910: + + Decompress the downloaded user credential file, and save the obtained **krb5.conf** and **user.keytab** files to the client directory, for example, **/opt/Bigdata/client/Presto/**. + + f. .. _mrs_01_0768__li165280118198: + + Run the following command to obtain a user principal: + + **klist -kt /opt/Bigdata/client/Presto/user.keytab** + + g. For clusters with Kerberos authentication enabled, run the following command to connect to the Presto Server of the cluster: + + **presto_cli.sh --krb5-config-path {krb5.conf file path} --krb5-principal {user principal} --krb5-keytab-path {user.keytab file path} --user {presto username}** + + - **krb5.conf** file path: Replace it with the file path set in :ref:`6.e `, for example, **/opt/Bigdata/client/Presto/krb5.conf**. + - **user.keytab** file path: Replace it with the file path set in :ref:`6.e `, for example, **/opt/Bigdata/client/Presto/user.keytab**. + - **user principal**: Replace it with the result returned in :ref:`6.f `. + - **presto username**: Replace it with the name of the user created in :ref:`6.b `, for example, **presto001**. + + Example: presto_cli.sh --krb5-config-path /opt/Bigdata/client/Presto/krb5.conf --krb5-principal prest001@xxx_xxx_xxx_xxx.COM --krb5-keytab-path /opt/Bigdata/client/Presto/user.keytab --user presto001 + + h. On the Presto client, run the following statement to create a schema and set **location** to an OBS path: + + **CREATE SCHEMA hive.demo01 WITH (location = 'obs://mrs-word001/presto-demo002/');** + + i. Create a table in the schema. The table data is stored in the OBS file system. The following is an example. + + **CREATE TABLE hive.demo01.demo_table WITH (format = 'ORC') AS SELECT \* FROM tpch.sf1.customer;** + + + .. figure:: /_static/images/en-us_image_0000001296058084.png + :alt: **Figure 7** Return result + + **Figure 7** Return result + + j. Run **exit** to exit the client. + +#. Verify that Flink can access OBS. + + a. On the **Dashboard** page, click **Synchronize** on the right of **IAM User Sync** to synchronize IAM users. + + b. After user synchronization is complete, choose **Jobs** > **Create** on the cluster details page to create a Flink job. In **Parameters**, enter parameters in **--input --output ** format. You can click **OBS** to select a job input path, and enter a job output path that does not exist, for example, **obs://mrs-word001/output/**. + + c. On OBS Console, go to the output path specified during job creation. If the output directory is automatically created and contains the job execution results, OBS access is successful. + + + .. 
figure:: /_static/images/en-us_image_0000001390874236.png + :alt: **Figure 8** Flink job execution result + + **Figure 8** Flink job execution result + +Reference +--------- + +For details about how to control permissions to access OBS, see :ref:`Configuring Fine-Grained Permissions for MRS Multi-User Access to OBS `. diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/configuring_a_storage-compute_decoupled_cluster_ak_sk.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/configuring_a_storage-compute_decoupled_cluster_ak_sk.rst new file mode 100644 index 0000000..b8c1aad --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/configuring_a_storage-compute_decoupled_cluster_ak_sk.rst @@ -0,0 +1,203 @@ +:original_name: mrs_01_0468.html + +.. _mrs_01_0468: + +Configuring a Storage-Compute Decoupled Cluster (AK/SK) +======================================================= + +In MRS 1.9.2 or later, OBS can be interconnected with MRS using **obs://**. Currently, Hadoop, Hive, Spark, Presto, and Flink are supported. HBase cannot use **obs://** to interconnect with OBS. + +MRS provides the following configuration modes for accessing OBS. You can select one of them. The agency mode is recommended. + +- Bind an agency of the ECS type to an MRS cluster to access OBS, preventing the AK/SK from being exposed in the configuration file. For details, see :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) `. +- Configure the AK/SK in an MRS cluster. The AK/SK will be exposed in the configuration file in plaintext. Exercise caution when performing this operation. For details, see the following part in this section. + +.. note:: + + - To improve data write performance, change the value of the **fs.obs.buffer.dir** parameter of the corresponding service to a data disk directory. + - In the big data decoupled storage-compute scenario, the OBS parallel file system must be used to configure a cluster. Using common object buckets will greatly affect the cluster performance. + +Using Hadoop to Access OBS +-------------------------- + +- Add the following content to file **core-site.xml** in the HDFS directory (**$client_home/HDFS/hadoop/etc/hadoop**) on the MRS client: + + .. code-block:: + + <property> + <name>fs.obs.access.key</name> + <value>ak</value> + </property> + <property> + <name>fs.obs.secret.key</name> + <value>sk</value> + </property> + <property> + <name>fs.obs.endpoint</name> + <value>obs endpoint</value> + </property> + + .. important:: + + AK and SK will be displayed as plaintext in the configuration file. Exercise caution when setting AK and SK in the file. + + After the configuration is added, you can directly access data on OBS without manually adding the AK/SK and endpoint. For example, run the following command to view the file list of the **test_obs_orc** directory in the **obs-test** file system: + + **hadoop fs -ls "obs://obs-test/test_obs_orc"** + +- Add AK/SK and endpoint to the command line to access data on OBS. + + **hadoop fs -Dfs.obs.endpoint=xxx -Dfs.obs.access.key=xx -Dfs.obs.secret.key=xx -ls "obs://obs-test/test_obs_orc"** + +.. _mrs_01_0468__section1164714235144: + +Using Hive to Access OBS +------------------------ + +#. Go to the Hive service configuration page. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager, choose **Services** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + - For MRS 1.9.2 or later, click the cluster name on the MRS console, choose **Components** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. 
+ + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - For MRS 3.\ *x* or later, log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. And choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations** > **All Configurations**. + +#. In the configuration type drop-down box, switch **Basic Configurations** to **All Configurations**. + +#. Search for **fs.obs.access.key** and **fs.obs.secret.key** and set them to the AK and SK of OBS respectively. + + If the preceding two parameters cannot be found in the current cluster, choose **Hive > Customization** in the navigation tree on the left and add the two parameters to the customized parameter **core.site.customized.configs**. + +#. Click **Save Configuration** and select **Restart the affected services or instances**. to restart the Hive service. + +#. Access the OBS directory in the beeline. For example, run the following command to create a Hive table and specify that data is stored in the **test_obs** directory in the **test-bucket** file system: + + **create table test_obs(a int, b string) row format delimited fields terminated by "," stored as textfile location "obs://test-bucket/test_obs";** + +Using Spark to Access OBS +------------------------- + +.. note:: + + SparkSQL depends on Hive. Therefore, when configuring OBS on Spark, you need to modify the OBS configuration used in :ref:`Using Hive to Access OBS `. + +- spark-beeline and spark-sql + + You can add the following OBS attributes to the shell to access OBS: + + .. code-block:: + + set fs.obs.endpoint=xxx + set fs.obs.access.key=xxx + set fs.obs.secret.key=xxx + +- spark-beeline + + The spark-beeline can access OBS by configuring service parameters on Manager. The procedure is as follows: + + #. Go to the Spark configuration page. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager and choose **Services** > **Spark** > **Service Configuration**. + - For MRS 1.9.2 or later, click the cluster name on the MRS console and choose **Components** > **Spark** > **Service Configuration**. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - For MRS 3.\ *x* or later, log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Configurations**. + + #. In the configuration type drop-down box, switch **Basic Configurations** to **All Configurations**. + + #. Choose **JDBCServer** > **OBS**, and set values for **fs.obs.access.key** and **fs.obs.secret.key**. + + If the preceding two parameters cannot be found in the current cluster, choose **JDBCServer** > **Customization** in the navigation tree on the left and add the two parameters to the customized parameter **spark.core-site.customized.configs**. + + + .. figure:: /_static/images/en-us_image_0000001295738100.png + :alt: **Figure 1** Parameters for adding an OBS + + **Figure 1** Parameters for adding an OBS + + #. Click **Save Configuration** and select **Restart the affected services or instances**. Restart the Spark service. + + #. Access OBS in **spark-beeline**. 
For example, access the **obs://obs-demo-input/table/** directory. + + **create table test(id int) location 'obs://obs-demo-input/table/';** + +- spark-sql and spark-submit + + spark-sql can also access OBS if you modify the **core-site.xml** configuration file. + + The configuration file is modified in the same way whether you use spark-sql or spark-submit to submit tasks that access OBS. + + Add the following content to **core-site.xml** in the Spark configuration folder (**$client_home/Spark/spark/conf**) on the MRS client: + + .. code-block:: + + <property> + <name>fs.obs.access.key</name> + <value>ak</value> + </property> + <property> + <name>fs.obs.secret.key</name> + <value>sk</value> + </property> + <property> + <name>fs.obs.endpoint</name> + <value>obs endpoint</value> + </property> + +Using Presto to Access OBS +-------------------------- + +#. Go to the cluster details page and choose **Components** > **Presto** > **Service Configuration**. + +#. In the configuration type drop-down box, switch **Basic Configurations** to **All Configurations**. + +#. Search for and configure the following parameters: + + - Set **fs.obs.access.key** to **AK**. + - Set **fs.obs.secret.key** to **SK**. + + If the preceding two parameters cannot be found in the current cluster, choose **Presto > Hive** in the navigation tree on the left and add the two parameters to the customized parameter **core.site.customized.configs**. + +#. Click **Save Configuration** and select **Restart the affected services or instances** to restart the Presto service. + +#. Choose **Components** > **Hive** > **Service Configuration**. + +#. In the configuration type drop-down box, switch **Basic Configurations** to **All Configurations**. + +#. Search for and configure the following parameters: + + - Set **fs.obs.access.key** to **AK**. + - Set **fs.obs.secret.key** to **SK**. + +#. Click **Save Configuration** and select **Restart the affected services or instances** to restart the Hive service. + +#. On the Presto client, run the following statement to create a schema and set **location** to an OBS path: + + **CREATE SCHEMA hive.demo WITH (location = 'obs://obs-demo/presto-demo/');** + +#. Create a table in the schema. The table data is stored in the OBS file system. The following is an example. + + **CREATE TABLE hive.demo.demo_table WITH (format = 'ORC') AS SELECT \* FROM tpch.sf1.customer;** + +Using Flink to Access OBS +------------------------- + +Add the following configuration to the Flink configuration file of the MRS client in **Client installation path/Flink/flink/conf/flink-conf.yaml**: + +.. code-block:: + + fs.obs.access.key: ak + fs.obs.secret.key: sk + fs.obs.endpoint: obs endpoint + +.. important:: + + AK and SK will be displayed as plaintext in the configuration file. Exercise caution when setting AK and SK in the file. + +After the configuration is added, you can directly access data on OBS without manually adding the AK/SK and endpoint. diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/index.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/index.rst new file mode 100644 index 0000000..675f600 --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_0440.html + +.. 
_mrs_01_0440: + +Configuring a Cluster with Storage and Compute Decoupled +======================================================== + +- :ref:`Introduction to Storage-Compute Decoupling ` +- :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) ` +- :ref:`Configuring a Storage-Compute Decoupled Cluster (AK/SK) ` +- :ref:`Using a Storage-Compute Decoupled Cluster ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + introduction_to_storage-compute_decoupling + configuring_a_storage-compute_decoupled_cluster_agency + configuring_a_storage-compute_decoupled_cluster_ak_sk + using_a_storage-compute_decoupled_cluster/index diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/introduction_to_storage-compute_decoupling.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/introduction_to_storage-compute_decoupling.rst new file mode 100644 index 0000000..dcb6e26 --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/introduction_to_storage-compute_decoupling.rst @@ -0,0 +1,31 @@ +:original_name: mrs_01_0467.html + +.. _mrs_01_0467: + +Introduction to Storage-Compute Decoupling +========================================== + +In scenarios that require large storage capacity and elastic compute resources, MRS enables you to store data in OBS and use an MRS cluster for data computing only. In this way, storage and compute are separated. + +.. note:: + + In the big data decoupled storage-compute scenario, the OBS parallel file system must be used to configure a cluster. Using common object buckets will greatly affect the cluster performance. + +Process of using the storage-compute decoupling function: + +#. Configure a storage-compute decoupled cluster using either of the following methods (agency is recommended): + + - Bind an agency of the ECS type to an MRS cluster to access OBS, preventing the AK/SK from being exposed in the configuration file. For details, see :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) `. + - Configure the AK/SK in an MRS cluster. The AK/SK will be exposed in the configuration file in plaintext. Exercise caution when performing this operation. For details, see :ref:`Configuring a Storage-Compute Decoupled Cluster (AK/SK) `. + +#. Use the cluster. + + For details, see the following sections': + + - :ref:`Interconnecting Flink with OBS ` + - :ref:`Interconnecting Flume with OBS ` + - :ref:`Interconnecting HDFS with OBS ` + - :ref:`Interconnecting Hive with OBS ` + - :ref:`Interconnecting MapReduce with OBS ` + - :ref:`Interconnecting Spark2x with OBS ` + - :ref:`Interconnecting Sqoop with External Storage Systems ` diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/index.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/index.rst new file mode 100644 index 0000000..0d279ef --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/index.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_0643.html + +.. 
_mrs_01_0643: + +Using a Storage-Compute Decoupled Cluster +========================================= + +- :ref:`Interconnecting Flink with OBS ` +- :ref:`Interconnecting Flume with OBS ` +- :ref:`Interconnecting HDFS with OBS ` +- :ref:`Interconnecting Hive with OBS ` +- :ref:`Interconnecting MapReduce with OBS ` +- :ref:`Interconnecting Spark2x with OBS ` +- :ref:`Interconnecting Sqoop with External Storage Systems ` +- :ref:`Interconnecting Hudi with OBS ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + interconnecting_flink_with_obs + interconnecting_flume_with_obs + interconnecting_hdfs_with_obs + interconnecting_hive_with_obs + interconnecting_mapreduce_with_obs + interconnecting_spark2x_with_obs + interconnecting_sqoop_with_external_storage_systems + interconnecting_hudi_with_obs diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_flink_with_obs.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_flink_with_obs.rst new file mode 100644 index 0000000..03afa20 --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_flink_with_obs.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_1288.html + +.. _mrs_01_1288: + +Interconnecting Flink with OBS +============================== + +Before performing the following operations, ensure that you have configured a storage-compute decoupled cluster by referring to :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) ` or :ref:`Configuring a Storage-Compute Decoupled Cluster (AK/SK) `. + +#. Log in to the Flink client installation node as the client installation user. + +#. Run the following command to initialize environment variables: + + **source ${client_home}/bigdata_env** + +#. Configure the Flink client properly. For details, see :ref:`Installing a Client (Version 3.x or Later) `. + +#. For a security cluster, run the following command to perform user authentication. If Kerberos authentication is not enabled for the current cluster, you do not need to run this command. + + **kinit** *Username* + +#. Explicitly add the OBS file system to be accessed in the Flink command line. + + **./bin/flink run --xxx ./config/FlinkCheckpointJavaExample.jar --chkPath** **obs://**\ *Name of the OBS parallel file system* + +.. note:: + + Flink jobs are running on Yarn. Before configuring Flink to interconnect with the OBS file system, ensure that the interconnection between Yarn and the OBS file system is normal. diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_flume_with_obs.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_flume_with_obs.rst new file mode 100644 index 0000000..778ec3e --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_flume_with_obs.rst @@ -0,0 +1,86 @@ +:original_name: en-us_topic_0000001349137409.html + +.. _en-us_topic_0000001349137409: + +Interconnecting Flume with OBS +============================== + +This section applies to MRS 3.x or later. 
+ +Before performing the following operations, ensure that you have configured a storage-compute decoupled cluster by referring to :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) ` or :ref:`Configuring a Storage-Compute Decoupled Cluster (AK/SK) `. + +#. Configure an agency. + + a. Log in to the MRS console. In the navigation pane on the left, choose **Clusters** > **Active Clusters**. + b. Click the name of a cluster to go to the cluster details page. + c. On the **Dashboard** page, click **Synchronize** on the right of **IAM User Sync** to synchronize IAM users. + d. Click **Manage Agency** on the right of **Agency**, select the target agency, and click **OK**. + +#. .. _en-us_topic_0000001349137409__li3279721151214: + + Create an OBS file system for storing data. + + a. Log in to the OBS console. + b. In the navigation pane on the left, choose **Parallel File Systems**. On the displayed page, click **Create Parallel File System**. + c. Enter the file system name, for example, **esdk-c-test-pfs1**, and set other parameters as required. Click **Create Now**. + d. In the parallel file system list on the OBS console, click the created file system name to go to its details page. + e. In the navigation pane on the left, choose **Files** and click **Create Folder** to create the **testFlumeOutput** folder. + +#. Prepare the **properties.properties** file and upload it to the **/opt/flumeInput** directory. + + a. .. _en-us_topic_0000001349137409__li192551317183716: + + Prepare the **properties.properties** file on the local host. Its content is as follows: + + .. code-block:: + + # source + server.sources = r1 + # channels + server.channels = c1 + # sink + server.sinks = obs_sink + # ----- define net source ----- + server.sources.r1.type = seq + server.sources.r1.spooldir = /opt/flumeInput + # ---- define OBS sink ---- + server.sinks.obs_sink.type = hdfs + server.sinks.obs_sink.hdfs.path = obs://esdk-c-test-pfs1/testFlumeOutput + server.sinks.obs_sink.hdfs.filePrefix = %[localhost] + server.sinks.obs_sink.hdfs.useLocalTimeStamp = true + # set file size to trigger roll + server.sinks.obs_sink.hdfs.rollSize = 0 + server.sinks.obs_sink.hdfs.rollCount = 0 + server.sinks.obs_sink.hdfs.rollInterval = 5 + #server.sinks.obs_sink.hdfs.threadsPoolSize = 30 + server.sinks.obs_sink.hdfs.fileType = DataStream + server.sinks.obs_sink.hdfs.writeFormat = Text + server.sinks.obs_sink.hdfs.fileCloseByEndEvent = false + + # define channel + server.channels.c1.type = memory + server.channels.c1.capacity = 1000 + # transaction size + server.channels.c1.transactionCapacity = 1000 + server.channels.c1.byteCapacity = 800000 + server.channels.c1.byteCapacityBufferPercentage = 20 + server.channels.c1.keep-alive = 60 + server.sources.r1.channels = c1 + server.sinks.obs_sink.channel = c1 + + .. note:: + + The value of **server.sinks.obs_sink.hdfs.path** is the OBS file system created in :ref:`2 `. + + b. Log in to the node where the Flume client is installed as user **root**. + + c. Create the **/opt/flumeInput** directory and create a customized **.txt** file in this directory. + + d. Log in to FusionInsight Manager. + + e. Choose **Cluster** > *Name of the target cluster* > **Services** > **Flume**. On the displayed page, click **Configurations** and then **Upload File** in the **Value** column corresponding to the **flume.config.file** parameter, upload the **properties.properties** file prepared in :ref:`3.a `, and click **Save**. + +#. View the result in the OBS system. + + a. 
Log in to the OBS console. + b. Click **Parallel File Systems** and go to the folder created in :ref:`2 ` to view the result. diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_hdfs_with_obs.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_hdfs_with_obs.rst new file mode 100644 index 0000000..97c7c45 --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_hdfs_with_obs.rst @@ -0,0 +1,52 @@ +:original_name: mrs_01_1292.html + +.. _mrs_01_1292: + +Interconnecting HDFS with OBS +============================= + +Before performing the following operations, ensure that you have configured a storage-compute decoupled cluster by referring to :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) ` or :ref:`Configuring a Storage-Compute Decoupled Cluster (AK/SK) `. + +#. Log in to the node on which the HDFS client is installed as a client installation user. + +#. Run the following command to switch to the client installation directory. + + **cd ${client_home}** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If the cluster is in security mode, run the following command to authenticate the user. In normal mode, skip user authentication. + + **kinit** *Component service user* + +#. Explicitly add the OBS file system to be accessed in the HDFS command line. + + Example: + + - Run the following command to access the OBS file system: + + **hdfs dfs -ls obs://**\ *OBS parallel file system name*/*Path* + + - Run the following command to upload the **/opt/test.txt** file from the client node to the OBS file system path: + + **hdfs dfs -put /opt/test.txt obs://**\ *OBS parallel file system name/Path* + +.. note:: + + If a large number of logs are printed in the OBS file system, the read and write performance may be affected. You can adjust the log level of the OBS client as follows: + + **cd ${client_home}/HDFS/hadoop/etc/hadoop** + + **vi log4j.properties** + + Add the OBS log level configuration to the file as follows: + + **log4j.logger.org.apache.hadoop.fs.obs=WARN** + + **log4j.logger.com.obs=WARN** + + |image1| + +.. |image1| image:: /_static/images/en-us_image_0000001349257293.png diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_hive_with_obs.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_hive_with_obs.rst new file mode 100644 index 0000000..ce7aee5 --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_hive_with_obs.rst @@ -0,0 +1,99 @@ +:original_name: mrs_01_1286.html + +.. _mrs_01_1286: + +Interconnecting Hive with OBS +============================= + +Before performing the following operations, ensure that you have configured a storage-compute decoupled cluster by referring to :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) ` or :ref:`Configuring a Storage-Compute Decoupled Cluster (AK/SK) `. + +When creating a table, set the table location to an OBS path. +------------------------------------------------------------- + +#. Log in to the client installation node as the client installation user. + +#. 
Run the following command to initialize environment variables: + + **source ${client_home}/bigdata_env** + +#. For a security cluster, run the following command to perform user authentication (the user must have the permission to perform Hive operations). If Kerberos authentication is not enabled for the current cluster, you do not need to run this command. + + **kinit** *User performing Hive operations* + +#. Log in to FusionInsight Manager and choose **Cluster** > **Services** > **Hive** > **Configurations** > **All Configurations**. + + In the left navigation tree, choose **Hive** > **Customization**. In the customized configuration items, add **dfs.namenode.acls.enabled** to the **hdfs.site.customized.configs** parameter and set its value to **false**. + + |image1| + +#. Click **Save**. Click the **Dashboard** tab and choose **More** > **Restart Service**. In the **Verify Identity** dialog box that is displayed, enter the password of the current user, and click **OK**. In the displayed **Restart Service** dialog box, select **Restart upper-layer services** and click **OK**. Hive is restarted. + +#. Log in to the beeline client and set **Location** to the OBS file system path when creating a table. + + **beeline** + + For example, run the following command to create the table **test** in **obs://**\ *OBS parallel file system name*\ **/user/hive/warehouse/**\ *Database name*\ **/**\ *Table name*: + + **create table test(name string) location "obs://**\ *OBS parallel file system name*\ **/user/hive/warehouse/**\ *Database name*\ **/**\ *Table name*\ **";** + + .. note:: + + You need to add the component operator to the URL policy in the Ranger policy. Set the URL to the complete path of the object on OBS. Select the Read and Write permissions. + +Setting the Default Location of the Created Hive Table to the OBS Path +---------------------------------------------------------------------- + +#. Log in to FusionInsight Manager and choose **Cluster** > **Services** > **Hive** > **Configurations** > **All Configurations**. + +#. In the left navigation tree, choose **MetaStore** > **Customization**. Add **hive.metastore.warehouse.dir** to the **hive.metastore.customized.configs** parameter and set it to the OBS path. + +#. In the left navigation tree, choose **HiveServer** > **Customization**. Add **hive.metastore.warehouse.dir** to the **hive.metastore.customized.configs** and **hive.metastore.customized.configs** parameters, and set it to the OBS path. + +#. Save the configurations and restart Hive. + +#. Update the client configuration file. + + a. Run the following command to open **hivemetastore-site.xml** in the Hive configuration file directory on the client: + + **vim /opt/Bigdata/client/Hive/config/hivemetastore-site.xml** + + b. Change the value of **hive.metastore.warehouse.dir** to the corresponding OBS path. + + |image2| + +#. Log in to the beeline client, create a table, and check whether the location is the OBS path. + + **beeline** + + **create table test(name string);** + + **desc formatted test;** + + .. note:: + + If the database location points to HDFS, the table to be created in the database (without specifying the location) also points to HDFS. If you want to modify the default table creation policy, change the location of the database to OBS by performing the following operations: + + a. Run the following command to query the location of the database: + + **show create database** *obs_test*\ **;** + + |image3| + + b. 
Run the following command to change the database location: + + **alter database** *obs_test* **set location** *'*\ **obs://**\ *OBS parallel file system name*\ **/user/hive/warehouse/**\ *Database name*\ **'** + + Run the **show create database** *obs_test* command to check whether the database location points to OBS. + + |image4| + + c. Run the following command to change the table location: + + **alter table** *user_info* **set location** *'*\ **obs://**\ *OBS parallel file system name*\ **/user/hive/warehouse/**\ *Database name*\ **/**\ *Table name*\ **'** + + If the table contains data, migrate the original data file to the new location. + +.. |image1| image:: /_static/images/en-us_image_0000001440974397.png +.. |image2| image:: /_static/images/en-us_image_0000001295738104.png +.. |image3| image:: /_static/images/en-us_image_0000001348737917.png +.. |image4| image:: /_static/images/en-us_image_0000001296217540.png diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_hudi_with_obs.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_hudi_with_obs.rst new file mode 100644 index 0000000..61c376c --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_hudi_with_obs.rst @@ -0,0 +1,85 @@ +:original_name: mrs_01_24171.html + +.. _mrs_01_24171: + +Interconnecting Hudi with OBS +============================= + +#. Log in to the client installation node as the client installation user. + +#. Run the following commands to configure environment variables: + + **source ${client_home}/bigdata_env** + + **source ${client_home}/Hudi/component_env** + +#. Modify the configuration file: + + **vim ${client_home}/Hudi/hudi/conf/hdfs-site.xml** + + .. code-block:: + + + dfs.namenode.acls.enabled + false + + +#. For a security cluster, run the following command to perform user authentication. If Kerberos authentication is not enabled for the current cluster, you do not need to run this command. + + **kinit** *Username* + +#. Start spark-shell and run the following commands to create a COW table and save it in OBS: + + **import org.apache.hudi.QuickstartUtils.\_** + + **import scala.collection.JavaConversions.\_** + + **import org.apache.spark.sql.SaveMode.\_** + + **import org.apache.hudi.DataSourceReadOptions.\_** + + **import org.apache.hudi.DataSourceWriteOptions.\_** + + **import org.apache.hudi.config.HoodieWriteConfig.\_** + + **val tableName = "hudi_cow_table"** + + **val basePath = "obs://testhudi/cow_table/"** + + **val dataGen = new DataGenerator** + + **val inserts = convertToStringList(dataGen.generateInserts(10))** + + **val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))** + + **df.write.format("org.apache.hudi").** + + **options(getQuickstartWriteConfigs).** + + **option(PRECOMBINE_FIELD_OPT_KEY, "ts").** + + **option(RECORDKEY_FIELD_OPT_KEY, "uuid").** + + **option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").** + + **option(TABLE_NAME, tableName).** + + **mode(Overwrite).** + + **save(basePath);** + +#. Use DataSource to check whether the table is successfully created and whether the data is normal. 
+ + **val roViewDF = spark.** + + **read.** + + **format("org.apache.hudi").** + + **load(basePath + "/*/*/*/*")** + + **roViewDF.createOrReplaceTempView("hudi_ro_table")** + + **spark.sql("select \* from hudi_ro_table").show()** + +#. Run the **:q** command to exit the spark-shell CLI. diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_mapreduce_with_obs.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_mapreduce_with_obs.rst new file mode 100644 index 0000000..b490ff8 --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_mapreduce_with_obs.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0617.html + +.. _mrs_01_0617: + +Interconnecting MapReduce with OBS +================================== + +Before performing the following operations, ensure that you have configured a storage-compute decoupled cluster by referring to :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) ` or :ref:`Configuring a Storage-Compute Decoupled Cluster (AK/SK) `. + +#. Log in to the MRS management console and click the cluster name to go to the cluster details page. + +#. Choose **Components > MapReduce**. The **All Configurations** page is displayed. In the navigation tree on the left, choose **MapReduce > Customization**. In the customized configuration items, add the configuration item **mapreduce.jobhistory.always-scan-user-dir** to **core-site.xml** and set its value to **true**. + + |image1| + +#. Save the configurations and restart the MapReduce service. + +.. |image1| image:: /_static/images/en-us_image_0000001349257365.png diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_spark2x_with_obs.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_spark2x_with_obs.rst new file mode 100644 index 0000000..dbd7c16 --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_spark2x_with_obs.rst @@ -0,0 +1,107 @@ +:original_name: mrs_01_1289.html + +.. _mrs_01_1289: + +Interconnecting Spark2x with OBS +================================ + +The OBS file system can be interconnected with Spark2x after an MRS cluster is installed. + +Before performing the following operations, ensure that you have configured a storage-compute decoupled cluster by referring to :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) ` or :ref:`Configuring a Storage-Compute Decoupled Cluster (AK/SK) `. + +Using Spark Beeline After Cluster Installation +---------------------------------------------- + +#. Log in to FusionInsight Manager and choose **Cluster** > **Services** > **Spark2x** > **Configurations** > **All Configurations**. + + In the left navigation tree, choose **JDBCServer2x** > **Customization**. Add **dfs.namenode.acls.enabled** to the **spark.hdfs-site.customized.configs** parameter and set its value to **false**. + + |image1| + +#. Search for the **spark.sql.statistics.fallBackToHdfs** parameter and set its value to **false**. + + |image2| + +#. Save the configurations and restart the JDBCServer2x instance. + +#. Log in to the client installation node as the client installation user. + +#. 
Run the following command to configure environment variables: + + **source ${client_home}/bigdata_env** + +#. For a security cluster, run the following command to perform user authentication. If Kerberos authentication is not enabled for the current cluster, you do not need to run this command. + + **kinit** *Username* + +#. Access OBS in spark-beeline. For example, create a table named **test** in the **obs://mrs-word001/table/** directory. + + **create table test(id int) location '**\ *obs://mrs-word001/table/*\ **';** + +#. Run the following command to query all tables. If table **test** is displayed in the command output, OBS access is successful. + + **show tables;** + + + .. figure:: /_static/images/en-us_image_0000001349057877.png + :alt: **Figure 1** Verifying the created table name returned using Spark2x + + **Figure 1** Verifying the created table name returned using Spark2x + +#. Press **Ctrl+C** to exit the Spark Beeline. + +Using Spark SQL After Cluster Installation +------------------------------------------ + +#. Log in to the client installation node as the client installation user. + +#. Run the following command to configure environment variables: + + **source ${client_home}/bigdata_env** + +#. Modify the configuration file: + + **vim ${client_home}/Spark2x/spark/conf/hdfs-site.xml** + + .. code-block:: + + + dfs.namenode.acls.enabled + false + + +#. For a security cluster, run the following command to perform user authentication. If Kerberos authentication is not enabled for the current cluster, you do not need to run this command. + + **kinit** *Username* + +#. Access OBS in spark-sql. For example, create a table named **test** in the **obs://mrs-word001/table/** directory. + +#. Run the **cd** **${client_home}/Spark2x/spark/bin** command to access the **spark bin** directory and run **./spark-sql** to log in to spark-sql CLI. + +#. Run the following command in the spark-sql CLI: + + **create table test(id int) location '**\ *obs://mrs-word001/table/*\ **';** + +#. Run the **show tables;** command to confirm that the table is created successfully. + +#. Run **exit;** to exit the spark-sql CLI. + + .. note:: + + If a large number of logs are printed in the OBS file system, the read and write performance may be affected. You can adjust the log level of the OBS client as follows: + + **cd ${client_home}/Spark2x/spark/conf** + + **vi log4j.properties** + + Add the OBS log level configuration to the file as follows: + + **log4j.logger.org.apache.hadoop.fs.obs=WARN** + + **log4j.logger.com.obs=WARN** + + |image3| + +.. |image1| image:: /_static/images/en-us_image_0000001390934292.png +.. |image2| image:: /_static/images/en-us_image_0000001390455252.png +.. |image3| image:: /_static/images/en-us_image_0000001349257353.png diff --git a/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_sqoop_with_external_storage_systems.rst b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_sqoop_with_external_storage_systems.rst new file mode 100644 index 0000000..01438af --- /dev/null +++ b/umn/source/configuring_a_cluster_with_storage_and_compute_decoupled/using_a_storage-compute_decoupled_cluster/interconnecting_sqoop_with_external_storage_systems.rst @@ -0,0 +1,139 @@ +:original_name: mrs_01_24294.html + +.. 
_mrs_01_24294: + +Interconnecting Sqoop with External Storage Systems +=================================================== + +Exporting Data From HDFS to MySQL Using the **sqoop export** Command +-------------------------------------------------------------------- + +#. Log in to the node where the client is located. + +#. Run the following command to initialize environment variables: + + **source /opt/client/bigdata_env** + +#. Run the following command to operate the Sqoop client: + + **sqoop export --connect jdbc:mysql://10.100.231.134:3306/test --username root --password xxxxxx --table component13 -export-dir hdfs://hacluster/user/hive/warehouse/component_test3 --fields-terminated-by ',' -m 1** + + .. table:: **Table 1** Parameter description + + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +======================================+===========================================================================================================================================================================================================================================================================================================================================================================================+ + | -direct | Imports data to a relational database using a database import tool, for example, mysqlimport of MySQL, more efficient than the JDBC connection mode. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -export-dir | Specifies the source directory for storing data in the HDFS. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -m or -num-mappers | Starts *n* (4 by default) maps to import data concurrently. The value cannot be greater than the maximum number of maps in a cluster. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -table | Specifies the relational database table to be imported. 
| + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -update-key | Specifies the column used for updating the existing data in a relational database. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -update-mode | Specifies how updates are performed. The value can be **updateonly** or **allowinsert**. This parameter is used only when the relational data table does not contain the data record to be imported. For example, if the HDFS data to be imported to the destination table contains a data record **id=1** and the table contains an existing data record **id=2**, the update will fail. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -input-null-string | This parameter is optional. If it is not specified, **null** will be used. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -input-null-non-string | This parameter is optional. If it is not specified, **null** will be used. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -staging-table | Creates a table with the same data structure as the destination table for storing data before it is imported to the destination table. | + | | | + | | This parameter ensures the transaction security when data is imported to a relational database table. Due to multiple transactions during an import, this parameter can prevent other transactions from being affected when one transaction fails. For example, the imported data is incorrect or duplicate records exist. 
| + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -clear-staging-table | Clears data in the staging table before data is imported if the staging-table is not empty. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Importing Data from MySQL to Hive Using the sqoop import Command +---------------------------------------------------------------- + +#. Log in to the node where the client is located. + +#. Run the following command to initialize environment variables: + + **source /opt/client/bigdata_env** + +#. Run the following command to operate the Sqoop client: + + **sqoop import --connect jdbc:mysql://10.100.231.134:3306/test --username root --password xxxxxx --table component --hive-import --hive-table component_test2 --delete-target-dir --fields-terminated-by "," -m 1 --as-textfile** + + .. table:: **Table 2** Parameter description + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=============================================================================================================================================================================================================================================================================================================================================================================+ + | -append | Appends data to an existing dataset in the HDFS. Once this parameter is used, Sqoop imports data to a temporary directory, renames the temporary file where the data is stored, and moves the file to a formal directory to avoid duplicate file names in the directory. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -as-avrodatafile | Imports data to a data file in the Avro format. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -as-sequencefile | Imports data to a sequence file. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -as-textfile | Import data to a text file. After the text file is generated, you can run SQL statements in Hive to query the result. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -boundary-query | Specifies the SQL statement for performing boundary query. Before importing data, use a SQL statement to obtain a result set and import the data in the result set. The data format can be **-boundary-query 'select id,creationdate from person where id = 3'** (indicating a data record whose ID is 3) or **select min(), max() from **. | + | | | + | | The fields to be queried cannot contain fields whose data type is string. Otherwise, the error message "java.sql.SQLException: Invalid value for getLong()" is displayed. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -columns | Specifies the fields to be imported. The format is **-Column id,\ Username**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -direct | Imports data to a relational database using a database import tool, for example, mysqlimport of MySQL, more efficient than the JDBC connection mode. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -direct-split-size | Splits the imported streams by byte. 
Especially when data is imported from PostgreSQL using the direct mode, a file that reaches the specified size can be divided into several independent files. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -inline-lob-limit | Sets the maximum value of an inline LOB. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -m or -num-mappers | Starts *n* (4 by default) maps to import data concurrently. The value cannot be greater than the maximum number of maps in a cluster. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -query, -e | Imports data from the query result. To use this parameter, you must specify the **-target-dir** and **-hive-table** parameters and use the query statement containing the WHERE clause as well as $CONDITIONS. | + | | | + | | Example: **-query'select \* from person where $CONDITIONS' -target-dir /user/hive/warehouse/person -hive-table person** | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -split-by | Specifies the column of a table used to split work units. Generally, the column name is followed by the primary key ID. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -table | Specifies the relational database table from which data is obtained. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -target-dir | Specifies the HDFS path. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -warehouse-dir | Specifies the directory for storing data to be imported. This parameter is applicable when data is imported to HDFS but cannot be used when you import data to Hive directories. This parameter cannot be used together with **-target-dir**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -where | Specifies the WHERE clause when data is imported from a relational database, for example, **-where 'id = 2'**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -z,-compress | Compresses sequence, text, and Avro data files using the GZIP compression algorithm. Data is not compressed by default. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -compression-codec | Specifies the Hadoop compression codec. GZIP is used by default. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -null-string | Specifies the string to be interpreted as **NULL** for string columns. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -null-non-string | Specifies the string to be interpreted as null for non-string columns. If this parameter is not specified, **NULL** will be used. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -check-column (col) | Specifies the column for checking incremental data import, for example, **id**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -incremental (mode) append | Incrementally imports data. | + | | | + | or last modified | **append**: appends records, for example, appending records that are greater than the value specified by **last-value**. | + | | | + | | **lastmodified**: appends data that is modified after the date specified by **last-value**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -last-value (value) | Specifies the maximum value (greater than the specified value) of the column after the last import. This parameter can be set as required. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Sqoop Usage Example +------------------- + +- Importing data from MySQL to HDFS using the **sqoop import** command + + **sqoop import --connect jdbc:mysql://10.100.231.134:3306/test --username root --password** *xxx* **--query 'SELECT \* FROM component where $CONDITIONS and component_id ="MRS 1.0_002"' --target-dir /tmp/component_test --delete-target-dir --fields-terminated-by "," -m 1 --as-textfile** + +- Exporting data from OBS to MySQL using the **sqoop export** command + + **sqoop export --connect jdbc:mysql://10.100.231.134:3306/test --username root --password** *xxx* **--table component14 -export-dir obs://obs-file-bucket/xx/part-m-00000 --fields-terminated-by ',' -m 1** + +- Importing data from MySQL to OBS using the **sqoop import** command + + **sqoop import --connect jdbc:mysql://10.100.231.134:3306/test --username root --password** *xxx* **--table component --target-dir obs://obs-file-bucket/xx --delete-target-dir --fields-terminated-by "," -m 1 --as-textfile** + +- Importing data from MySQL to OBS tables outside Hive + + **sqoop import --connect jdbc:mysql://10.100.231.134:3306/test --username root --password** *xxx* **--table component --hive-import --hive-table component_test01 --fields-terminated-by "," -m 1 --as-textfile** diff --git a/umn/source/data_backup_and_restoration/hbase_data.rst b/umn/source/data_backup_and_restoration/hbase_data.rst new file mode 100644 index 0000000..73133cd --- /dev/null +++ b/umn/source/data_backup_and_restoration/hbase_data.rst @@ -0,0 +1,239 @@ +:original_name: mrs_01_0448.html + +.. _mrs_01_0448: + +HBase Data +========== + +Currently, HBase data can be backed up in the following modes: + +- Snapshots +- Replication +- Export +- CopyTable +- HTable API +- Offline backup of HDFS data + +:ref:`Table 1 ` compares the impact of operations from six perspectives. + +.. _mrs_01_0448__table163113513341: + +.. 
table:: **Table 1** Data backup mode comparison on HBase + + +-----------------------------+--------------------+----------------+--------------------------+--------------------+------------------------+----------------------------+ + | Backup Mode | Performance Impact | Data Footprint | Downtime | Incremental Backup | Ease of Implementation | Mean Time to Repair (MTTR) | + +=============================+====================+================+==========================+====================+========================+============================+ + | Snapshots | Minimal | Tiny | Brief (Only for Restore) | No | Easy | Seconds | + +-----------------------------+--------------------+----------------+--------------------------+--------------------+------------------------+----------------------------+ + | Replication | Minimal | Large | None | Intrinsic | Medium | Seconds | + +-----------------------------+--------------------+----------------+--------------------------+--------------------+------------------------+----------------------------+ + | Export | High | Large | None | Yes | Easy | High | + +-----------------------------+--------------------+----------------+--------------------------+--------------------+------------------------+----------------------------+ + | CopyTable | High | Large | None | Yes | Easy | High | + +-----------------------------+--------------------+----------------+--------------------------+--------------------+------------------------+----------------------------+ + | HTable API | Medium | Large | None | Yes | Difficult | Up to you | + +-----------------------------+--------------------+----------------+--------------------------+--------------------+------------------------+----------------------------+ + | Offline backup of HDFS data | ``-`` | Large | Long | No | Medium | High | + +-----------------------------+--------------------+----------------+--------------------------+--------------------+------------------------+----------------------------+ + +Snapshots +--------- + +You can perform the snapshot operation on a table to generate a snapshot for the table. The snapshot can be used to back up the original table, roll back the original table when the original table is faulty, as well as back up data cross clusters. After a snapshot is executed, the **.hbase-snapshot** directory is generated in the HBase root directory (**/hbase** by default) on HBase. The directory contains details about each snapshot. When the **ExportSnapshot** command is executed to export the snapshot, an MR task is submitted locally to copy the snapshot information and table's **HFile** to **/hbase/.hbase-snapshot** and **/hbase/archive** of the standby cluster respectively. For details, see http://hbase.apache.org/2.2/book.html#ops.snapshots. + +- This backup mode has the following advantages: + + The single table backup efficiency is high. Online data can be backed up locally or remotely without interrupting services of the active and standby clusters. The number of maps and traffic threshold can be flexibly configured. A MapReduce executor node does not need to be deployed in the active and standby clusters. Therefore, no resource is consumed. + +- This backup mode has the following disadvantages and limitations: + + Only a single table can be backed up. The name of the table to be backed up has been specified in the snapshot and cannot be changed. Incremental backup cannot be performed. Resources are consumed when an MR task runs. 
+
+**Perform the following operations on the active cluster:**
+
+#. Create a snapshot for a table. For example, create snapshot **member_snapshot** for the **member** table.
+
+   **snapshot 'member','member_snapshot'**
+
+#. Copy the snapshot to the standby cluster.
+
+   **hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot member_snapshot -copy-to hdfs://IP address of the active NameNode of the HDFS service in the standby cluster:Port number/hbase -mappers 3**
+
+   - The data directory of the standby cluster must be the HBase root directory (**/hbase**).
+   - **-mappers** indicates the number of maps to be submitted for the MR task.
+
+**Perform the following operations on the standby cluster:**
+
+Run the **restore_snapshot** command to automatically create a table in the standby cluster and establish a link between the HFiles in **archive** and the table.
+
+**restore_snapshot 'member_snapshot'**
+
+.. note::
+
+   If only table data needs to be backed up, the Snapshots mode is highly recommended. Use **ExportSnapshot** to submit an MR task locally that copies the snapshot and HFiles to the standby cluster, where the data can then be loaded directly. This is more efficient than the other methods.
+
+Replication
+-----------
+
+In Replication backup mode, a disaster recovery relationship is established between the active and standby clusters on HBase. When data is written to the active cluster, the active cluster pushes data to the standby cluster through WAL to implement real-time data synchronization between the active and standby clusters. For details, see http://hbase.apache.org/2.2/book.html#_cluster_replication.
+
+- This backup mode has the following advantages:
+
+  - Replication is different from other data backup modes. After the active/standby relationship between clusters is established, data can be synchronized in real time without manual operations.
+  - The backup operation consumes few cluster resources and has little impact on cluster performance.
+  - Data synchronization reliability is high. If the standby cluster is stopped for a while and then recovered, data generated during this period on the active cluster can still be synchronized to the standby cluster.
+
+- This backup mode has the following disadvantages and limitations:
+
+  - If WAL is not set for data written by clients, the data cannot be backed up to the standby cluster.
+  - Data is synchronized asynchronously in the background to limit resource usage. Therefore, data is not synchronized in real time.
+  - If data already exists in the active cluster before replication is configured, you need to use other methods to import that data to the standby cluster.
+  - If data is written to the active cluster in **bulkload** mode, it cannot be synchronized. (HBase on MRS enhances replication. Therefore, data written in the **bulkload** mode can be synchronized by replication.)
+
+For details about how to use and configure HBase backup, see `Configuring HBase Replication `__ and `Using the ReplicationSyncUp Tool `__.
+
+Export/Import
+-------------
+
+Export starts a MapReduce task to scan the data table and writes SequenceFiles to the remote HDFS. Import then reads the SequenceFiles and puts them on HBase.
+
+- This backup mode has the following advantages:
+
+  Online copy does not interrupt services. Because data is written to the new table in **scan** -> **put** mode, Export/Import is more flexible than CopyTable. Data can be obtained and used flexibly, and it can be written incrementally.
+
+- This backup mode has the following disadvantages and limitations:
+
+  Export writes SequenceFiles to the remote HDFS through a MapReduce task, and then Import reads the SequenceFiles and puts them on HBase. Therefore, two MR tasks need to be executed, which makes this mode inefficient.
+
+**Perform the following operations on the active cluster:**
+
+Run the **Export** command to export the table.
+
+**hbase org.apache.hadoop.hbase.mapreduce.Export** *Table name* *Output directory*
+
+Example: **hbase org.apache.hadoop.hbase.mapreduce.Export member hdfs://IP address of the active NameNode of the HDFS service in the standby cluster:Port number/user/table/member**
+
+In the command, **member** indicates the name of the table to be exported.
+
+**Perform the following operations on the standby cluster:**
+
+#. After operations are executed on the active cluster, you can view the generated directory data on the standby cluster, as shown in :ref:`Figure 1 `.
+
+   .. _mrs_01_0448__fig148041121174318:
+
+   .. figure:: /_static/images/en-us_image_0000001349137569.png
+      :alt: **Figure 1** Directory data
+
+      **Figure 1** Directory data
+
+#. Run the **create** command to create a table in the standby cluster with the same structure as that of the active cluster, for example, **member_import**.
+
+#. .. _mrs_01_0448__li186481362121:
+
+   Run the **Import** command to generate the HFile data on HDFS.
+
+   **hbase org.apache.hadoop.hbase.mapreduce.Import** *Table name* *Input directory*
+
+   Example: **hbase org.apache.hadoop.hbase.mapreduce.Import member_import /user/table/member -Dimport.bulk.output=/tmp/member**
+
+   - **member_import** indicates a table in the standby cluster with the same table structure as that of the active cluster.
+   - **-Dimport.bulk.output** indicates the output directory of the HFile data.
+   - **/user/table/member** indicates the directory for storing data exported from the active cluster.
+
+#. Perform the **Load** operation to write the HFile data to HBase.
+
+   **hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/member member**
+
+   - **/tmp/member** indicates the output directory of the HFile data in :ref:`3 `.
+   - **member** indicates the name of the table to which data is to be imported in the standby cluster.
+
+CopyTable
+---------
+
+The function of CopyTable is similar to that of Export. Like Export, CopyTable uses the HBase API to create a MapReduce task that reads data from the source table. However, the difference is that the output of CopyTable is an HBase table, which can be stored in a local or remote cluster. For details, see http://hbase.apache.org/2.2/book.html#copy.table.
+
+- This backup mode has the following advantages:
+
+  The operation is simple. Online copy does not interrupt services. You can specify the **startrow**, **endrow**, and **timestamp** parameters of the backup data.
+
+- This backup mode has the following disadvantages and limitations:
+
+  Only a single table can be processed at a time. The efficiency is low when a large amount of data is remotely copied. The MapReduce task consumes local resources. The number of maps of the MapReduce task is determined by the number of regions in the table.
+
+**Perform the following operations on the standby cluster:**
+
+Run the **create** command to create a table in the standby cluster with the same structure as that of the active cluster, for example, **member_copy**.
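+
+A minimal sketch of this step in the HBase shell, assuming the original **member** table has a single column family named **info** (replace it with the actual column families of your table):
+
+**create 'member_copy', 'info'**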
+ +**Perform the following operations on the active cluster:** + +Run the following CopyTable command to copy the table: + +**hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--starttime=xxxxxx] [--endtime=xxxxxx] --new.name=member_copy --peer.adr=server1,server2,server3:2181:/hbase [--families=myOldCf:myNewCf,cf2,cf3] TestTable** + +- **starttime/endtime** indicates the timestamp of the data to be copied. +- **new.name** indicates the name of the destination table in the standby cluster. The default name of the new table is the same as that of the original table. +- **peer.adr** indicates the information about the ZooKeeper node in the standby cluster. The format is **quorumer:port:/hbase**. +- **families** indicates the family column of the table to be copied. + +.. note:: + + If data is copied to a remote cluster, a MapReduce task is submitted on the host cluster to import the data. After the full or partial data in the original table is read, it is written to the remote cluster in **put** mode. Therefore, if the table contains a large amount of data (remote copy does not support **bulkload**), the efficiency is unsatisfactory. + +HTable API +---------- + +HTable API imports and exports data of the original HBase table in the code. You can use the public API to write customized client applications to directly query tables, or design other methods based on the batch processing advantages of MapReduce tasks. This mode requires in-depth understanding of Hadoop development and the impact on the production cluster. + +Offline backup of HDFS data +--------------------------- + +Offline backup of HDFS data means stopping the HBase service and allowing users to manually copy the HDFS data. + +- This backup mode has the following advantages: + + - All data (including metadata) in the active cluster can be copied to the standby cluster at a time. + - Data is directly copied by DistCp. Therefore, the data backup efficiency is relatively high. + + - You can copy data based on the site requirements. You can copy data of only one table or copy one HFile in a region. + +- This backup mode has the following disadvantages and limitations: + + - This operation will overwrite the HDFS data directory in the standby cluster. + - If the HBase versions of the active and standby clusters are different, an error may occur when the HDFS directory is directly copied. For example, if the system table **index** is added to the MRS **hbase1.3** and overwritten by the HDFS directory of the earlier version, the table cannot be found. Therefore, exercise caution when using this mode. + - This operation has certain requirements on HBase capabilities. If an exception occurs, restore HBase based on the site requirements. + +**Perform the following operations on the active cluster:** + +#. Run the following command to save the data in the current cluster to HDFS permanently: + + **flush 'tableName'** + +#. Stop the HBase service. + +#. Run the following commands to copy the HDFS data of the current cluster to the standby cluster: + + **hadoop distcp -i /hbase/data hdfs://IP address of the active NameNode of the HDFS service in the standby cluster:Port number/hbase** + + **hadoop distcp -update -append -delete /hbase/ hdfs://IP address of the active NameNode of the HDFS service in the standby cluster:Port number/hbase/** + + The second command is used to incrementally copy files except the data directory. For example, data in **archive** may be referenced by the data directory. 
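For illustration, the two DistCp commands are shown below with example values filled in. The NameNode IP address (192.168.1.10) and port (9820) are placeholders only; replace them with the actual address and RPC port of the active NameNode in your standby cluster:

.. code-block::

   # full copy of the HBase data directory, ignoring failed copies (-i)
   hadoop distcp -i /hbase/data hdfs://192.168.1.10:9820/hbase

   # incremental copy of the remaining files under /hbase (for example, archive)
   hadoop distcp -update -append -delete /hbase/ hdfs://192.168.1.10:9820/hbase/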
+ +**Perform the following operations on the standby cluster:** + +#. Restart the HBase service for the data migration to take effect. During the restart, HBase loads the data in the current HDFS and regenerates metadata. + +#. After the restart is complete, run the following command on the Master node client to load the HBase table data: + + .. code-block:: + + $HBase_Home/bin/hbase hbck -fixMeta -fixAssignments + +#. After the command is executed, run the following command repeatedly to check the health status of the HBase cluster until the health status is normal: + + .. code-block:: + + hbase hbck + + .. note:: + + If the HBase coprocessor is used and custom JAR files are stored in the **regionserver/hmaster** of the active cluster, you need to copy the custom JAR files before restarting the HBase service on the standby cluster. diff --git a/umn/source/data_backup_and_restoration/hdfs_data.rst b/umn/source/data_backup_and_restoration/hdfs_data.rst new file mode 100644 index 0000000..983707e --- /dev/null +++ b/umn/source/data_backup_and_restoration/hdfs_data.rst @@ -0,0 +1,81 @@ +:original_name: mrs_01_0445.html + +.. _mrs_01_0445: + +HDFS Data +========= + +.. _mrs_01_0445__section2349182854814: + +Establishing a Data Transmission Channel +---------------------------------------- + +- If the source cluster and destination cluster are deployed in different VPCs in the same region, create a network connection between the two VPCs to establish a data transmission channel at the network layer. For details, see **Virtual Private Cloud > User Guide > VPC Peering Connection**. +- If the source cluster and destination cluster are deployed in the same VPC but belong to different security groups, add security group rules to each security group on the VPC management console. In the security rules, **Protocol** is set to **ANY**, **Transfer Direction** is set to **Inbound**, and **Source** is set to **Security Group** (the security group of the peer cluster). + + - To add an inbound rule to the security group of the source cluster, select the security group of the destination cluster in **Source**. + - To add an inbound rule to the security group of the destination cluster, select the security group of the source cluster in **Source**. + +- If the source and destination clusters are deployed in the same security group of the same VPC and Kerberos authentication is enabled for both clusters, configure mutual trust between the two clusters. + +Backing Up HDFS Data +-------------------- + +Based on the regions of and network connectivity between the source cluster and destination cluster, data backup scenarios are classified as follows: + +- .. _mrs_01_0445__li529181519336: + + Same Region + + If the source cluster and destination cluster are in the same region, set up a network transmission channel. Use the DistCp tool to run the following command to copy the HDFS, HBase, Hive data files and Hive metadata backup files from the source cluster to the destination cluster. + + .. code-block:: + + $HADOOP_HOME/bin/hadoop distcp -p + + The following provides description about the parameters in the preceding command. 
+ + - **$HADOOP_HOME**: installation directory of the Hadoop client in the destination cluster + - **<src>**: HDFS directory of the source cluster + - **<dist>**: HDFS directory of the destination cluster + +- Different Regions + + If the source cluster and destination cluster are in different regions, use the DistCp tool to copy the source cluster data to OBS, and then use the OBS cross-region replication function (for details, see **Object Storage Service > Console Operation Guide > Cross-Region Replication**) to copy the data to an OBS bucket in the region where the destination cluster resides. If DistCp is used, permission, owner, and group information cannot be set for files on OBS. In this case, you need to export and copy the HDFS metadata while exporting the data to prevent the loss of HDFS file property information. + +- Migrating Data from an Offline Cluster to the Cloud + + You can use the following method to migrate data from an offline cluster to the cloud. + + - Direct Connect + + Create a Direct Connect connection between the source cluster and destination cluster, enable the network between the offline cluster egress gateway and the online VPC, and run DistCp to copy the data by referring to the method provided in :ref:`Same Region <mrs_01_0445__li529181519336>`. + +Backing Up HDFS Metadata +------------------------ + +HDFS metadata information to be exported includes file and folder permissions and owner/group information. You can run the following command on the HDFS client to export the metadata: + +.. code-block:: + + $HADOOP_HOME/bin/hdfs dfs -ls -R <migrating_path> > /tmp/hdfs_meta.txt + +The following provides description about the parameters in the preceding command. + +- **$HADOOP_HOME**: installation directory of the Hadoop client in the source cluster +- **<migrating_path>**: HDFS data directory to be migrated +- **/tmp/hdfs_meta.txt**: local path for storing the exported metadata + +.. note:: + + If the source cluster can communicate with the destination cluster and you run the **hadoop distcp** command as a super administrator to copy data, you can add the **-p** parameter to enable DistCp to restore the metadata of the corresponding files in the destination cluster while copying data. In this case, skip this step. + +HDFS File Property Restoration +------------------------------ + +Based on the exported permission information, run the following HDFS commands on the destination cluster to restore the file permission and owner/group information: + +.. code-block:: + + $HADOOP_HOME/bin/hdfs dfs -chmod <MODE> <path> + $HADOOP_HOME/bin/hdfs dfs -chown <owner>:<group> <path> diff --git a/umn/source/data_backup_and_restoration/hive_data.rst b/umn/source/data_backup_and_restoration/hive_data.rst new file mode 100644 index 0000000..3b59cb7 --- /dev/null +++ b/umn/source/data_backup_and_restoration/hive_data.rst @@ -0,0 +1,8 @@ +:original_name: mrs_01_0447.html + +.. _mrs_01_0447: + +Hive Data +========= + +Hive data is not backed up independently. For details, see :ref:`HDFS Data <mrs_01_0445>`. diff --git a/umn/source/data_backup_and_restoration/hive_metadata.rst b/umn/source/data_backup_and_restoration/hive_metadata.rst new file mode 100644 index 0000000..2ef0f1e --- /dev/null +++ b/umn/source/data_backup_and_restoration/hive_metadata.rst @@ -0,0 +1,50 @@ +:original_name: mrs_01_0446.html + +.. _mrs_01_0446: + +Hive Metadata +============= + +Backing Up Hive Metadata +------------------------ + +Hive table data is stored in HDFS. Both the table data and its HDFS metadata are migrated centrally by directory through HDFS in a unified manner.
Metadata of Hive tables can be stored in different types of relational databases (such as MySQL, PostgreSQL, and Oracle) depending on the cluster configuration. The Hive table metadata exported in this document is the Hive table description stored in the relational database. + +Mainstream big data releases in the industry support Sqoop installation. For on-premises big data clusters of the community version, you can download and install the community version of Sqoop. Sqoop decouples the metadata to be exported from its strong dependency on the relational database, so that the Hive metadata can be exported to HDFS and migrated together with the table data for restoration. The procedure is as follows: + +#. Download the Sqoop tool and install it in the source cluster. For details, see http://sqoop.apache.org/. + +#. Download the JDBC driver of the relational database to the **$Sqoop_Home/lib** directory. + +#. Run the following command to export all Hive metadata tables. All exported data is stored in the **/user/<user_name>/** directory on HDFS. + + .. code-block:: + + $Sqoop_Home/bin/sqoop import --connect jdbc:<database_type>://<ip>:<port>/<database_name> --table <table_name> --username <user> -password <passwd> -m 1 + + The following provides description about the parameters in the preceding command. + + - **$Sqoop_Home**: Sqoop installation directory + - **<database_type>**: Database type + - **<ip>**: IP address of the database in the source cluster + - **<port>**: Port number of the database in the source cluster + - **<database_name>**: Name of the database that stores the Hive metadata + - **<table_name>**: Name of the table to be exported + - **<user>**: Username + - **<passwd>**: User password + +Hive Metadata Restoration +------------------------- + +Install Sqoop and run the Sqoop command in the destination cluster to import the exported Hive metadata to DBService in the MRS cluster. + +.. code-block:: + + $Sqoop_Home/bin/sqoop export --connect jdbc:postgresql://<ip>:20051/hivemeta --table <table_name> --username hive -password <passwd> --export-dir <export_from> + +The following provides description about the parameters in the preceding command. + +- **$Sqoop_Home**: Sqoop installation directory in the destination cluster +- **<ip>**: IP address of the database in the destination cluster +- **<table_name>**: Name of the table to be restored +- **<passwd>**: Password of user **hive** +- **<export_from>**: HDFS address of the metadata in the destination cluster diff --git a/umn/source/data_backup_and_restoration/index.rst b/umn/source/data_backup_and_restoration/index.rst new file mode 100644 index 0000000..2adf40b --- /dev/null +++ b/umn/source/data_backup_and_restoration/index.rst @@ -0,0 +1,22 @@ +:original_name: mrs_01_0444.html + +.. _mrs_01_0444: + +Data Backup and Restoration +=========================== + +- :ref:`HDFS Data <mrs_01_0445>` +- :ref:`Hive Metadata <mrs_01_0446>` +- :ref:`Hive Data <mrs_01_0447>` +- :ref:`HBase Data <mrs_01_0448>` +- :ref:`Kafka Data <mrs_01_0449>` + +.. toctree:: + :maxdepth: 1 + :hidden: + + hdfs_data + hive_metadata + hive_data + hbase_data + kafka_data diff --git a/umn/source/data_backup_and_restoration/kafka_data.rst b/umn/source/data_backup_and_restoration/kafka_data.rst new file mode 100644 index 0000000..7a3f1e4 --- /dev/null +++ b/umn/source/data_backup_and_restoration/kafka_data.rst @@ -0,0 +1,54 @@ +:original_name: mrs_01_0449.html + +.. _mrs_01_0449: + +Kafka Data +========== + +MirrorMaker is a powerful tool for Kafka data synchronization. It is used when data needs to be synchronized between two Kafka clusters or when data in the original Kafka cluster needs to be migrated to a new Kafka cluster. MirrorMaker is a built-in tool in Kafka that integrates the functions of the Kafka consumer and producer.
MirrorMaker can read data from one Kafka cluster and write the data to another Kafka cluster to implement data synchronization between Kafka clusters. + +This section describes how to use the MirrorMaker tool provided by MRS to synchronize and migrate Kafka cluster data. Before migrating Kafka data, ensure that the two clusters can communicate with each other by following the instructions provided in :ref:`Establishing a Data Transmission Channel `. + +Procedure +--------- + +**Versions earlier than MRS 3.x:** + +#. .. _mrs_01_0449__li1980875616292: + + Enable the Kerberos authentication for clusters. + +#. If you plan to use the MirrorMaker tool in a source cluster, log in to MRS Manager of a destination cluster and choose **Services**. If you plan to use the MirrorMaker tool in a destination cluster, log in to MRS Manager of a source cluster and choose **Services**. + +#. Choose **Kafka** > **Service Configuration**, and change **Basic** to **All** in the parameter type drop-down box. + +#. Click **Broker** > **Customization** and add the following rules on the displayed page: + + **sasl.kerberos.principal.to.local.rules = RULE:[1:$1@$0](.*@XXXYYYZZZ.COM)s/@.*//,RULE:[2:$1@$0](.*@ XXXYYYZZZ.COM)s/@.*//,DEFAULT** + + In the preceding rule, **XXXYYYZZZ.COM** indicates the domain name of the cluster (source cluster) where data resides. The domain name must be spelled in uppercase letters. + +#. .. _mrs_01_0449__li854919509282: + + Click **Save Configuration** and select **Restart the affected services or instances**. Click **Yes** to restart the Kafka service. + + .. note:: + + For a security cluster with the Kerberos authentication enabled, perform :ref:`1 ` to :ref:`5 `. For a normal cluster with the Kerberos authentication disabled, skip :ref:`1 ` to :ref:`5 ` and go to :ref:`6 `. + +#. .. _mrs_01_0449__li3402143084520: + + In the cluster that uses the MirrorMaker tool, go to the cluster details page and choose **Services**. + +#. Choose **Kafka** > **Service Configuration**, change **Basic** to **All** in the parameter type drop-down box, and change **All Roles** to **MirrorMaker**. + + Parameter description: + + - The **bootstrap.servers** parameter in the **source** and **dest** tags indicates the **broker** node list and port information of the source and destination Kafka clusters respectively. + - Set parameter **security.protocol** in the **source** and **dest** tags based on the actual configurations of the source and destination Kafka clusters. + - If the source Kafka cluster or destination Kafka cluster is a security cluster, you need to set **kerberos.domain.name** and **sasl.kerberos.service.name** in the **source** and **dest** tags. If the local host is used, you do not need to set **kerberos.domain.name**. If the local host is not used, set **kerberos.domain.name** and **sasl.kerberos.service.name** based on the site requirements. The default value of **sasl.kerberos.service.name** is **kafka**. + - Set **whitelist** in the **mirror** tag, that is, the name of the topic to be synchronized. + +#. Click **Save Configuration** and select **Restart the affected services or instances**. Click **Yes** to restart the MirrorMaker instance. + + After MirrorMaker is restarted, the data migration task is started. 
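For reference, the following sketch shows what the MirrorMaker settings described above could look like for two normal (non-Kerberos) clusters. The parameter names under the **source**, **dest**, and **mirror** tags and the broker addresses are illustrative only; take the exact names and values from the MirrorMaker configuration page on Manager:

.. code-block::

   # source: Kafka cluster to consume from (example broker addresses)
   source.bootstrap.servers = 192.168.0.11:9092,192.168.0.12:9092
   source.security.protocol = PLAINTEXT

   # dest: Kafka cluster to produce to
   dest.bootstrap.servers = 192.168.1.21:9092,192.168.1.22:9092
   dest.security.protocol = PLAINTEXT

   # mirror: topics to be synchronized
   mirror.whitelist = topic1,topic2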
diff --git a/umn/source/faq/account_and_password/how_do_i_query_and_change_the_password_validity_period_of_an_account.rst b/umn/source/faq/account_and_password/how_do_i_query_and_change_the_password_validity_period_of_an_account.rst new file mode 100644 index 0000000..0561230 --- /dev/null +++ b/umn/source/faq/account_and_password/how_do_i_query_and_change_the_password_validity_period_of_an_account.rst @@ -0,0 +1,71 @@ +:original_name: mrs_03_1249.html + +.. _mrs_03_1249: + +How Do I Query and Change the Password Validity Period of an Account? +===================================================================== + +Querying the Password Validity Period +------------------------------------- + +**Querying the password validity period of a component running user (human-machine user or machine-machine user):** + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to switch to the client directory, for example, **/opt/Bigdata/client**: + + **cd /opt/Bigdata/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Run the following command and enter the password of user **kadmin/admin** to log in to the kadmin console: + + **kadmin -p kadmin/admin** + + .. note:: + + The default password of user **kadmin/admin** is **Admin@123**. Change the password upon your first login or as prompted and keep the new password secure. + +#. Run the following command to view the user information: + + **getprinc** *Internal system username* + + Example: **getprinc user1** + + .. code-block:: + + kadmin: getprinc user1 + ...... + Expiration date: [never] + Last password change: Sun Oct 09 15:29:54 CST 2022 + Password expiration date: [never] + ...... + +**Querying the password validity period of an OS user:** + +#. Log in to any master node in the cluster as user **root**. + +#. Run the following command to view the password validity period (value of **Password expires**): + + **chage -l** *Username* + + For example, to view the password validity period of user **root**, run the **chage -l** **root** command. The command output is as follows: + + .. code-block:: console + + [root@xxx ~]#chage -l root + Last password change : Sep 12, 2021 + Password expires : never + Password inactive : never + Account expires : never + Minimum number of days between password change : 0 + Maximum number of days between password change : 99999 + Number of days of warning before password expires : 7 + +Changing the Password Validity Period +------------------------------------- + +- The password of a machine-machine user is randomly generated and never expires by default. +- The password validity period of a human-machine user can be changed by modifying the password policy on Manager. diff --git a/umn/source/faq/account_and_password/index.rst b/umn/source/faq/account_and_password/index.rst new file mode 100644 index 0000000..137a092 --- /dev/null +++ b/umn/source/faq/account_and_password/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_03_2003.html + +.. _mrs_03_2003: + +Account and Password +==================== + +- :ref:`What Is the Account for Logging In to Manager? ` +- :ref:`How Do I Query and Change the Password Validity Period of an Account? ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + what_is_the_account_for_logging_in_to_manager + how_do_i_query_and_change_the_password_validity_period_of_an_account diff --git a/umn/source/faq/account_and_password/what_is_the_account_for_logging_in_to_manager.rst b/umn/source/faq/account_and_password/what_is_the_account_for_logging_in_to_manager.rst new file mode 100644 index 0000000..ec937e7 --- /dev/null +++ b/umn/source/faq/account_and_password/what_is_the_account_for_logging_in_to_manager.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1027.html + +.. _mrs_03_1027: + +What Is the Account for Logging In to Manager? +============================================== + +The default account for logging in to Manager is **admin**, and the password is the one you set when you created the cluster. diff --git a/umn/source/faq/accounts_and_permissions/does_an_mrs_cluster_support_access_permission_control_if_kerberos_authentication_is_not_enabled.rst b/umn/source/faq/accounts_and_permissions/does_an_mrs_cluster_support_access_permission_control_if_kerberos_authentication_is_not_enabled.rst new file mode 100644 index 0000000..545c8d2 --- /dev/null +++ b/umn/source/faq/accounts_and_permissions/does_an_mrs_cluster_support_access_permission_control_if_kerberos_authentication_is_not_enabled.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1020.html + +.. _mrs_03_1020: + +Does an MRS Cluster Support Access Permission Control If Kerberos Authentication Is not Enabled? +================================================================================================ + +For MRS cluster 2.1.0 or earlier, choose **System** > **Configuration** > **Permission** on MRS Manager. + +For MRS cluster 3.\ *x* or later, choose **System** > **Permission** on FusionInsight Manager. diff --git a/umn/source/faq/accounts_and_permissions/does_hue_support_account_permission_configuration.rst b/umn/source/faq/accounts_and_permissions/does_hue_support_account_permission_configuration.rst new file mode 100644 index 0000000..6a187e5 --- /dev/null +++ b/umn/source/faq/accounts_and_permissions/does_hue_support_account_permission_configuration.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1121.html + +.. _mrs_03_1121: + +Does Hue Support Account Permission Configuration? +================================================== + +Hue does not provide an entry for configuring account permissions on its web UI. However, you can configure user roles and user groups for Hue accounts on the **System** tab on Manager. diff --git a/umn/source/faq/accounts_and_permissions/how_do_i_assign_tenant_management_permission_to_a_new_account.rst b/umn/source/faq/accounts_and_permissions/how_do_i_assign_tenant_management_permission_to_a_new_account.rst new file mode 100644 index 0000000..f227b63 --- /dev/null +++ b/umn/source/faq/accounts_and_permissions/how_do_i_assign_tenant_management_permission_to_a_new_account.rst @@ -0,0 +1,37 @@ +:original_name: mrs_03_1035.html + +.. _mrs_03_1035: + +How Do I Assign Tenant Management Permission to a New Account? +============================================================== + +You can assign tenant management permission only in analysis or hybrid clusters, but not in streaming clusters. + +The operations vary depending on the MRS cluster version: + +**Procedure for versions earlier than MRS cluster 3.x:** + +#. Log in to MRS Manager as user **admin**. +#. Choose **System** > **Manage User**. Select the new account, and click **Modify** in the **Operation** column. +#. In **Assign Rights by Role**, click **Select and Add Role**. 
+ + - If you bind the **Manager_tenant** role to the account, the account will have permission to view tenant management information. + - If you bind the **Manager_administrator** role to the account, the account will have permission to view and perform tenant management. + +#. Click **OK**. + +**Procedure for MRS cluster 3.x and later versions:** + +#. Log in to FusionInsight Manager and choose **System** > **Permission** > **User**. + +#. Locate the user and click **Modify**. + + Modify the parameters based on service requirements. + + If you bind the **Manager_tenant** role to the account, the account will have permission to view tenant management information. If you bind the **Manager_administrator** role to the account, the account will have permission to perform tenant management and view related information. + + .. note:: + + It takes about three minutes for the settings to take effect after user group or role permission are modified. + +#. Click **OK**. diff --git a/umn/source/faq/accounts_and_permissions/how_do_i_customize_an_mrs_policy.rst b/umn/source/faq/accounts_and_permissions/how_do_i_customize_an_mrs_policy.rst new file mode 100644 index 0000000..5252c06 --- /dev/null +++ b/umn/source/faq/accounts_and_permissions/how_do_i_customize_an_mrs_policy.rst @@ -0,0 +1,25 @@ +:original_name: mrs_03_1118.html + +.. _mrs_03_1118: + +How Do I Customize an MRS Policy? +================================= + +#. On the IAM console, choose **Permissions** in the navigation pane, and click **Create Custom Policy**. + +#. Set a policy name in **Policy Name**. + +#. Set **Scope** to **Project-level service** for MRS. + +#. Specify **Policy View**. The following options are supported: + + - **Visual editor**: Select cloud services, actions, resources, and request conditions from the navigation pane to customize the policy. You do not require knowledge of JSON syntax. + - **JSON**: Edit JSON policies from scratch or based on an existing policy. + + You can also click **Select Existing Policy/Role** in the **Policy Content** area to select an existing policy as the template for modification. + +#. (Optional) Enter a brief description in the **Description** area. + +#. Click **OK**. + +#. Attach the policy to a user group. Users in the group then inherit the permissions defined in this policy. diff --git a/umn/source/faq/accounts_and_permissions/index.rst b/umn/source/faq/accounts_and_permissions/index.rst new file mode 100644 index 0000000..0572377 --- /dev/null +++ b/umn/source/faq/accounts_and_permissions/index.rst @@ -0,0 +1,22 @@ +:original_name: mrs_03_2004.html + +.. _mrs_03_2004: + +Accounts and Permissions +======================== + +- :ref:`Does an MRS Cluster Support Access Permission Control If Kerberos Authentication Is not Enabled? ` +- :ref:`How Do I Assign Tenant Management Permission to a New Account? ` +- :ref:`How Do I Customize an MRS Policy? ` +- :ref:`Why Is the Manage User Function Unavailable on the System Page on MRS Manager? ` +- :ref:`Does Hue Support Account Permission Configuration? ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + does_an_mrs_cluster_support_access_permission_control_if_kerberos_authentication_is_not_enabled + how_do_i_assign_tenant_management_permission_to_a_new_account + how_do_i_customize_an_mrs_policy + why_is_the_manage_user_function_unavailable_on_the_system_page_on_mrs_manager + does_hue_support_account_permission_configuration diff --git a/umn/source/faq/accounts_and_permissions/why_is_the_manage_user_function_unavailable_on_the_system_page_on_mrs_manager.rst b/umn/source/faq/accounts_and_permissions/why_is_the_manage_user_function_unavailable_on_the_system_page_on_mrs_manager.rst new file mode 100644 index 0000000..5599b56 --- /dev/null +++ b/umn/source/faq/accounts_and_permissions/why_is_the_manage_user_function_unavailable_on_the_system_page_on_mrs_manager.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1037.html + +.. _mrs_03_1037: + +Why Is the Manage User Function Unavailable on the System Page on MRS Manager? +============================================================================== + +Check whether you have the **Manager_administrator** permission. If you do not have this permission, **Manage User** will not be available on the **System** page of MRS Manager. diff --git a/umn/source/faq/alarm_monitoring/how_do_i_understand_the_multi-level_chart_statistics_in_the_hbase_operation_requests_metric.rst b/umn/source/faq/alarm_monitoring/how_do_i_understand_the_multi-level_chart_statistics_in_the_hbase_operation_requests_metric.rst new file mode 100644 index 0000000..8d51e1d --- /dev/null +++ b/umn/source/faq/alarm_monitoring/how_do_i_understand_the_multi-level_chart_statistics_in_the_hbase_operation_requests_metric.rst @@ -0,0 +1,29 @@ +:original_name: mrs_03_1243.html + +.. _mrs_03_1243: + +How Do I Understand the Multi-Level Chart Statistics in the HBase Operation Requests Metric? +============================================================================================ + +The following uses the **Operation Requests on RegionServers** monitoring item as an example: + +#. Log in to FusionInsight Manager and choose **Cluster** > **Services** > **HBase** > **Resource**. On the displayed page, you can view the **Operation Requests on RegionServers** chart. If you click **all**, the top 10 RegionServers ranked by the total number of operation requests in the current cluster are displayed, the statistics interval is 5 minutes. + + |image1| + +2. Click a point in the chart. A level-2 chart is displayed, showing the number of operation requests of all RegionServers in the past 5 minutes. + + |image2| + +3. Click an operation statistics bar chart. A level-3 chart is displayed, showing the distribution of operations in each region within the period. + + |image3| + +4. Click a region name. The distribution chart of operations performed every 5 minutes in the last 12 hours is displayed. You can view the number of operations performed in the period. + + |image4| + +.. |image1| image:: /_static/images/en-us_image_0000001337953138.png +.. |image2| image:: /_static/images/en-us_image_0000001338429394.png +.. |image3| image:: /_static/images/en-us_image_0000001388629905.png +.. 
|image4| image:: /_static/images/en-us_image_0000001388630241.png diff --git a/umn/source/faq/alarm_monitoring/in_an_mrs_streaming_cluster,_can_the_kafka_topic_monitoring_function_send_alarm_notifications.rst b/umn/source/faq/alarm_monitoring/in_an_mrs_streaming_cluster,_can_the_kafka_topic_monitoring_function_send_alarm_notifications.rst new file mode 100644 index 0000000..cfb6d86 --- /dev/null +++ b/umn/source/faq/alarm_monitoring/in_an_mrs_streaming_cluster,_can_the_kafka_topic_monitoring_function_send_alarm_notifications.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1055.html + +.. _mrs_03_1055: + +In an MRS Streaming Cluster, Can the Kafka Topic Monitoring Function Send Alarm Notifications? +============================================================================================== + +The Kafka topic monitoring function cannot send alarms by email or SMS message. However, you can view alarm information on Manager. diff --git a/umn/source/faq/alarm_monitoring/index.rst b/umn/source/faq/alarm_monitoring/index.rst new file mode 100644 index 0000000..932b168 --- /dev/null +++ b/umn/source/faq/alarm_monitoring/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_03_2007.html + +.. _mrs_03_2007: + +Alarm Monitoring +================ + +- :ref:`In an MRS Streaming Cluster, Can the Kafka Topic Monitoring Function Send Alarm Notifications? ` +- :ref:`Where Can I View the Running Resource Queues When the Alarm "ALM-18022 Insufficient Yarn Queue Resources" Is Reported? ` +- :ref:`How Do I Understand the Multi-Level Chart Statistics in the HBase Operation Requests Metric? ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + in_an_mrs_streaming_cluster,_can_the_kafka_topic_monitoring_function_send_alarm_notifications + where_can_i_view_the_running_resource_queues_when_the_alarm_alm-18022_insufficient_yarn_queue_resources_is_reported + how_do_i_understand_the_multi-level_chart_statistics_in_the_hbase_operation_requests_metric diff --git a/umn/source/faq/alarm_monitoring/where_can_i_view_the_running_resource_queues_when_the_alarm_alm-18022_insufficient_yarn_queue_resources_is_reported.rst b/umn/source/faq/alarm_monitoring/where_can_i_view_the_running_resource_queues_when_the_alarm_alm-18022_insufficient_yarn_queue_resources_is_reported.rst new file mode 100644 index 0000000..c1d23c3 --- /dev/null +++ b/umn/source/faq/alarm_monitoring/where_can_i_view_the_running_resource_queues_when_the_alarm_alm-18022_insufficient_yarn_queue_resources_is_reported.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1222.html + +.. _mrs_03_1222: + +Where Can I View the Running Resource Queues When the Alarm "ALM-18022 Insufficient Yarn Queue Resources" Is Reported? +====================================================================================================================== + +Log in to FusionInsight Manager and choose **Cluster** > **Services** > **Yarn**. In the navigation pane on the left, choose **ResourceManager(Active)** and log in to the native Yarn page. + +For details, see the online help. diff --git a/umn/source/faq/api/how_do_i_configure_the_node_id_parameter_when_using_the_api_for_adjusting_cluster_nodes.rst b/umn/source/faq/api/how_do_i_configure_the_node_id_parameter_when_using_the_api_for_adjusting_cluster_nodes.rst new file mode 100644 index 0000000..f4b1f3d --- /dev/null +++ b/umn/source/faq/api/how_do_i_configure_the_node_id_parameter_when_using_the_api_for_adjusting_cluster_nodes.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1139.html + +.. 
_mrs_03_1139: + +How Do I Configure the node_id Parameter When Using the API for Adjusting Cluster Nodes? +======================================================================================== + +When you use the API for adjusting cluster nodes, the value of **node_id** is fixed to **node_orderadd**. diff --git a/umn/source/faq/api/index.rst b/umn/source/faq/api/index.rst new file mode 100644 index 0000000..fd5204b --- /dev/null +++ b/umn/source/faq/api/index.rst @@ -0,0 +1,14 @@ +:original_name: mrs_03_2015.html + +.. _mrs_03_2015: + +API +=== + +- :ref:`How Do I Configure the node_id Parameter When Using the API for Adjusting Cluster Nodes? ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + how_do_i_configure_the_node_id_parameter_when_using_the_api_for_adjusting_cluster_nodes diff --git a/umn/source/faq/big_data_service_development/can_i_export_the_query_result_of_hive_data.rst b/umn/source/faq/big_data_service_development/can_i_export_the_query_result_of_hive_data.rst new file mode 100644 index 0000000..1d3ee04 --- /dev/null +++ b/umn/source/faq/big_data_service_development/can_i_export_the_query_result_of_hive_data.rst @@ -0,0 +1,12 @@ +:original_name: mrs_03_1149.html + +.. _mrs_03_1149: + +Can I Export the Query Result of Hive Data? +=========================================== + +Run the following statement to export the query result of Hive data: + +.. code-block:: + + insert overwrite local directory "/tmp/out/" row format delimited fields terminated by "\t" select * from table; diff --git a/umn/source/faq/big_data_service_development/can_mrs_run_multiple_flume_tasks_at_a_time.rst b/umn/source/faq/big_data_service_development/can_mrs_run_multiple_flume_tasks_at_a_time.rst new file mode 100644 index 0000000..11bb872 --- /dev/null +++ b/umn/source/faq/big_data_service_development/can_mrs_run_multiple_flume_tasks_at_a_time.rst @@ -0,0 +1,24 @@ +:original_name: mrs_03_1059.html + +.. _mrs_03_1059: + +Can MRS Run Multiple Flume Tasks at a Time? +=========================================== + +The Flume client supports multiple independent data flows. You can configure and link multiple sources, channels, and sinks in the **properties.properties** configuration file. These components can be linked to form multiple flows. + +The following is an example of configuring two data flows in a configuration file: + +.. code-block:: + + server.sources = source1 source2 + server.sinks = sink1 sink2 + server.channels = channel1 channel2 + + #dataflow1 + server.sources.source1.channels = channel1 + server.sinks.sink1.channel = channel1 + + #dataflow2 + server.sources.source2.channels = channel2 + server.sinks.sink2.channel = channel2 diff --git a/umn/source/faq/big_data_service_development/can_mrs_write_data_to_hbase_through_the_hbase_external_table_of_hive.rst b/umn/source/faq/big_data_service_development/can_mrs_write_data_to_hbase_through_the_hbase_external_table_of_hive.rst new file mode 100644 index 0000000..77b57b6 --- /dev/null +++ b/umn/source/faq/big_data_service_development/can_mrs_write_data_to_hbase_through_the_hbase_external_table_of_hive.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1044.html + +.. _mrs_03_1044: + +Can MRS Write Data to HBase Through the HBase External Table of Hive? +===================================================================== + +No. Hive on HBase supports only data query. 
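For context, the following is a minimal sketch of how such a query-only mapping is typically defined with the standard Hive **HBaseStorageHandler** syntax; the table name, column family, and column names are examples only:

.. code-block::

   -- map an existing HBase table "member" (column family "info") into Hive
   CREATE EXTERNAL TABLE hbase_member(key string, address string)
   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
   WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:address")
   TBLPROPERTIES ("hbase.table.name" = "member");

   -- querying through the external table is supported
   SELECT * FROM hbase_member;

   -- writing through the external table (for example, INSERT) is not supported in this scenario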
diff --git a/umn/source/faq/big_data_service_development/can_the_hive_driver_be_interconnected_with_dbcp2.rst b/umn/source/faq/big_data_service_development/can_the_hive_driver_be_interconnected_with_dbcp2.rst new file mode 100644 index 0000000..60c9a53 --- /dev/null +++ b/umn/source/faq/big_data_service_development/can_the_hive_driver_be_interconnected_with_dbcp2.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1049.html + +.. _mrs_03_1049: + +Can the Hive Driver Be Interconnected with DBCP2? +================================================= + +The Hive driver cannot be interconnected with the DBCP2 database connection pool. The DBCP2 database connection pool invokes the **isValid** method to check whether a connection is available. However, Hive directly throws an exception when implementing this method. diff --git a/umn/source/faq/big_data_service_development/does_opentsdb_support_python_apis.rst b/umn/source/faq/big_data_service_development/does_opentsdb_support_python_apis.rst new file mode 100644 index 0000000..8d3f974 --- /dev/null +++ b/umn/source/faq/big_data_service_development/does_opentsdb_support_python_apis.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1070.html + +.. _mrs_03_1070: + +Does OpenTSDB Support Python APIs? +================================== + +OpenTSDB supports Python APIs. OpenTSDB provides HTTP-based RESTful APIs that are language-independent. Any language that supports HTTP requests can interconnect to OpenTSDB. diff --git a/umn/source/faq/big_data_service_development/how_do_i_balance_hdfs_data.rst b/umn/source/faq/big_data_service_development/how_do_i_balance_hdfs_data.rst new file mode 100644 index 0000000..5e203cd --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_balance_hdfs_data.rst @@ -0,0 +1,28 @@ +:original_name: mrs_03_1113.html + +.. _mrs_03_1113: + +How Do I Balance HDFS Data? +=========================== + +#. Log in to the master node of the cluster and run the corresponding command to configure environment variables. **/opt/client** indicates the client installation directory. Replace it with the actual one. + + **source /opt/client/bigdata_env** + + **kinit** **Component service user** (If Kerberos authentication is enabled for the cluster, run this command to authenticate the user. Skip this step if the Kerberos authentication is disabled.) + +#. Run the following command to start the balancer: + + **/opt/client/HDFS/hadoop/sbin/start-balancer.sh -threshold 5** + +#. View the log. + + After you execute the balance task, the **hadoop-root-balancer-**\ *Host name*\ **.log** log file will be generated in the client installation directory **/opt/client/HDFS/hadoop/logs**. + +#. (Optional) If you do not want to perform data balancing, run the following commands to stop the balancer: + + **source /opt/client/bigdata_env** + + **kinit** **Component service user** (If Kerberos authentication is enabled for the cluster, run this command to authenticate the user. Skip this step if the Kerberos authentication is disabled.) + + **/opt/client/HDFS/hadoop/sbin/stop-balancer.sh -threshold 5** diff --git a/umn/source/faq/big_data_service_development/how_do_i_change_flumeclient_logs_to_standard_logs.rst b/umn/source/faq/big_data_service_development/how_do_i_change_flumeclient_logs_to_standard_logs.rst new file mode 100644 index 0000000..c3e298c --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_change_flumeclient_logs_to_standard_logs.rst @@ -0,0 +1,28 @@ +:original_name: mrs_03_1058.html + +.. 
_mrs_03_1058: + +How Do I Change FlumeClient Logs to Standard Logs? +================================================== + +#. Log in to the node where FlumeClient is running. + +#. Go to the FlumeClient installation directory. + + For example, if the FlumeClient installation directory is **/opt/FlumeClient**, run the following command: + + **cd /opt/FlumeClient/fusioninsight-flume-1.9.0/bin** + +#. Run the **./flume-manage.sh stop** command to stop FlumeClient. + +#. Run the **vi /log4j.properties** command to open the **log4j.properties** file and change the value of **flume.root.logger** to **${flume.log.level},console**. + +#. Run the **vim /flume-manager.sh** command to open the **flume-manager.sh** script in the **bin** directory in the Flume installation directory. + +#. Comment out the following information in the **flume-manager.sh** script: + + **>/dev/null 2>&1 &** + +#. Run the **./flume-manage.sh start** command to restart FlumeClient. + +#. After the modification, check whether the Docker configuration is correct. diff --git a/umn/source/faq/big_data_service_development/how_do_i_change_the_number_of_hdfs_replicas.rst b/umn/source/faq/big_data_service_development/how_do_i_change_the_number_of_hdfs_replicas.rst new file mode 100644 index 0000000..44ef967 --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_change_the_number_of_hdfs_replicas.rst @@ -0,0 +1,22 @@ +:original_name: mrs_03_1061.html + +.. _mrs_03_1061: + +How Do I Change the Number of HDFS Replicas? +============================================ + +#. Go to the HDFS service configuration page. + + - For MRS cluster versions earlier than 1.9.2: + + Log in to MRS Manager, choose **Services** > **HDFS** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + + - For MRS 1.9.2 or later, click the cluster name on the MRS console, choose **Components** > **HDFS** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - MRS 3.\ *x* or later: Log in to FusionInsight Manager. And choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Configurations** > **All Configurations**. + +#. Search for **dfs.replication**, change the value (value range: 1 to 16), and restart the HDFS instance. diff --git a/umn/source/faq/big_data_service_development/how_do_i_check_whether_the_resourcemanager_configuration_of_yarn_is_correct.rst b/umn/source/faq/big_data_service_development/how_do_i_check_whether_the_resourcemanager_configuration_of_yarn_is_correct.rst new file mode 100644 index 0000000..d3e6c08 --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_check_whether_the_resourcemanager_configuration_of_yarn_is_correct.rst @@ -0,0 +1,49 @@ +:original_name: mrs_03_1163.html + +.. _mrs_03_1163: + +How Do I Check Whether the ResourceManager Configuration of Yarn Is Correct? +============================================================================ + +#. .. _mrs_03_1163__li92876413199: + + Log in to MRS Manager and choose **Services** > **Yarn** > **Instance**. + +#. 
Synchronize the configuration between the two ResourceManager nodes.Perform the following steps on each ResourceManager node:Click the name of the ResourceManager node, and choose **More** > **Synchronize Configuration**.In the dialog box displayed, deselect **Restart services or instances whose configurations have expired** and click **Yes**. + +#. Click **Yes** to synchronize the configuration. + +#. Log in to the Master nodes as user **root**. + +#. Run the **cd /opt/Bigdata/MRS_Current/*_*_ResourceManager/etc_UPDATED/** command to go to the **etc_UPDATED** directory. + +#. Run the **grep '\\.queues' capacity-scheduler.xml -A2** command to display all configured queues and check whether the queues are consistent with those displayed on Manager. + + **root-default** is hidden on the Manager page. + + |image1| + +#. .. _mrs_03_1163__li941013146411: + + Run the **grep '\\.capacity' capacity-scheduler.xml -A2** command to display the value of each queue and check whether the value of each queue is the same as that displayed on Manager. Check whether the sum of the values configured for all queues is **100**. + + - If the sum is **100**, the configuration is correct. + - If the sum is not **100**, the configuration is incorrect. Perform the following steps to rectify the fault. + +#. Log in to MRS Manager, and select **Hosts**. + +#. Determine the active Master node. The host name of the active Master node starts with a solid pentagon. + +#. Log in to the active Master node as user **root**. + +#. Run the **su - omm** command to switch to user **omm**. + +#. Run the **sh /opt/Bigdata/om-0.0.1/sbin/restart-controller.sh** command to restart the controller when no operation is being performed on Manager. + + Restarting the controller will not affect the big data component services. + +#. Repeat :ref:`1 ` to :ref:`7 ` to synchronize ResourceManager configurations and check whether the configurations are correct. + + If the latest configuration has not been loaded after the configuration synchronization is complete, a message will be displayed on the Manager page indicating that the configuration has expired. However, this will not affect services. The latest configuration will be automatically loaded when the component restarts. + +.. |image1| image:: /_static/images/en-us_image_0293131436.png diff --git a/umn/source/faq/big_data_service_development/how_do_i_configure_other_data_sources_on_presto.rst b/umn/source/faq/big_data_service_development/how_do_i_configure_other_data_sources_on_presto.rst new file mode 100644 index 0000000..86701f3 --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_configure_other_data_sources_on_presto.rst @@ -0,0 +1,71 @@ +:original_name: mrs_03_1147.html + +.. _mrs_03_1147: + +How Do I Configure Other Data Sources on Presto? +================================================ + +In this section, MySQL is used as an example. + +- For MRS 1.\ *x* and 3.\ *x* clusters, do the following: + + #. Log in to the MRS management console. + + #. Click the name of the cluster to go to its details page. + + #. Click the **Components** tab and then **Presto** in the component list. On the page that is displayed, click the **Configurations** tab then the **All Configurations** sub-tab. + + #. On the Presto configuration page that is displayed, find **connector-customize**. + + #. Set **Name** and **Value** as follows: + + **Name**: **mysql.connector.name** + + **Value**: **mysql** + + #. 
Click the plus sign (+) to add three more fields and set **Name** and **Value** according to the table below. Then click **Save**. + + +---------------------------+-----------------------------------+--------------------------+ + | Name | Value | Description | + +===========================+===================================+==========================+ + | mysql.connection-url | jdbc:mysql://xxx.xxx.xxx.xxx:3306 | Database connection pool | + +---------------------------+-----------------------------------+--------------------------+ + | mysql.connection-user | xxxx | Database username | + +---------------------------+-----------------------------------+--------------------------+ + | mysql.connection-password | xxxx | Database password | + +---------------------------+-----------------------------------+--------------------------+ + + #. Restart the Presto service. + + #. Run the following command to connect to the Presto Server of the cluster: + + **presto_cli.sh --**\ *krb5-config*\ **-**\ *path* {krb5.conf path} --*krb5-principal* {User principal} --*krb5-keytab-path* {user.keytab path} --*user* {presto username} + + #. Log in to Presto and run the **show catalogs** command to check whether the data source list mysql of Presto can be queried. + + |image1| + + Run the **show schemas from mysql** command to query the MySQL database. + +- For MRS 2.\ *x* clusters, do the following: + + #. Create the **mysql.properties** configuration file containing the following content: + + connector.name=mysql + + connection-url=jdbc:mysql://mysqlIp:3306 + + connection-user=Username + + connection-password=Password + + .. note:: + + - **mysqlIp** indicates the IP address of the MySQL instance, which must be able to communicate with the MRS network. + - The username and password are those used to log in to the MySQL database. + + #. Upload the configuration file to the **/opt/Bigdata/MRS_Current/1_14_Coordinator/etc/catalog/** directory on the master node (where the Coordinator instance resides) and the **/opt/Bigdata/MRS_Current/1_14_Worker/etc/catalog/** directory on the core node (depending on the actual directory in the cluster), and change the file owner group to **omm:wheel**. + + #. Restart the Presto service. + +.. |image1| image:: /_static/images/en-us_image_0000001261300062.png diff --git a/umn/source/faq/big_data_service_development/how_do_i_connect_to_spark_beeline_from_mrs.rst b/umn/source/faq/big_data_service_development/how_do_i_connect_to_spark_beeline_from_mrs.rst new file mode 100644 index 0000000..a0695c1 --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_connect_to_spark_beeline_from_mrs.rst @@ -0,0 +1,43 @@ +:original_name: mrs_03_1158.html + +.. _mrs_03_1158: + +How Do I Connect to Spark Beeline from MRS? +=========================================== + +#. Log in to the master node in the cluster as user **root**. + +#. Run the following command to configure environment variables: + + **source** *Client installation directory*\ **/bigdata_env** + +#. If Kerberos authentication is enabled for the cluster, authenticate the user. If Kerberos authentication is disabled, skip this step. + + Command: **kinit** *MRS cluster user* + + Example: + + - If the user is a machine-machine user, run **kinit -kt user.keytab sparkuser**. + - If the user is a human-machine user, run **kinit sparkuser**. + +#. Run the following command to connect to Spark Beeline: + + **spark-beeline** + +#. Run commands on Spark Beeline. 
For example, create the table **test** in the **obs://mrs-word001/table/** directory. + + **create table test(id int) location 'obs://mrs-word001/table/';** + +#. Query all tables. + + **show tables;** + + If the table **test** is displayed in the command output, OBS is successfully accessed. + + + .. figure:: /_static/images/en-us_image_0264281176.png + :alt: **Figure 1** Returned table name + + **Figure 1** Returned table name + +#. Press **Ctrl+C** to exit the Spark Beeline. diff --git a/umn/source/faq/big_data_service_development/how_do_i_connect_to_spark_shell_from_mrs.rst b/umn/source/faq/big_data_service_development/how_do_i_connect_to_spark_shell_from_mrs.rst new file mode 100644 index 0000000..cda32fc --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_connect_to_spark_shell_from_mrs.rst @@ -0,0 +1,25 @@ +:original_name: mrs_03_1157.html + +.. _mrs_03_1157: + +How Do I Connect to Spark Shell from MRS? +========================================= + +#. Log in to the Master node in the cluster as user **root**. + +#. Run the following command to configure environment variables: + + **source** *Client installation directory*\ **/bigdata_env** + +#. If Kerberos authentication is enabled for the cluster, authenticate the user. If Kerberos authentication is disabled, skip this step. + + Command: **kinit** *MRS cluster user* + + Example: + + - If the user is a machine-machine user, run **kinit -kt user.keytab sparkuser**. + - If the user is a human-machine user, run **kinit sparkuser**. + +#. Run the following command to connect to Spark shell: + + **spark-shell** diff --git a/umn/source/faq/big_data_service_development/how_do_i_do_if_a_hivesql_hivescript_job_fails_to_submit_after_hive_is_added.rst b/umn/source/faq/big_data_service_development/how_do_i_do_if_a_hivesql_hivescript_job_fails_to_submit_after_hive_is_added.rst new file mode 100644 index 0000000..805ecfb --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_do_if_a_hivesql_hivescript_job_fails_to_submit_after_hive_is_added.rst @@ -0,0 +1,20 @@ +:original_name: mrs_03_1200.html + +.. _mrs_03_1200: + +How Do I Do If a "hivesql/hivescript" Job Fails to Submit After Hive Is Added? +============================================================================== + +This issue occurs because the **MRS CommonOperations** permission bound to the user group to which the user who submits the job belongs does not include the Hive permission after being synchronized to Manager. To solve this issue, perform the following operations: + +#. Add the Hive service. +#. Log in to the IAM console and create a user group. The policy bound to the user group is the same as that of the user group to which the user who submits the job belongs. +#. Add the user who submits the job to the new user group. +#. Refresh the cluster details page on the MRS console. The status of IAM user synchronization is **Not synchronized**. +#. Click **Synchronize** on the right of **IAM User Sync**. Go back to the previous page. In the navigation pane on the left, choose **Operation Logs** and check whether the user is changed. + + - If yes, submit the Hive job again. + - If no, check whether all the preceding operations are complete. + + - If yes, contact the O&M personnel. + - If no, submit the Hive job after the preceding operations are complete. 
diff --git a/umn/source/faq/big_data_service_development/how_do_i_do_if_an_alarm_indicating_insufficient_memory_is_reported_during_spark_task_execution.rst b/umn/source/faq/big_data_service_development/how_do_i_do_if_an_alarm_indicating_insufficient_memory_is_reported_during_spark_task_execution.rst new file mode 100644 index 0000000..9f42dd1 --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_do_if_an_alarm_indicating_insufficient_memory_is_reported_during_spark_task_execution.rst @@ -0,0 +1,27 @@ +:original_name: mrs_03_1206.html + +.. _mrs_03_1206: + +How Do I Do If an Alarm Indicating Insufficient Memory Is Reported During Spark Task Execution? +=============================================================================================== + +Symptom +------- + +When a Spark task is executed, an alarm indicating insufficient memory is reported. The alarm ID is 18022. As a result, no available memory can be used. + +Procedure +--------- + +Set the executor parameters in the SQL script to limit the number of cores and memory of an executor. + +For example, the configuration is as follows: + +.. code-block:: + + set hive.execution.engine=spark; + set spark.executor.cores=2; + set spark.executor.memory=4G; + set spark.executor.instances=10; + +Change the values of the parameters as required. diff --git a/umn/source/faq/big_data_service_development/how_do_i_do_if_an_error_occurs_when_hive_runs_the_beeline_-e_command_to_execute_multiple_statements.rst b/umn/source/faq/big_data_service_development/how_do_i_do_if_an_error_occurs_when_hive_runs_the_beeline_-e_command_to_execute_multiple_statements.rst new file mode 100644 index 0000000..4193933 --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_do_if_an_error_occurs_when_hive_runs_the_beeline_-e_command_to_execute_multiple_statements.rst @@ -0,0 +1,27 @@ +:original_name: mrs_03_1194.html + +.. _mrs_03_1194: + +How Do I Do If an Error Occurs When Hive Runs the **beeline -e** Command to Execute Multiple Statements? +======================================================================================================== + +When Hive of MRS 3.\ *x* runs the **beeline -e " use default;show tables;"** command, the following error message is displayed: Error while compiling statement: FAILED: ParseException line 1:11 missing EOF at ';' near 'default' (state=42000,code=40000). + +Solutions: + +- Method 1: Replace the **beeline -e " use default;show tables;"** command with **beeline --entirelineascommand=false -e "use default;show tables;"**. +- Method 2: + + #. In the **/opt/Bigdata/client/Hive** directory on the Hive client, change **export CLIENT_HIVE_ENTIRELINEASCOMMAND=true** in the **component_env** file to **export CLIENT_HIVE_ENTIRELINEASCOMMAND=false**. + + + .. figure:: /_static/images/en-us_image_0000001205479339.png + :alt: **Figure 1** Changing the **component_env** file + + **Figure 1** Changing the **component_env** file + + #. 
Run the following command to verify the configuration: + + **source /opt/Bigdata/client/bigdata_env** + + **beeline -e " use default;show tables;"** diff --git a/umn/source/faq/big_data_service_development/how_do_i_do_if_clickhouse_consumes_excessive_cpu_resources.rst b/umn/source/faq/big_data_service_development/how_do_i_do_if_clickhouse_consumes_excessive_cpu_resources.rst new file mode 100644 index 0000000..d3201db --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_do_if_clickhouse_consumes_excessive_cpu_resources.rst @@ -0,0 +1,16 @@ +:original_name: mrs_03_1209.html + +.. _mrs_03_1209: + +How Do I Do If ClickHouse Consumes Excessive CPU Resources? +=========================================================== + +Symptom +------- + +A user performs a large number of update operations using ClickHouse. These operations consume a large number of resources. In addition, a failed operation is executed again. As a result, retries of the failed operations occupy too many CPU resources. + +Procedure +--------- + +Delete the existing data from ZooKeeper and stop executing the update statements. diff --git a/umn/source/faq/big_data_service_development/how_do_i_do_if_error_message_not_authorized_to_access_group_xxx_is_displayed_when_a_kafka_topic_is_consumed.rst b/umn/source/faq/big_data_service_development/how_do_i_do_if_error_message_not_authorized_to_access_group_xxx_is_displayed_when_a_kafka_topic_is_consumed.rst new file mode 100644 index 0000000..8c3b8ba --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_do_if_error_message_not_authorized_to_access_group_xxx_is_displayed_when_a_kafka_topic_is_consumed.rst @@ -0,0 +1,12 @@ +:original_name: mrs_03_1197.html + +.. _mrs_03_1197: + +How Do I Do If Error Message "Not Authorized to access group xxx" Is Displayed When a Kafka Topic Is Consumed? +============================================================================================================== + +This issue is caused by the conflict between the Ranger authentication and ACL authentication of a cluster. If a Kafka cluster uses ACL for permission access control and Ranger authentication is enabled for the Kafka component, all authentications of the component are managed by Ranger. The permissions set by the original authentication plug-in are invalid. As a result, ACL authorization does not take effect. You can disable Ranger authentication of Kafka and restart the Kafka service to rectify the fault. The procedure is as follows: + +#. Log in to FusionInsight Manager and choose **Cluster** > **Services** > **Kafka**. +#. In the upper right corner of the **Dashboard** page, click **More** and choose **Disable Ranger**. In the displayed dialog box, enter the password and click **OK**. After the operation is successful, click **Finish**. +#. In the upper right corner of the **Dashboard** page, click **More** and choose **Restart Service** to restart the Kafka service.
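If the consumer group permission is still missing after Ranger authentication is disabled and ACL-based control takes effect again, the permission can be granted with the Kafka ACL tool on a client node. The following is only a sketch; the ZooKeeper address, username, topic, and consumer group are placeholders:

.. code-block::

   kafka-acls.sh --authorizer-properties zookeeper.connect=<ZooKeeper service IP>:2181/kafka --add --allow-principal User:<username> --consumer --topic <topic name> --group <group name>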
diff --git a/umn/source/faq/big_data_service_development/how_do_i_do_if_sessions_are_not_released_after_hue_connects_to_hiveserver_and_the_error_message_over_max_user_connections_is_displayed.rst b/umn/source/faq/big_data_service_development/how_do_i_do_if_sessions_are_not_released_after_hue_connects_to_hiveserver_and_the_error_message_over_max_user_connections_is_displayed.rst new file mode 100644 index 0000000..a8da0bb --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_do_if_sessions_are_not_released_after_hue_connects_to_hiveserver_and_the_error_message_over_max_user_connections_is_displayed.rst @@ -0,0 +1,20 @@ +:original_name: mrs_03_1214.html + +.. _mrs_03_1214: + +How Do I Do If Sessions Are Not Released After Hue Connects to HiveServer and the Error Message "over max user connections" Is Displayed? +========================================================================================================================================= + +Applicable versions: MRS 3.1.0 and earlier + +#. Modify the following file on the two Hue nodes: + + /opt/Bigdata/FusionInsight_Porter_8.*/install/FusionInsight-Hue-``*``/hue/apps/beeswax/src/beeswax/models.py + +#. Change the configurations in lines 396 and 404. + + Change **q = self.filter(owner=user, application=application).exclude(guid='').exclude(secret='')** to **q = self.filter(owner=user, application=application).exclude(guid=None).exclude(secret=None)**. + + |image1| + +.. |image1| image:: /_static/images/en-us_image_0000001195220440.png diff --git a/umn/source/faq/big_data_service_development/how_do_i_enable_the_map_type_on_clickhouse.rst b/umn/source/faq/big_data_service_development/how_do_i_enable_the_map_type_on_clickhouse.rst new file mode 100644 index 0000000..10fb8ae --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_enable_the_map_type_on_clickhouse.rst @@ -0,0 +1,48 @@ +:original_name: mrs_03_1217.html + +.. _mrs_03_1217: + +How Do I Enable the Map Type on ClickHouse? +=========================================== + +#. Log in to the active Master node as user **root**. + +#. Run the following command to modify the **/opt/Bigdata/components/current/ClickHouse/configurations.xml** configuration file to enable user parameter customization: + + **vim /opt/Bigdata/components/current/ClickHouse/configurations.xml** + + Change **hidden** to **advanced**, as shown in the following information in bold. Then save the configuration and exit. + + .. code-block:: + + + _clickhouse.custom_content.key + _user-xml-content + + + _user-xml-content + <yandex></yandex> + RESID_CLICKHOUSE_CONF_0025 + + +#. Run the following commands to switch to user **omm** and restart the controller service: + + **su - omm** + + **sh /opt/Bigdata/om-server/om/sbin/restart-controller.sh** + +#. Log in to FusionInsight Manager, choose **Cluster** > **Services** > **ClickHouse**. On the page that is displayed, click the **Configurations** tab then the **All Configurations** sub-tab. Click **ClickHouseServer(Role)** > **Customization**, and add the following content to the **\_user-xml-content** configuration item in the right pane: + + .. code-block:: + + + + + 1 + + + + +#. Click **Save**. + +#. Choose **Cluster** > **Services** > **ClickHouse**. In the upper right corner, choose **More** > **Restart Service** to restart the ClickHouse service.
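After the ClickHouse service is restarted, you can check from the ClickHouse client that the Map type is available. The following is only a minimal verification sketch, assuming the preceding configuration has taken effect and using an arbitrary table name **test_map**:

.. code-block::

   CREATE TABLE test_map (id UInt32, m Map(String, UInt64)) ENGINE = MergeTree ORDER BY id;
   INSERT INTO test_map VALUES (1, map('key1', 100));
   SELECT m['key1'] FROM test_map;
   DROP TABLE test_map;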
diff --git a/umn/source/faq/big_data_service_development/how_do_i_handle_the_kudu_service_exceptions_generated_during_cluster_creation.rst b/umn/source/faq/big_data_service_development/how_do_i_handle_the_kudu_service_exceptions_generated_during_cluster_creation.rst new file mode 100644 index 0000000..e4242cb --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_handle_the_kudu_service_exceptions_generated_during_cluster_creation.rst @@ -0,0 +1,63 @@ +:original_name: mrs_03_1169.html + +.. _mrs_03_1169: + +How Do I Handle the Kudu Service Exceptions Generated During Cluster Creation? +============================================================================== + +Viewing the Kudu Service Exception Logs +--------------------------------------- + +#. Log in to the MRS console. + +#. Click the name of the cluster. + +#. On the page displayed, choose **Components** > **Kudu** > **Instances** and locate the IP address of the abnormal instance. + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +#. Log in to the node where the abnormal instance resides, and view the Kudu log. + + .. code-block:: console + + cd /var/log/Bigdata/Kudu + [root@node-master1AERu kudu]# ls + healthchecklog runninglog startlog + + You can find the Kudu health check logs in the **healthchecklog** directory, the startup logs in the **startlog** directory, and the Kudu process run logs in the **runninglog** directory. + + .. code-block:: console + + [root@node-master1AERu logs]# pwd + /var/log/Bigdata/kudu/runninglog/master/logs + [root@node-master1AERu logs]# ls -al + kudu-master.ERROR kudu-master.INFO kudu-master.WARNING + + Run logs are classified into three types: ERROR, INFO, and WARNING. Each type of run logs is recorded in the corresponding file. You can run the **cat** command to view run logs of each type. + +Handling Kudu Service Exceptions +-------------------------------- + +The **/var/log/Bigdata/kudu/runninglog/master/logs/kudu-master.INFO** file contains the following error information: + +.. code-block:: + + "Unable to init master catalog manager: not found: Unable to initialize catalog manager: Failed to initialize sys tables async: Unable to load consensus metadata for tablet 0000000000000000000000: xxx" + +If this exception occurs when the Kudu service is installed for the first time, the KuduMaster service is not started. The data inconsistency causes the startup failure. To solve the problem, perform the following steps to clear the data directories and restart the Kudu service. If the Kudu service is not installed for the first time, clearing the data directories will cause data loss. In this case, migrate data and clear the data directory. + +#. Search for the data directories **fs_data_dir**, **fs_wal_dir**, and **fs_meta_dir**. + + **find /opt -name master.gflagfile** + + **cat /opt/Bigdata/FusionInsight_Kudu_*/*_KuduMaster/etc/master.gflagfile \| grep fs\_** + +#. On the cluster details page, choose **Components** > **Kudu** and click **Stop Service**. + +#. Clear the Kudu data directories on all KuduMaster and KuduTserver nodes. The following command uses two data disks as an example. + + **rm -Rvf /srv/Bigdata/data1/kudu, rm -Rvf /srv/Bigdata/data2/kudu** + +#. On the cluster details page, choose **Components** > **Kudu** and choose **More** > **Restart Service**. + +#. Check the Kudu service status and logs. 
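After the Kudu service is restarted, its overall health can also be checked with the Kudu command-line tool if it is available on the node. This is an optional sketch; the KuduMaster addresses are placeholders, and **7051** is the default KuduMaster RPC port in open-source Kudu (adjust it if the cluster uses a different port):

.. code-block::

   kudu cluster ksck <KuduMaster1 IP>:7051,<KuduMaster2 IP>:7051,<KuduMaster3 IP>:7051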
diff --git a/umn/source/faq/big_data_service_development/how_do_i_modify_the_allow_drop_detached_parameter_of_clickhouse.rst b/umn/source/faq/big_data_service_development/how_do_i_modify_the_allow_drop_detached_parameter_of_clickhouse.rst new file mode 100644 index 0000000..43b7664 --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_modify_the_allow_drop_detached_parameter_of_clickhouse.rst @@ -0,0 +1,45 @@ +:original_name: mrs_03_1210.html + +.. _mrs_03_1210: + +How Do I Modify the allow_drop_detached Parameter of ClickHouse? +================================================================ + +#. Log in to the node where the ClickHouse client is located as user **root**. + +#. Run the following commands to go to the client installation directory and set the environment variables: + + **cd /opt/**\ *Client installation directory* + + **source** **bigdata_env** + +#. If Kerberos authentication is enabled for the cluster, run the following command to authenticate the user. If Kerberos authentication is disabled, skip this step. + + **kinit** *MRS cluster user* + + .. note:: + + The user must have the ClickHouse administrator permissions. + +#. Run the **clickhouse client --host** *192.168.42.90* **--secure -m** command, in which *192.168.42.90* indicates the IP address of the ClickHouseServer instance node. The command output is as follows: + + .. code-block:: console + + [root@server-2110082001-0017 hadoopclient]# clickhouse client --host 192.168.42.90 --secure -m + ClickHouse client version 21.3.4.25. + Connecting to 192.168.42.90:21427. + Connected to ClickHouse server version 21.3.4 revision 54447. + +#. Run the following command to set the value of the **allow_drop_detached** parameter, for example, **1**: + + **set allow_drop_detached=1;** + +#. Run the following command to query the value of the **allow_drop_detached** parameter: + + **SELECT \* FROM system.settings WHERE name = 'allow_drop_detached';** + + |image1| + +#. Run the **q;** command to exit the ClickHouse client. + +.. |image1| image:: /_static/images/en-us_image_0000001223688037.png diff --git a/umn/source/faq/big_data_service_development/how_do_i_modify_the_hdfs_active_standby_switchover_class.rst b/umn/source/faq/big_data_service_development/how_do_i_modify_the_hdfs_active_standby_switchover_class.rst new file mode 100644 index 0000000..4d1d63d --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_modify_the_hdfs_active_standby_switchover_class.rst @@ -0,0 +1,16 @@ +:original_name: mrs_03_1196.html + +.. _mrs_03_1196: + +How Do I Modify the HDFS Active/Standby Switchover Class? +========================================================= + +If the **org.apache.hadoop.hdfs.server.namenode.ha.AdaptiveFailoverProxyProvider** class is unavailable when a cluster of MRS 3.\ *x* connects to NameNodes using HDFS, the cause is that the HDFS active/standby switchover class of the cluster is configured improperly. To solve the problem, perform the following operations: + +- Method 1: Add the **hadoop-plugins-xxx.jar** package to the **classpath** or **lib** directory of your program. + + The **hadoop-plugins-xxx.jar** package is stored in the HDFS client directory, for example, **$HADOOP_HOME/share/hadoop/common/lib/hadoop-plugins-8.0.2-302023.jar**. 
+ +- Method 2: Change the configuration item of HDFS to the corresponding open source class, as follows: + + dfs.client.failover.proxy.provider.hacluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider diff --git a/umn/source/faq/big_data_service_development/how_do_i_obtain_the_client_version_of_mrs_kafka.rst b/umn/source/faq/big_data_service_development/how_do_i_obtain_the_client_version_of_mrs_kafka.rst new file mode 100644 index 0000000..c489abb --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_obtain_the_client_version_of_mrs_kafka.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1145.html + +.. _mrs_03_1145: + +How Do I Obtain the Client Version of MRS Kafka? +================================================ + +Run a Kafka client command with the **--bootstrap-server** option to query the information about the client. diff --git a/umn/source/faq/big_data_service_development/how_do_i_reset_kafka_data.rst b/umn/source/faq/big_data_service_development/how_do_i_reset_kafka_data.rst new file mode 100644 index 0000000..fce04cc --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_reset_kafka_data.rst @@ -0,0 +1,13 @@ +:original_name: mrs_03_1106.html + +.. _mrs_03_1106: + +How Do I Reset Kafka Data? +========================== + +You can reset Kafka data by deleting Kafka topics. + +- Delete a topic: **kafka-topics.sh --delete --zookeeper** *ZooKeeper cluster service IP address*\ **:2181/kafka --topic topicname** +- Query all topics: **kafka-topics.sh --zookeeper** *ZooKeeper cluster service IP address*\ **:2181/kafka --list** + +After the deletion command is executed, empty topics will be deleted immediately. If a topic has data, the topic will be marked for deletion and will be deleted by Kafka later. diff --git a/umn/source/faq/big_data_service_development/how_do_i_set_the_ttl_for_an_hbase_table.rst b/umn/source/faq/big_data_service_development/how_do_i_set_the_ttl_for_an_hbase_table.rst new file mode 100644 index 0000000..d0f4e73 --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_set_the_ttl_for_an_hbase_table.rst @@ -0,0 +1,22 @@ +:original_name: mrs_03_1140.html + +.. _mrs_03_1140: + +How Do I Set the TTL for an HBase Table? +======================================== + +- Set the time to live (TTL) when creating a table: + + Create the **t_task_log** table, set the column family to **f**, and set the TTL to **86400** seconds. + + .. code-block:: + + create 't_task_log',{NAME => 'f', TTL=>'86400'} + +- Set the TTL for an existing table: + + .. code-block:: + + disable "t_task_log" #Disable the table (services must be stopped). + alter "t_task_log",NAME=>'data',TTL=>'86400' # Set the TTL value for the column family data. + enable "t_task_log" #Restore the table. diff --git a/umn/source/faq/big_data_service_development/how_do_i_specify_a_log_path_when_submitting_a_task_in_an_mrs_storm_cluster.rst b/umn/source/faq/big_data_service_development/how_do_i_specify_a_log_path_when_submitting_a_task_in_an_mrs_storm_cluster.rst new file mode 100644 index 0000000..93985ea --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_specify_a_log_path_when_submitting_a_task_in_an_mrs_storm_cluster.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1127.html + +.. _mrs_03_1127: + +How Do I Specify a Log Path When Submitting a Task in an MRS Storm Cluster?
+=========================================================================== + +You can modify the **/opt/Bigdata/MRS\_**\ *XXX* **/1\_**\ *XX* **\_Supervisor/etc/worker.xml** file on the streaming Core node of MRS, set the value of **filename** to the path, and restart the corresponding instance on Manager. + +You are advised not to modify the default log configuration of MRS. Otherwise, the log system may become abnormal. diff --git a/umn/source/faq/big_data_service_development/how_do_i_view_hbase_logs.rst b/umn/source/faq/big_data_service_development/how_do_i_view_hbase_logs.rst new file mode 100644 index 0000000..3c1d151 --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_view_hbase_logs.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1045.html + +.. _mrs_03_1045: + +How Do I View HBase Logs? +========================= + +#. Log in to the Master node in the cluster as user **root**. +#. Run the **su - omm** command to switch to user **omm**. +#. Run the **cd /var/log/Bigdata/hbase/** command to go to the **/var/log/Bigdata/hbase/** directory and view HBase logs. diff --git a/umn/source/faq/big_data_service_development/how_do_i_view_kudu_logs.rst b/umn/source/faq/big_data_service_development/how_do_i_view_kudu_logs.rst new file mode 100644 index 0000000..34f9f56 --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_view_kudu_logs.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1069.html + +.. _mrs_03_1069: + +How Do I View Kudu Logs? +======================== + +#. Log in to the Master node in the cluster. +#. Run the **su - omm** command to switch to user **omm**. +#. Run the **cd /var/log/Bigdata/kudu/** command to go to the **/var/log/Bigdata/kudu/** directory and view Kudu logs. diff --git a/umn/source/faq/big_data_service_development/how_do_i_view_the_hive_table_created_by_another_user.rst b/umn/source/faq/big_data_service_development/how_do_i_view_the_hive_table_created_by_another_user.rst new file mode 100644 index 0000000..8aae86b --- /dev/null +++ b/umn/source/faq/big_data_service_development/how_do_i_view_the_hive_table_created_by_another_user.rst @@ -0,0 +1,71 @@ +:original_name: mrs_03_1082.html + +.. _mrs_03_1082: + +How Do I View the Hive Table Created by Another User? +===================================================== + +Versions earlier than MRS 3.\ *x*: + +#. Log in to MRS Manager and choose **System** > **Permission** > **Manage Role**. +#. Click **Create Role**, and set **Role Name** and **Description**. +#. In the **Permission** table, choose **Hive** > **Hive Read Write Privileges**. +#. In the database list, click the name of the database where the table created by user B is stored. The table is displayed. +#. In the **Permission** column of the table created by user B, select **SELECT**. +#. Click **OK**, and return to the **Role** page. +#. Choose **System** > **Manage User**. Locate the row containing user A, click **Modify** to bind the new role to user A, and click **OK**. After about 5 minutes, user A can access the table created by user B. + +MRS 3.\ *x* or later: + +#. Log in to FusionInsight Manager and choose **Cluster** > **Services**. On the page that is displayed, choose **Hive**. On the displayed page, choose **More**, and check whether **Enable Ranger** is grayed out. + + - If yes, go to :ref:`9 `. + - If no, perform :ref:`2 ` to :ref:`8 `. + +#. .. _mrs_03_1082__li1778559161211: + + Log in to FusionInsight Manager and choose **System** > **Permission** > **Role**. + +#. 
Click **Create Role**, and set **Role Name** and **Description**. + +#. In the **Configure Resource Permission** table, choose *Name of the desired cluster* > **Hive** > **Hive Read Write Privileges**. + +#. In the database list, click the name of the database where the table created by user B is stored. The table is displayed. + +#. In the **Permission** column of the table created by user B, select **Select**. + +#. Click **OK**, and return to the **Role** page. + +#. .. _mrs_03_1082__li74548548355: + + Choose **Permission** > **User**. On the **Local User** page that is displayed, locate the row containing user A, click **Modify** in the **Operation** column to bind the new role to user A, and click **OK**. After about 5 minutes, user A can access the table created by user B. + +#. .. _mrs_03_1082__li9406195118112: + + Perform the following steps to add the Ranger access permission policy of Hive: + + a. Log in to FusionInsight Manager as a Hive administrator and choose **Cluster** > **Services**. On the page that is displayed, choose **Ranger**. On the displayed page, click the URL next to **Ranger WebUI** to go to the Ranger management page. + b. On the home page, click the component plug-in name in the **HADOOP SQL** area, for example, **Hive**. + c. On the **Access** tab page, click **Add New Policy** to add a Hive permission control policy. + d. In the **Create Policy** dialog box that is displayed, set the following parameters: + + - **Policy Name**: Enter a policy name, for example, **table_test_hive**. + - **database**: Enter or select the database where the table created by user B is stored, for example, **default**. + - **table**: Enter or select the table created by user B, for example, **test**. + - **column**: Enter and select a column, for example, **\***. + - In the **Allow Conditions** area, click **Select User**, select user A, click **Add Permissions**, and select **select**. + - Click **Add**. + +#. Perform the following steps to add the Ranger access permission policy of HDFS: + + a. Log in to FusionInsight Manager as user **rangeradmin** and choose **Cluster** > **Services**. On the page that is displayed, choose **Ranger**. On the displayed page, click the URL next to **Ranger WebUI** to go to the Ranger management page. + b. On the home page, click the component plug-in name in the **HDFS** area, for example, **hacluster**. + c. Click **Add New Policy** to add a HDFS permission control policy. + d. In the **Create Policy** dialog box that is displayed, set the following parameters: + + - **Policy Name**: Enter a policy name, for example, **tablehdfs_test**. + - **Resource Path**: Set this parameter to the HDFS path where the table created by user B is stored, for example, **/user/hive/warehouse/**\ *Database name*\ **/**\ *Table name*. + - In the **Allow Conditions** area, select user A for **Select User**, click **Add Permissions** in the **Permissions** column, and select **Read** and **Execute**. + - Click **Add**. + +#. View basic information about the policy in the policy list. After the policy takes effect, user A can view the table created by user B. diff --git a/umn/source/faq/big_data_service_development/index.rst b/umn/source/faq/big_data_service_development/index.rst new file mode 100644 index 0000000..aff01ad --- /dev/null +++ b/umn/source/faq/big_data_service_development/index.rst @@ -0,0 +1,88 @@ +:original_name: mrs_03_2014.html + +.. 
_mrs_03_2014: + +Big Data Service Development +============================ + +- :ref:`Can MRS Run Multiple Flume Tasks at a Time? ` +- :ref:`How Do I Change FlumeClient Logs to Standard Logs? ` +- :ref:`Where Are the JAR Files and Environment Variables of Hadoop Stored? ` +- :ref:`What Compression Algorithms Does HBase Support? ` +- :ref:`Can MRS Write Data to HBase Through the HBase External Table of Hive? ` +- :ref:`How Do I View HBase Logs? ` +- :ref:`How Do I Set the TTL for an HBase Table? ` +- :ref:`How Do I Balance HDFS Data? ` +- :ref:`How Do I Change the Number of HDFS Replicas? ` +- :ref:`What Is the Port for Accessing HDFS Using Python? ` +- :ref:`How Do I Modify the HDFS Active/Standby Switchover Class? ` +- :ref:`What Is the Recommended Number Type of DynamoDB in Hive Tables? ` +- :ref:`Can the Hive Driver Be Interconnected with DBCP2? ` +- :ref:`How Do I View the Hive Table Created by Another User? ` +- :ref:`Can I Export the Query Result of Hive Data? ` +- :ref:`How Do I Do If an Error Occurs When Hive Runs the beeline -e Command to Execute Multiple Statements? ` +- :ref:`How Do I Do If a "hivesql/hivescript" Job Fails to Submit After Hive Is Added? ` +- :ref:`What If an Excel File Downloaded on Hue Failed to Open? ` +- :ref:`How Do I Do If Sessions Are Not Released After Hue Connects to HiveServer and the Error Message "over max user connections" Is Displayed? ` +- :ref:`How Do I Reset Kafka Data? ` +- :ref:`How Do I Obtain the Client Version of MRS Kafka? ` +- :ref:`What Access Protocols Are Supported by Kafka? ` +- :ref:`How Do I Do If Error Message "Not Authorized to access group xxx" Is Displayed When a Kafka Topic Is Consumed? ` +- :ref:`What Compression Algorithms Does Kudu Support? ` +- :ref:`How Do I View Kudu Logs? ` +- :ref:`How Do I Handle the Kudu Service Exceptions Generated During Cluster Creation? ` +- :ref:`Does OpenTSDB Support Python APIs? ` +- :ref:`How Do I Configure Other Data Sources on Presto? ` +- :ref:`How Do I Connect to Spark Shell from MRS? ` +- :ref:`How Do I Connect to Spark Beeline from MRS? ` +- :ref:`Where Are the Execution Logs of Spark Jobs Stored? ` +- :ref:`How Do I Specify a Log Path When Submitting a Task in an MRS Storm Cluster? ` +- :ref:`How Do I Check Whether the ResourceManager Configuration of Yarn Is Correct? ` +- :ref:`How Do I Modify the allow_drop_detached Parameter of ClickHouse? ` +- :ref:`How Do I Do If an Alarm Indicating Insufficient Memory Is Reported During Spark Task Execution? ` +- :ref:`How Do I Do If ClickHouse Consumes Excessive CPU Resources? ` +- :ref:`How Do I Enable the Map Type on ClickHouse? ` +- :ref:`It Takes a Long Time for Spark SQL to Access Hive Partitioned Tables Before Job Startup ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + can_mrs_run_multiple_flume_tasks_at_a_time + how_do_i_change_flumeclient_logs_to_standard_logs + where_are_the_jar_files_and_environment_variables_of_hadoop_stored + what_compression_algorithms_does_hbase_support + can_mrs_write_data_to_hbase_through_the_hbase_external_table_of_hive + how_do_i_view_hbase_logs + how_do_i_set_the_ttl_for_an_hbase_table + how_do_i_balance_hdfs_data + how_do_i_change_the_number_of_hdfs_replicas + what_is_the_port_for_accessing_hdfs_using_python + how_do_i_modify_the_hdfs_active_standby_switchover_class + what_is_the_recommended_number_type_of_dynamodb_in_hive_tables + can_the_hive_driver_be_interconnected_with_dbcp2 + how_do_i_view_the_hive_table_created_by_another_user + can_i_export_the_query_result_of_hive_data + how_do_i_do_if_an_error_occurs_when_hive_runs_the_beeline_-e_command_to_execute_multiple_statements + how_do_i_do_if_a_hivesql_hivescript_job_fails_to_submit_after_hive_is_added + what_if_an_excel_file_downloaded_on_hue_failed_to_open + how_do_i_do_if_sessions_are_not_released_after_hue_connects_to_hiveserver_and_the_error_message_over_max_user_connections_is_displayed + how_do_i_reset_kafka_data + how_do_i_obtain_the_client_version_of_mrs_kafka + what_access_protocols_are_supported_by_kafka + how_do_i_do_if_error_message_not_authorized_to_access_group_xxx_is_displayed_when_a_kafka_topic_is_consumed + what_compression_algorithms_does_kudu_support + how_do_i_view_kudu_logs + how_do_i_handle_the_kudu_service_exceptions_generated_during_cluster_creation + does_opentsdb_support_python_apis + how_do_i_configure_other_data_sources_on_presto + how_do_i_connect_to_spark_shell_from_mrs + how_do_i_connect_to_spark_beeline_from_mrs + where_are_the_execution_logs_of_spark_jobs_stored + how_do_i_specify_a_log_path_when_submitting_a_task_in_an_mrs_storm_cluster + how_do_i_check_whether_the_resourcemanager_configuration_of_yarn_is_correct + how_do_i_modify_the_allow_drop_detached_parameter_of_clickhouse + how_do_i_do_if_an_alarm_indicating_insufficient_memory_is_reported_during_spark_task_execution + how_do_i_do_if_clickhouse_consumes_excessive_cpu_resources + how_do_i_enable_the_map_type_on_clickhouse + it_takes_a_long_time_for_spark_sql_to_access_hive_partitioned_tables_before_job_startup diff --git a/umn/source/faq/big_data_service_development/it_takes_a_long_time_for_spark_sql_to_access_hive_partitioned_tables_before_job_startup.rst b/umn/source/faq/big_data_service_development/it_takes_a_long_time_for_spark_sql_to_access_hive_partitioned_tables_before_job_startup.rst new file mode 100644 index 0000000..b5735b3 --- /dev/null +++ b/umn/source/faq/big_data_service_development/it_takes_a_long_time_for_spark_sql_to_access_hive_partitioned_tables_before_job_startup.rst @@ -0,0 +1,39 @@ +:original_name: mrs_03_1248.html + +.. _mrs_03_1248: + +It Takes a Long Time for Spark SQL to Access Hive Partitioned Tables Before Job Startup +======================================================================================= + +Symptom +------- + +When Spark SQL is used to access Hive partitioned tables stored in OBS, the access speed is slow and a large number of OBS query APIs are called. + +Example SQL: + +.. code-block:: + + select a,b,c from test where b=xxx + +Fault Locating +-------------- + +According to the configuration, the task should scan only the partition whose b is *xxx*. However, the task logs show that the task scans all partitions and then calculates the data whose b is *xxx*.
As a result, the task calculation is slow. In addition, a large number of OBS requests are sent because all files need to be scanned. + +By default, the execution plan optimization based on partition statistics is enabled on MRS, which is equivalent to automatic execution of Analyze Table. (The default configuration method is to set **spark.sql.statistics.fallBackToHdfs** to **true**. You can set this parameter to **false**.) After this function is enabled, table partition statistics are scanned during SQL execution and used as cost estimation in the execution plan. For example, small tables identified during cost evaluation are broadcast to each node in the memory for join operations, significantly reducing shuffle time. This function greatly optimizes performance in join scenarios, but increases the number of OBS calls. + +Procedure +--------- + +Set the following parameter in Spark SQL and then run the SQL statement: + +.. code-block:: + + set spark.sql.statistics.fallBackToHdfs=false; + +Alternatively, run the **--conf** command to set this parameter to **false** before startup. + +.. code-block:: + + --conf spark.sql.statistics.fallBackToHdfs=false diff --git a/umn/source/faq/big_data_service_development/what_access_protocols_are_supported_by_kafka.rst b/umn/source/faq/big_data_service_development/what_access_protocols_are_supported_by_kafka.rst new file mode 100644 index 0000000..3a3f13d --- /dev/null +++ b/umn/source/faq/big_data_service_development/what_access_protocols_are_supported_by_kafka.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1146.html + +.. _mrs_03_1146: + +What Access Protocols Are Supported by Kafka? +============================================= + +Kafka supports PLAINTEXT, SSL, SASL_PLAINTEXT, and SASL_SSL. diff --git a/umn/source/faq/big_data_service_development/what_compression_algorithms_does_hbase_support.rst b/umn/source/faq/big_data_service_development/what_compression_algorithms_does_hbase_support.rst new file mode 100644 index 0000000..323db3d --- /dev/null +++ b/umn/source/faq/big_data_service_development/what_compression_algorithms_does_hbase_support.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1042.html + +.. _mrs_03_1042: + +What Compression Algorithms Does HBase Support? +=============================================== + +HBase supports the Snappy, LZ4, and gzip compression algorithms. diff --git a/umn/source/faq/big_data_service_development/what_compression_algorithms_does_kudu_support.rst b/umn/source/faq/big_data_service_development/what_compression_algorithms_does_kudu_support.rst new file mode 100644 index 0000000..db34c6c --- /dev/null +++ b/umn/source/faq/big_data_service_development/what_compression_algorithms_does_kudu_support.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1067.html + +.. _mrs_03_1067: + +What Compression Algorithms Does Kudu Support? +============================================== + +Kudu supports **Snappy**, **LZ4**, and **zlib**. **LZ4** is used by default. diff --git a/umn/source/faq/big_data_service_development/what_if_an_excel_file_downloaded_on_hue_failed_to_open.rst b/umn/source/faq/big_data_service_development/what_if_an_excel_file_downloaded_on_hue_failed_to_open.rst new file mode 100644 index 0000000..cda490c --- /dev/null +++ b/umn/source/faq/big_data_service_development/what_if_an_excel_file_downloaded_on_hue_failed_to_open.rst @@ -0,0 +1,74 @@ +:original_name: mrs_03_1160.html + +.. _mrs_03_1160: + +What If an Excel File Downloaded on Hue Failed to Open? +======================================================= + +.. 
note:: + + This section applies only to versions earlier than MRS 3.\ *x*. + +#. Log in to a Master node as user **root** and switch to user **omm**. + + **su - omm** + +#. Check whether the current node is the active OMS node. + + **sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh** + + If **active** is displayed in the command output, the node is the active node. Otherwise, log in to the other Master node. + + + .. figure:: /_static/images/en-us_image_0000001439442217.png + :alt: **Figure 1** Active OMS node + + **Figure 1** Active OMS node + +#. Go to the **{BIGDATA_HOME}/Apache-httpd-*/conf** directory. + + **cd ${BIGDATA_HOME}/Apache-httpd-*/conf** + +#. Open the **httpd.conf** file. + + **vim httpd.conf** + +#. Search for **21201** in the file and delete the following content from the file. (The values of *proxy_ip* and *proxy_port* are the same as those in the actual environment.) + + .. code-block:: + + ProxyHTMLEnable On + SetEnv PROXY_PREFIX=https://[proxy_ip]:[proxy_port] + ProxyHTMLURLMap (https?:\/\/[^:]*:[0-9]*.*) ${PROXY_PREFIX}/proxyRedirect=$1 RV + + + .. figure:: /_static/images/en-us_image_0268284607.png + :alt: **Figure 2** Content to be deleted + + **Figure 2** Content to be deleted + +#. Save the modification and exit. + +#. Open the **httpd.conf** file again, search for **proxy_hue_port**, and delete the following content: + + .. code-block:: + + ProxyHTMLEnable On + SetEnv PROXY_PREFIX=https://[proxy_ip]:[proxy_port] + ProxyHTMLURLMap (https?:\/\/[^:]*:[0-9]*.*) ${PROXY_PREFIX}/proxyRedirect=$1 RV + + + .. figure:: /_static/images/en-us_image_0268298534.png + :alt: **Figure 3** Content to be deleted + + **Figure 3** Content to be deleted + +#. Save the modification and exit. + +#. Run the following command to restart the **httpd** process: + + **sh ${BIGDATA_HOME}/Apache-httpd-\*/setup/restarthttpd.sh** + +#. Check whether the **httpd.conf** file on the standby Master node is modified. If the file is modified, no further action is required. If the file is not modified, modify the **httpd.conf** file on the standby Master node in the same way. You do not need to restart the **httpd** process. + +#. Download the Excel file again. You can open the file successfully. diff --git a/umn/source/faq/big_data_service_development/what_is_the_port_for_accessing_hdfs_using_python.rst b/umn/source/faq/big_data_service_development/what_is_the_port_for_accessing_hdfs_using_python.rst new file mode 100644 index 0000000..66eb817 --- /dev/null +++ b/umn/source/faq/big_data_service_development/what_is_the_port_for_accessing_hdfs_using_python.rst @@ -0,0 +1,172 @@ +:original_name: mrs_03_1060.html + +.. _mrs_03_1060: + +What Is the Port for Accessing HDFS Using Python? +================================================= + +The default port of open source HDFS is **50070** for versions earlier than MRS 3.0.0, and **9870** for MRS 3.0.0 or later. :ref:`Common HDFS Ports ` describes the common ports of HDFS. + +.. _mrs_03_1060__section588612577293: + +Common HDFS Ports +----------------- + +The protocol type of all ports in the table is TCP (for MRS 1.7.0 or later). 
+ ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| Parameter | Default Port | Default Port | Port Description | +| | | | | +| | (MRS 1.6.3 or Earlier) | (MRS 1.7.0 or Later) | | ++============================+========================+=============================================+============================================================================================================================+ +| dfs.namenode.rpc.port | 25000 | - 9820 (versions earlier than MRS 3.\ *x*) | NameNode RPC port | +| | | - 8020 (MRS 3.\ *x* and later) | | +| | | | This port is used for: | +| | | | | +| | | | 1. Communication between the HDFS client and NameNode | +| | | | | +| | | | 2. Connection between the DataNode and NameNode | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.namenode.http.port | 25002 | 9870 | HDFS HTTP port (NameNode) | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | 1. Point-to-point NameNode checkpoint operations. | +| | | | | +| | | | 2. Connecting the remote web client to the NameNode UI | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.namenode.https.port | 25003 | 9871 | HDFS HTTPS port (NameNode) | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | 1. Point-to-point NameNode checkpoint operations | +| | | | | +| | | | 2. Connecting the remote web client to the NameNode UI | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.datanode.ipc.port | 25008 | 9867 | IPC server port of DataNode | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connection between the client and DataNode to perform RPC operations. | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. 
| +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.datanode.port | 25009 | 9866 | DataNode data transmission port | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | 1. Transmitting data from HDFS client from or to the DataNode | +| | | | | +| | | | 2. Point-to-point DataNode data transmission | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.datanode.http.port | 25010 | 9864 | DataNode HTTP port | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connecting to the DataNode from the remote web client in security mode | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.datanode.https.port | 25011 | 9865 | HTTPS port of DataNode | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connecting to the DataNode from the remote web client in security mode | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.JournalNode.rpc.port | 25012 | 8485 | RPC port of JournalNode | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Client communication to access multiple types of information | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. 
| +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.journalnode.http.port | 25013 | 8480 | JournalNode HTTP port | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connecting to the JournalNode from the remote web client in security mode | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| dfs.journalnode.https.port | 25014 | 8481 | HTTPS port of JournalNode | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connecting to the JournalNode from the remote web client in security mode | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ +| httpfs.http.port | 25018 | 14000 | Listening port of the HttpFS HTTP server | +| | | | | +| | | | This port is used for: | +| | | | | +| | | | Connecting to the HttpFS from the remote REST API | +| | | | | +| | | | .. note:: | +| | | | | +| | | | The port ID is a recommended value and is specified based on the product. The port range is not restricted in the code. | +| | | | | +| | | | - Is the port enabled by default during the installation: Yes | +| | | | - Is the port enabled after security hardening: Yes | ++----------------------------+------------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/faq/big_data_service_development/what_is_the_recommended_number_type_of_dynamodb_in_hive_tables.rst b/umn/source/faq/big_data_service_development/what_is_the_recommended_number_type_of_dynamodb_in_hive_tables.rst new file mode 100644 index 0000000..8e7753f --- /dev/null +++ b/umn/source/faq/big_data_service_development/what_is_the_recommended_number_type_of_dynamodb_in_hive_tables.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1047.html + +.. _mrs_03_1047: + +What Is the Recommended Number Type of DynamoDB in Hive Tables? +=============================================================== + +**smallint** is recommended. 
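As a supplement to "What Is the Port for Accessing HDFS Using Python?" above, the following is a minimal sketch that lists the HDFS root directory over WebHDFS. It assumes the third-party **hdfs** Python package (not delivered with MRS) is installed on the client node, that the cluster is MRS 3.0.0 or later (NameNode HTTP port **9870**), and that simple authentication is sufficient; the NameNode IP address and the user are placeholders:

.. code-block::

   # A sketch using the third-party "hdfs" package (a WebHDFS client): pip install hdfs
   from hdfs import InsecureClient

   # Use port 9870 for MRS 3.0.0 or later; use 50070 for earlier versions.
   client = InsecureClient('http://<NameNode IP>:9870', user='hdfs')

   # List the files in the HDFS root directory.
   print(client.list('/'))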
diff --git a/umn/source/faq/big_data_service_development/where_are_the_execution_logs_of_spark_jobs_stored.rst b/umn/source/faq/big_data_service_development/where_are_the_execution_logs_of_spark_jobs_stored.rst new file mode 100644 index 0000000..388515f --- /dev/null +++ b/umn/source/faq/big_data_service_development/where_are_the_execution_logs_of_spark_jobs_stored.rst @@ -0,0 +1,9 @@ +:original_name: mrs_03_1159.html + +.. _mrs_03_1159: + +Where Are the Execution Logs of Spark Jobs Stored? +================================================== + +- Logs of unfinished Spark jobs are stored in the **/srv/BigData/hadoop/data1/nm/containerlogs/** directory on the Core node. +- Logs of finished Spark jobs are stored in the **/tmp/logs/**\ *username*\ **/logs** directory of HDFS. diff --git a/umn/source/faq/big_data_service_development/where_are_the_jar_files_and_environment_variables_of_hadoop_stored.rst b/umn/source/faq/big_data_service_development/where_are_the_jar_files_and_environment_variables_of_hadoop_stored.rst new file mode 100644 index 0000000..5366d75 --- /dev/null +++ b/umn/source/faq/big_data_service_development/where_are_the_jar_files_and_environment_variables_of_hadoop_stored.rst @@ -0,0 +1,12 @@ +:original_name: mrs_03_1064.html + +.. _mrs_03_1064: + +Where Are the JAR Files and Environment Variables of Hadoop Stored? +=================================================================== + +- **hadoopstreaming.jar**: **/opt/share/hadoop-streaming-\*** (**\*** indicates the Hadoop version.) + +- JDK environment variables: **/opt/client/JDK/component_env** +- Hadoop environment variables: **/opt/client/HDFS/component_env** +- Hadoop client: **/opt/client/HDFS/hadoop** diff --git a/umn/source/faq/client_usage/an_error_is_reported_when_the_kinit_command_is_executed_on_a_client_node_outside_an_mrs_cluster.rst b/umn/source/faq/client_usage/an_error_is_reported_when_the_kinit_command_is_executed_on_a_client_node_outside_an_mrs_cluster.rst new file mode 100644 index 0000000..afbddf6 --- /dev/null +++ b/umn/source/faq/client_usage/an_error_is_reported_when_the_kinit_command_is_executed_on_a_client_node_outside_an_mrs_cluster.rst @@ -0,0 +1,35 @@ +:original_name: mrs_03_1251.html + +.. _mrs_03_1251: + +An Error Is Reported When the kinit Command Is Executed on a Client Node Outside an MRS Cluster +=============================================================================================== + +Symptom +------- + +After the client is installed on a node outside an MRS cluster and the **kinit** command is executed, the following error information is displayed: + +.. code-block:: + + -bash kinit Permission denied + +The following error information is displayed when the **java** command is executed: + +.. code-block:: + + -bash: /xxx/java: Permission denied + +After running the **ll /**\ *Java installation path*\ **/JDK/jdk/bin/java** command, it is found that the file execution permission is correct. + +Fault Locating +-------------- + +Run the **mount \| column -t** command to check the status of the mounted partition. It is found that the partition status of the mount point where the Java execution file is located is **noexec**. In the current environment, the data disk where the MRS client is installed is set to **noexec**, that is, binary file execution is prohibited. As a result, Java commands cannot be executed. + +Solution +-------- + +#. Log in to the node where the MRS client is located as user **root**. +#. 
Remove the configuration item **noexec** of the data disk where the MRS client is located from the **/etc/fstab** file. +#. Run the **umount** command to detach the data disk, and then run the **mount -a** command to remount the data disk. diff --git a/umn/source/faq/client_usage/how_do_i_configure_environment_variables_and_run_commands_on_a_component_client.rst b/umn/source/faq/client_usage/how_do_i_configure_environment_variables_and_run_commands_on_a_component_client.rst new file mode 100644 index 0000000..6c7c39b --- /dev/null +++ b/umn/source/faq/client_usage/how_do_i_configure_environment_variables_and_run_commands_on_a_component_client.rst @@ -0,0 +1,18 @@ +:original_name: mrs_03_1031.html + +.. _mrs_03_1031: + +How Do I Configure Environment Variables and Run Commands on a Component Client? +================================================================================ + +#. Log in to any Master node as user **root**. + +#. Run the **su - omm** command to switch to user **omm**. + +#. Run the **cd** *Client installation directory* command to switch to the client. + +#. Run the **source bigdata_env** command to configure environment variables. + + If Kerberos authentication is enabled for the current cluster, run the **kinit** *Component service user* command to authenticate the user. If Kerberos authentication is disabled, skip this step. + +#. After the environment variables are configured, run the client command of the component. For example, to view component information, you can run the HDFS client command **hdfs dfs -ls /** to view the HDFS root directory file. diff --git a/umn/source/faq/client_usage/how_do_i_disable_zookeeper_sasl_authentication.rst b/umn/source/faq/client_usage/how_do_i_disable_zookeeper_sasl_authentication.rst new file mode 100644 index 0000000..c14e2cf --- /dev/null +++ b/umn/source/faq/client_usage/how_do_i_disable_zookeeper_sasl_authentication.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1219.html + +.. _mrs_03_1219: + +How Do I Disable ZooKeeper SASL Authentication? +=============================================== + +Log in to FusionInsight Manager, choose **Cluster** > **Services** > **ZooKeeper**, click the **Configurations** tab and then **All Configurations**. In the navigation pane on the left, choose **quorumpeer(Role)** > **Customization**, add the **set zookeeper.sasl.disable** parameter, and set its value to **false**. Save the configuration and restart the ZooKeeper service. diff --git a/umn/source/faq/client_usage/index.rst b/umn/source/faq/client_usage/index.rst new file mode 100644 index 0000000..298ff72 --- /dev/null +++ b/umn/source/faq/client_usage/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_03_2005.html + +.. _mrs_03_2005: + +Client Usage +============ + +- :ref:`How Do I Configure Environment Variables and Run Commands on a Component Client? ` +- :ref:`How Do I Disable ZooKeeper SASL Authentication? ` +- :ref:`An Error Is Reported When the kinit Command Is Executed on a Client Node Outside an MRS Cluster ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + how_do_i_configure_environment_variables_and_run_commands_on_a_component_client + how_do_i_disable_zookeeper_sasl_authentication + an_error_is_reported_when_the_kinit_command_is_executed_on_a_client_node_outside_an_mrs_cluster diff --git a/umn/source/faq/cluster_access/can_i_switch_between_the_two_login_modes_of_mrs.rst b/umn/source/faq/cluster_access/can_i_switch_between_the_two_login_modes_of_mrs.rst new file mode 100644 index 0000000..599244c --- /dev/null +++ b/umn/source/faq/cluster_access/can_i_switch_between_the_two_login_modes_of_mrs.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1029.html + +.. _mrs_03_1029: + +Can I Switch Between the Two Login Modes of MRS? +================================================ + +No. You can select the login mode when creating the cluster. You cannot change the login mode after the cluster is created. diff --git a/umn/source/faq/cluster_access/how_can_i_obtain_the_ip_address_and_port_number_of_a_zookeeper_instance.rst b/umn/source/faq/cluster_access/how_can_i_obtain_the_ip_address_and_port_number_of_a_zookeeper_instance.rst new file mode 100644 index 0000000..7c5da3c --- /dev/null +++ b/umn/source/faq/cluster_access/how_can_i_obtain_the_ip_address_and_port_number_of_a_zookeeper_instance.rst @@ -0,0 +1,29 @@ +:original_name: mrs_03_1071.html + +.. _mrs_03_1071: + +How Can I Obtain the IP Address and Port Number of a ZooKeeper Instance? +======================================================================== + +You can obtain the IP address and port number of a ZooKeeper instance through the MRS console or FusionInsight Manager. + +Method 1: Obtaining the IP address and port number of a ZooKeeper instance through the MRS console + +#. On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users. +#. Click the **Components** tab and choose **ZooKeeper**. On the displayed page, click **Instances** to view the business IP address of a ZooKeeper instance. +#. Click the **Service Configuration** tab. On the displayed page, search for the **clientPort** parameter to view the port number of the ZooKeeper instance. + +Method 2: Obtaining the IP address and port number of a ZooKeeper instance through FusionInsight Manager + +#. Log in to FusionInsight Manager. +#. Perform the following operations to obtain the IP address and port number of a ZooKeeper instance. + + - For clusters of versions earlier than MRS 3.\ *x* + + a. Choose **Services** > **ZooKeeper**. On the displayed page, click the **Instance** tab to view the business IP address of a ZooKeeper instance. + b. Click the **Service Configuration** tab. On the displayed page, search for the **clientPort** parameter to view the port number of the ZooKeeper instance. + + - For clusters of MRS 3.\ *x* or later + + a. Choose **Cluster** > **Services** > **ZooKeeper**. On the displayed page, click the **Instance** tab to view the business IP address of a ZooKeeper instance. + b. Click the **Configurations** tab. On the displayed page, search for the **clientPort** parameter to view the port number of the ZooKeeper instance.
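After the IP address and the **clientPort** value are obtained, connectivity can be verified from a node where the cluster client is installed (run **source bigdata_env** in the client directory first). The following is only a minimal sketch; the ZooKeeper instance address is a placeholder:

.. code-block::

   zkCli.sh -server <ZooKeeper instance IP>:2181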
diff --git a/umn/source/faq/cluster_access/how_do_i_access_an_mrs_cluster_from_a_node_outside_the_cluster.rst b/umn/source/faq/cluster_access/how_do_i_access_an_mrs_cluster_from_a_node_outside_the_cluster.rst new file mode 100644 index 0000000..d880db1 --- /dev/null +++ b/umn/source/faq/cluster_access/how_do_i_access_an_mrs_cluster_from_a_node_outside_the_cluster.rst @@ -0,0 +1,59 @@ +:original_name: mrs_03_1234.html + +.. _mrs_03_1234: + +How Do I Access an MRS Cluster from a Node Outside the Cluster? +=============================================================== + +Creating a Linux ECS Outside the Cluster to Access the MRS Cluster +------------------------------------------------------------------ + +#. Create an ECS outside the cluster. + + Set **AZ**, **VPC**, and **Security Group** of the ECS to the same values as those of the cluster to be accessed. + +2. On the VPC management console, apply for an EIP and bind it to the ECS. +3. Configure security group rules for the cluster. + + a. On the **Dashboard** tab page, click **Add Security Group Rule**. In the **Add Security Group Rule** dialog box that is displayed, click **Manage Security Group Rule**. + + b. Click the **Inbound Rules** tab, and click **Add Rule**. In the **Add Inbound Rule** dialog box, configure the IP address of the ECS and enable all ports. + + c. After the security group rule is added, you can download and install the client on the ECS.. + + d. Use the client. + + Log in to the client node as the client installation user and run the following command to switch to the client directory: + + **cd /opt/hadoopclient** + + Run the following command to load environment variables: + + **source bigdata_env** + + If Kerberos authentication is enabled for the cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, authentication is not required. + + **kinit** *MRS cluster user* + + Example: + + **kinit admin** + + Run the client command of a component. + + Example: + + Run the following command to view files in the HDFS root directory: + + **hdfs dfs -ls /** + + .. code-block:: + + Found 15 items + drwxrwx--x - hive hive 0 2021-10-26 16:30 /apps + drwxr-xr-x - hdfs hadoop 0 2021-10-18 20:54 /datasets + drwxr-xr-x - hdfs hadoop 0 2021-10-18 20:54 /datastore + drwxrwx---+ - flink hadoop 0 2021-10-18 21:10 /flink + drwxr-x--- - flume hadoop 0 2021-10-18 20:54 /flume + drwxrwx--x - hbase hadoop 0 2021-10-30 07:31 /hbase + ... diff --git a/umn/source/faq/cluster_access/how_do_i_do_if_a_new_node_cannot_be_logged_in_to_as_a_linux_user.rst b/umn/source/faq/cluster_access/how_do_i_do_if_a_new_node_cannot_be_logged_in_to_as_a_linux_user.rst new file mode 100644 index 0000000..7133875 --- /dev/null +++ b/umn/source/faq/cluster_access/how_do_i_do_if_a_new_node_cannot_be_logged_in_to_as_a_linux_user.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1185.html + +.. _mrs_03_1185: + +How Do I Do If a New Node Cannot Be logged In to as a Linux User? +================================================================= + +If you can log in to an existing node as the Linux user but fail to log in to the newly added node, log in to the newly added node as the root user. diff --git a/umn/source/faq/cluster_access/index.rst b/umn/source/faq/cluster_access/index.rst new file mode 100644 index 0000000..ee9323e --- /dev/null +++ b/umn/source/faq/cluster_access/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_03_2013.html + +.. 
_mrs_03_2013: + +Cluster Access +============== + +- :ref:`Can I Switch Between the Two Login Modes of MRS? ` +- :ref:`How Can I Obtain the IP Address and Port Number of a ZooKeeper Instance? ` +- :ref:`How Do I Do If a New Node Cannot Be logged In to as a Linux User? ` +- :ref:`How Do I Access an MRS Cluster from a Node Outside the Cluster? ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + can_i_switch_between_the_two_login_modes_of_mrs + how_can_i_obtain_the_ip_address_and_port_number_of_a_zookeeper_instance + how_do_i_do_if_a_new_node_cannot_be_logged_in_to_as_a_linux_user + how_do_i_access_an_mrs_cluster_from_a_node_outside_the_cluster diff --git a/umn/source/faq/cluster_management/can_i_add_components_to_an_existing_cluster.rst b/umn/source/faq/cluster_management/can_i_add_components_to_an_existing_cluster.rst new file mode 100644 index 0000000..f04e489 --- /dev/null +++ b/umn/source/faq/cluster_management/can_i_add_components_to_an_existing_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1024.html + +.. _mrs_03_1024: + +Can I Add Components to an Existing Cluster? +============================================ + +You cannot add or remove any component to and from a created cluster of MRS 3.1.0. However, you can create an MRS cluster that contains the required components. diff --git a/umn/source/faq/cluster_management/can_i_change_mrs_cluster_nodes_on_the_mrs_console.rst b/umn/source/faq/cluster_management/can_i_change_mrs_cluster_nodes_on_the_mrs_console.rst new file mode 100644 index 0000000..1cb06db --- /dev/null +++ b/umn/source/faq/cluster_management/can_i_change_mrs_cluster_nodes_on_the_mrs_console.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1034.html + +.. _mrs_03_1034: + +Can I Change MRS Cluster Nodes on the MRS Console? +================================================== + +You cannot change MRS cluster nodes on the MRS console. You are also advised not to change MRS cluster nodes on the ECS console. Manually stopping or deleting an ECS, modifying or reinstalling the ECS OS, or modifying ECS specifications for a cluster node on the ECS console will affect the cluster stability. + +If an ECS is deleted, the ECS OS is modified or reinstalled, or the ECS specifications are modified on the ECS console, MRS will automatically identify and delete the node. You can log in to the MRS console and restore the deleted node through scale-out. Do not perform operations on the nodes that are being scaled out. diff --git a/umn/source/faq/cluster_management/can_i_delete_components_installed_in_an_mrs_cluster.rst b/umn/source/faq/cluster_management/can_i_delete_components_installed_in_an_mrs_cluster.rst new file mode 100644 index 0000000..3e6dddb --- /dev/null +++ b/umn/source/faq/cluster_management/can_i_delete_components_installed_in_an_mrs_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1028.html + +.. _mrs_03_1028: + +Can I Delete Components Installed in an MRS Cluster? +==================================================== + +You cannot delete any component from a created MRS cluster of MRS 3.1.0. If a component is not required, log in to MRS Manager and stop the component on the **Services** page. diff --git a/umn/source/faq/cluster_management/can_i_expand_data_disk_capacity_for_mrs.rst b/umn/source/faq/cluster_management/can_i_expand_data_disk_capacity_for_mrs.rst new file mode 100644 index 0000000..fcb20d2 --- /dev/null +++ b/umn/source/faq/cluster_management/can_i_expand_data_disk_capacity_for_mrs.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1018.html + +.. 
_mrs_03_1018: + +Can I Expand Data Disk Capacity for MRS? +======================================== + +You can expand data disk capacity for MRS during off-peak hours. + +Expand the EVS disk capacity, and then log in to the ECS and expand the partitions and file system. MRS nodes are installed using public images and support the capacity expansion of in-use EVS disks. diff --git a/umn/source/faq/cluster_management/how_do_i_adjust_the_memory_size_of_the_manager-executor_process.rst b/umn/source/faq/cluster_management/how_do_i_adjust_the_memory_size_of_the_manager-executor_process.rst new file mode 100644 index 0000000..b2aa07f --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_adjust_the_memory_size_of_the_manager-executor_process.rst @@ -0,0 +1,32 @@ +:original_name: mrs_03_1228.html + +.. _mrs_03_1228: + +How Do I Adjust the Memory Size of the manager-executor Process? +================================================================ + +Symptom +------- + +The **manager-executor** process runs on either the Master1 or Master2 node in the MRS cluster in active/standby mode. This process encapsulates the MRS management and control plane's operations on the MRS cluster, such as job submission, heartbeat reporting, and certain alarm reporting, as well as cluster creation, scale-out, and scale-in. When you submit jobs on the MRS management and control plane, the Executor memory may become insufficient as the number of tasks or concurrent tasks increases. As a result, the CPU usage is high and the Executor process experiences out-of-memory (OOM) errors. + +Procedure +--------- + +#. Log in to either the Master1 or Master2 node as user **root** and run the following command to switch to user **omm**: + + **su - omm** + +#. Run the following command to modify the **catalina.sh** script. Specifically, search for **JAVA_OPTS** in the script, find the configuration item similar to **JAVA_OPTS="-Xms1024m -Xmx4096m"**, change the values as required, and save the modification. + + **vim /opt/executor/bin/catalina.sh** + +#. The **manager-executor** process runs only on either the Master1 or Master2 node in active/standby mode. Check whether it exists on the node before restarting it. + + a. Log in to the Master1 and Master2 nodes and run the following command to check whether the process exists. If any command output is displayed, the process exists. + + **ps -ef \| grep "/opt/executor" \| grep -v grep** + + b. Run the following commands to restart the process: + + **sh /opt/executor/bin/shutdown.sh** **sh /opt/executor/bin/startup.sh** diff --git a/umn/source/faq/cluster_management/how_do_i_configure_the_knox_memory.rst b/umn/source/faq/cluster_management/how_do_i_configure_the_knox_memory.rst new file mode 100644 index 0000000..44f15b5 --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_configure_the_knox_memory.rst @@ -0,0 +1,32 @@ +:original_name: mrs_03_1162.html + +.. _mrs_03_1162: + +How Do I Configure the knox Memory? +=================================== + +#. Log in to a Master node of the cluster as user **root**. + +#. Run the following commands on the Master node to open the **gateway.sh** file: + + **su omm** + + **vim /opt/knox/bin/gateway.sh** + +#. Change **APP_MEM_OPTS=""** to **APP_MEM_OPTS="-Xms256m -Xmx768m"**, save the file, and exit. + +#. Run the following commands on the Master node to restart the knox process: + + **sh /opt/knox/bin/gateway.sh stop** + + **sh /opt/knox/bin/gateway.sh start** + +#.
Repeat the preceding steps on each Master node. + +#. Run the **ps -ef \|grep knox** command to check the configured memory. + + + .. figure:: /_static/images/en-us_image_0293101307.png + :alt: **Figure 1** knox memory + + **Figure 1** knox memory diff --git a/umn/source/faq/cluster_management/how_do_i_do_if_the_time_on_mrs_nodes_is_incorrect.rst b/umn/source/faq/cluster_management/how_do_i_do_if_the_time_on_mrs_nodes_is_incorrect.rst new file mode 100644 index 0000000..deb7497 --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_do_if_the_time_on_mrs_nodes_is_incorrect.rst @@ -0,0 +1,34 @@ +:original_name: mrs_03_1211.html + +.. _mrs_03_1211: + +How Do I Do If the Time on MRS Nodes Is Incorrect? +================================================== + +- If the time on a node inside the cluster is incorrect, log in to the node and rectify the fault from :ref:`2 `. +- If the time on a node inside the cluster is different from that on a node outside the cluster, log in to the node and rectify the fault from :ref:`1 `. + +#. .. _mrs_03_1211__li192495514119: + + Run the **vi /etc/ntp.conf** command to edit the NTP client configuration file, add the IP addresses of the master node in the MRS cluster, and comment out the IP address of other servers. + + .. code-block:: + + server master1_ip prefer + server master2_ip + + + .. figure:: /_static/images/en-us_image_0000001439594513.png + :alt: **Figure 1** Adding the master node IP addresses + + **Figure 1** Adding the master node IP addresses + +#. .. _mrs_03_1211__li1924115541119: + + Run the **service ntpd stop** command to stop the NTP service. + +#. Run the **/usr/sbin/ntpdate** *IP address of the active master node* command to manually synchronize time. + +#. Run the **service ntpd start** or **systemctl restart ntpd** command to start the NTP service. + +#. Run the **ntpstat** command to check the time synchronization result: diff --git a/umn/source/faq/cluster_management/how_do_i_do_if_trust_relationships_between_nodes_are_abnormal.rst b/umn/source/faq/cluster_management/how_do_i_do_if_trust_relationships_between_nodes_are_abnormal.rst new file mode 100644 index 0000000..e1d126b --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_do_if_trust_relationships_between_nodes_are_abnormal.rst @@ -0,0 +1,32 @@ +:original_name: mrs_03_1212.html + +.. _mrs_03_1212: + +How Do I Do If Trust Relationships Between Nodes Are Abnormal? +============================================================== + +If "ALM-12066 Inter-Node Mutual Trust Fails" is reported on Manager or there is no SSH trust relationship between nodes, rectify the fault by performing the following operations: + +#. Run the **ssh-add -l** command on both nodes of the trusted cluster to check whether there are identities. + + |image1| + +#. If no identities are displayed, run the **ps -ef|grep ssh-agent** command to find the ssh-agent process, kill the process, and wait for the process to automatically restart. + + |image2| + +#. Run the **ssh-add -l** command to check whether the identities have been added. If yes, manually run the **ssh** command to check whether the trust relationship is normal. + + |image3| + +#. If identities exist, check whether the **authorized_keys** file in the **/home/omm/.ssh** directory contains the information in the **id_rsa.pub** file in the **/home/omm/.ssh** of the peer node. If no, manually add the information about the peer node. + +#. Check whether the permissions on the files in **/home/omm/.ssh** directory are correct. + +#. 
Check the **/var/log/Bigdata/nodeagent/scriptlog/ssh-agent-monitor.log** file. + +#. If the **home** directory of user **omm** is deleted, contact MRS support personnel. + +.. |image1| image:: /_static/images/en-us_image_0000001184290228.png +.. |image2| image:: /_static/images/en-us_image_0000001229609903.png +.. |image3| image:: /_static/images/en-us_image_0000001229690017.png diff --git a/umn/source/faq/cluster_management/how_do_i_install_kafka_and_flume_in_an_mrs_cluster.rst b/umn/source/faq/cluster_management/how_do_i_install_kafka_and_flume_in_an_mrs_cluster.rst new file mode 100644 index 0000000..82bed7f --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_install_kafka_and_flume_in_an_mrs_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1054.html + +.. _mrs_03_1054: + +How Do I Install Kafka and Flume in an MRS Cluster? +=================================================== + +You cannot install the Kafka and Flume components for a created cluster of MRS 3.1.0 or earlier. Kafka and Flume are components for a streaming cluster. To install Kafka and Flume, create a streaming or hybrid cluster, and install Kafka and Flume. diff --git a/umn/source/faq/cluster_management/how_do_i_query_the_startup_time_of_an_mrs_node.rst b/umn/source/faq/cluster_management/how_do_i_query_the_startup_time_of_an_mrs_node.rst new file mode 100644 index 0000000..0bf4627 --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_query_the_startup_time_of_an_mrs_node.rst @@ -0,0 +1,14 @@ +:original_name: mrs_03_1250.html + +.. _mrs_03_1250: + +How Do I Query the Startup Time of an MRS Node? +=============================================== + +Log in to the target node and run the following command to query the startup time: + +**date -d "$(awk -F. '{print $1}' /proc/uptime) second ago" +"%Y-%m-%d %H:%M:%S"** + +|image1| + +.. |image1| image:: /_static/images/en-us_image_0000001374635732.png diff --git a/umn/source/faq/cluster_management/how_do_i_shield_cluster_alarm_event_notifications.rst b/umn/source/faq/cluster_management/how_do_i_shield_cluster_alarm_event_notifications.rst new file mode 100644 index 0000000..fa2d697 --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_shield_cluster_alarm_event_notifications.rst @@ -0,0 +1,12 @@ +:original_name: mrs_03_1130.html + +.. _mrs_03_1130: + +How Do I Shield Cluster Alarm/Event Notifications? +================================================== + +#. Log in to the MRS console. +#. Click the name of the cluster. +#. On the page displayed, choose **Alarms** > **Notification Rules**. +#. Locate the row that contains the rule you want to modify, click **Edit** in the **Operation** column, and deselect the alarm or event severity levels. +#. Click **OK**. diff --git a/umn/source/faq/cluster_management/how_do_i_stop_an_mrs_cluster.rst b/umn/source/faq/cluster_management/how_do_i_stop_an_mrs_cluster.rst new file mode 100644 index 0000000..7d06f0f --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_stop_an_mrs_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1016.html + +.. _mrs_03_1016: + +How Do I Stop an MRS Cluster? +============================= + +To stop an MRS cluster, stop each node in the cluster on the ECS. Click the name of each node on the **Nodes** tab page to go to the **Elastic Cloud Server** page and click **Stop**. 
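To unpack the startup-time command in "How Do I Query the Startup Time of an MRS Node?": **/proc/uptime** records the number of seconds since the node booted, so subtracting that value from the current time yields the boot time. Most Linux distributions also provide simpler equivalents; the following sketch assumes standard **procps** and **coreutils** tools are available on the node.

.. code-block::

   # /proc/uptime holds "<seconds-since-boot> <idle-seconds>"; awk keeps the integer part,
   # and date subtracts it from the current time to print the boot time.
   date -d "$(awk -F. '{print $1}' /proc/uptime) second ago" +"%Y-%m-%d %H:%M:%S"

   # Simpler equivalents on most distributions:
   uptime -s      # prints the boot time directly
   who -b         # prints "system boot  YYYY-MM-DD HH:MM"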
diff --git a/umn/source/faq/cluster_management/how_do_i_view_all_clusters.rst b/umn/source/faq/cluster_management/how_do_i_view_all_clusters.rst new file mode 100644 index 0000000..f6a48ae --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_view_all_clusters.rst @@ -0,0 +1,17 @@ +:original_name: mrs_03_1002.html + +.. _mrs_03_1002: + +How Do I View All Clusters? +=========================== + +You can view all MRS clusters on the **Clusters** page. You can view clusters in different status. + +- **Active Clusters**: all clusters except clusters in **Failed** and **Terminated** states. +- **Cluster History**: clusters in the **Terminated** state. Only the clusters terminated within the last six months are displayed. If you want to view clusters terminated more than six months ago, contact technical support engineers. +- **Failed Tasks**: tasks in **Failed** state. The failed tasks include the following: + + - Tasks failed to create clusters + - Tasks failed to terminate clusters + - Tasks failed to scale out clusters + - Tasks failed to scale in clusters diff --git a/umn/source/faq/cluster_management/how_do_i_view_cluster_configuration_information.rst b/umn/source/faq/cluster_management/how_do_i_view_cluster_configuration_information.rst new file mode 100644 index 0000000..4369722 --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_view_cluster_configuration_information.rst @@ -0,0 +1,9 @@ +:original_name: mrs_03_1004.html + +.. _mrs_03_1004: + +How Do I View Cluster Configuration Information? +================================================ + +- After a cluster is created, click the cluster name on the MRS console. On the page displayed, you can view basic configuration information about the cluster. The instance specifications and node capacity determine the data analysis and processing capability. Higher instance specifications and larger capacity enable faster data processing at a higher cost. +- On the basic information page, click **Access Manager** to access the MRS cluster management page. On MRS Manager, you can view and handle alarms, and modify cluster configuration. diff --git a/umn/source/faq/cluster_management/how_do_i_view_log_information.rst b/umn/source/faq/cluster_management/how_do_i_view_log_information.rst new file mode 100644 index 0000000..cb6a7bc --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_view_log_information.rst @@ -0,0 +1,25 @@ +:original_name: mrs_03_1003.html + +.. _mrs_03_1003: + +How Do I View Log Information? +============================== + +You can view operation logs of clusters and jobs on the **Operation Logs** page. The MRS operation logs record the following operations: + +- Cluster operations + + - Create, terminate, and scale out or in clusters + - Create directories and delete directories or files + +- Job operations: Create, stop, and delete jobs +- Data operations: IAM user tasks, add users, and add user groups + +:ref:`Figure 1 ` shows the operation logs. + +.. _mrs_03_1003__fig36909253103815: + +.. 
figure:: /_static/images/en-us_image_0000001388541980.png + :alt: **Figure 1** Log information + + **Figure 1** Log information diff --git a/umn/source/faq/cluster_management/how_do_i_view_the_configuration_file_directory_of_each_component.rst b/umn/source/faq/cluster_management/how_do_i_view_the_configuration_file_directory_of_each_component.rst new file mode 100644 index 0000000..6fae08b --- /dev/null +++ b/umn/source/faq/cluster_management/how_do_i_view_the_configuration_file_directory_of_each_component.rst @@ -0,0 +1,41 @@ +:original_name: mrs_03_1198.html + +.. _mrs_03_1198: + +How Do I View the Configuration File Directory of Each Component? +================================================================= + +The configuration file paths of commonly used components are as follows: + ++-----------------------------------+-------------------------------------------------------------------------------------+ +| Component | Configuration File Directory | ++===================================+=====================================================================================+ +| ClickHouse | *Client installation directory*\ **/ClickHouse/clickhouse/config** | ++-----------------------------------+-------------------------------------------------------------------------------------+ +| Flink | *Client installation directory*\ **/Flink/flink/conf** | ++-----------------------------------+-------------------------------------------------------------------------------------+ +| Flume | *Client installation directory/*\ **fusioninsight-flume-**\ *xxx*\ **/conf** | ++-----------------------------------+-------------------------------------------------------------------------------------+ +| HBase | *Client installation directory*\ **/HBase/hbase/conf** | ++-----------------------------------+-------------------------------------------------------------------------------------+ +| HDFS | *Client installation directory*\ **/HDFS/hadoop/logs/hadoop.log** | ++-----------------------------------+-------------------------------------------------------------------------------------+ +| Hive | *Client installation directory*\ **/Hive/config** | ++-----------------------------------+-------------------------------------------------------------------------------------+ +| Hudi | *Client installation directory*\ **/Hudi/hudi/conf** | ++-----------------------------------+-------------------------------------------------------------------------------------+ +| Kafka | *Client installation directory*\ **/Kafka/kafka/config** | ++-----------------------------------+-------------------------------------------------------------------------------------+ +| Loader | - *Client installation directory*\ **/Loader/loader-tools-xxx/loader-tool/conf** | +| | - *Client installation directory*\ **/Loader/loader-tools-xxx/schedule-tool/conf** | +| | - *Client installation directory*\ **/Loader/loader-tools-xxx/shell-client/conf** | +| | - *Client installation directory*\ **/Loader/loader-tools-xxx/sqoop-shell/conf** | ++-----------------------------------+-------------------------------------------------------------------------------------+ +| Oozie | *Client installation directory*\ **/Oozie/oozie-client-**\ *xxx*\ **/conf** | ++-----------------------------------+-------------------------------------------------------------------------------------+ +| Spark2x | *Client installation directory/Spark2x/spark/conf* | 
++-----------------------------------+-------------------------------------------------------------------------------------+ +| Yarn | *Client installation directory/Yarn/config* | ++-----------------------------------+-------------------------------------------------------------------------------------+ +| ZooKeeper | *Client installation directory/Zookeeper/zookeeper/conf* | ++-----------------------------------+-------------------------------------------------------------------------------------+ diff --git a/umn/source/faq/cluster_management/index.rst b/umn/source/faq/cluster_management/index.rst new file mode 100644 index 0000000..8919d5f --- /dev/null +++ b/umn/source/faq/cluster_management/index.rst @@ -0,0 +1,48 @@ +:original_name: mrs_03_2016.html + +.. _mrs_03_2016: + +Cluster Management +================== + +- :ref:`How Do I View All Clusters? ` +- :ref:`How Do I View Log Information? ` +- :ref:`How Do I View Cluster Configuration Information? ` +- :ref:`How Do I Install Kafka and Flume in an MRS Cluster? ` +- :ref:`How Do I Stop an MRS Cluster? ` +- :ref:`Can I Expand Data Disk Capacity for MRS? ` +- :ref:`Can I Add Components to an Existing Cluster? ` +- :ref:`Can I Delete Components Installed in an MRS Cluster? ` +- :ref:`Can I Change MRS Cluster Nodes on the MRS Console? ` +- :ref:`How Do I Shield Cluster Alarm/Event Notifications? ` +- :ref:`Why Is the Resource Pool Memory Displayed in the MRS Cluster Smaller Than the Actual Cluster Memory? ` +- :ref:`How Do I Configure the knox Memory? ` +- :ref:`What Is the Python Version Installed for an MRS Cluster? ` +- :ref:`How Do I View the Configuration File Directory of Each Component? ` +- :ref:`How Do I Do If the Time on MRS Nodes Is Incorrect? ` +- :ref:`How Do I Query the Startup Time of an MRS Node? ` +- :ref:`How Do I Do If Trust Relationships Between Nodes Are Abnormal? ` +- :ref:`How Do I Adjust the Memory Size of the manager-executor Process? ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + how_do_i_view_all_clusters + how_do_i_view_log_information + how_do_i_view_cluster_configuration_information + how_do_i_install_kafka_and_flume_in_an_mrs_cluster + how_do_i_stop_an_mrs_cluster + can_i_expand_data_disk_capacity_for_mrs + can_i_add_components_to_an_existing_cluster + can_i_delete_components_installed_in_an_mrs_cluster + can_i_change_mrs_cluster_nodes_on_the_mrs_console + how_do_i_shield_cluster_alarm_event_notifications + why_is_the_resource_pool_memory_displayed_in_the_mrs_cluster_smaller_than_the_actual_cluster_memory + how_do_i_configure_the_knox_memory + what_is_the_python_version_installed_for_an_mrs_cluster + how_do_i_view_the_configuration_file_directory_of_each_component + how_do_i_do_if_the_time_on_mrs_nodes_is_incorrect + how_do_i_query_the_startup_time_of_an_mrs_node + how_do_i_do_if_trust_relationships_between_nodes_are_abnormal + how_do_i_adjust_the_memory_size_of_the_manager-executor_process diff --git a/umn/source/faq/cluster_management/what_is_the_python_version_installed_for_an_mrs_cluster.rst b/umn/source/faq/cluster_management/what_is_the_python_version_installed_for_an_mrs_cluster.rst new file mode 100644 index 0000000..adc8c0e --- /dev/null +++ b/umn/source/faq/cluster_management/what_is_the_python_version_installed_for_an_mrs_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1171.html + +.. _mrs_03_1171: + +What Is the Python Version Installed for an MRS Cluster? 
+======================================================== + +Log in to a Master node as user **root** and run the **Python3** command to query the Python version. diff --git a/umn/source/faq/cluster_management/why_is_the_resource_pool_memory_displayed_in_the_mrs_cluster_smaller_than_the_actual_cluster_memory.rst b/umn/source/faq/cluster_management/why_is_the_resource_pool_memory_displayed_in_the_mrs_cluster_smaller_than_the_actual_cluster_memory.rst new file mode 100644 index 0000000..0b3718f --- /dev/null +++ b/umn/source/faq/cluster_management/why_is_the_resource_pool_memory_displayed_in_the_mrs_cluster_smaller_than_the_actual_cluster_memory.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1161.html + +.. _mrs_03_1161: + +Why Is the Resource Pool Memory Displayed in the MRS Cluster Smaller Than the Actual Cluster Memory? +==================================================================================================== + +In an MRS cluster, MRS allocates 50% of the cluster memory to Yarn by default. You manage Yarn nodes logically by resource pool. Therefore, the total memory of the resource pool displayed in the cluster is only 50% of the total memory of the cluster. diff --git a/umn/source/faq/cluster_upgrade_patching/can_i_change_the_mrs_cluster_version.rst b/umn/source/faq/cluster_upgrade_patching/can_i_change_the_mrs_cluster_version.rst new file mode 100644 index 0000000..b2166fc --- /dev/null +++ b/umn/source/faq/cluster_upgrade_patching/can_i_change_the_mrs_cluster_version.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1021.html + +.. _mrs_03_1021: + +Can I Change the MRS Cluster Version? +===================================== + +You cannot change the version of an MRS cluster. However, you can terminate the current cluster and create an MRS cluster of the version you require. diff --git a/umn/source/faq/cluster_upgrade_patching/can_i_upgrade_an_mrs_cluster.rst b/umn/source/faq/cluster_upgrade_patching/can_i_upgrade_an_mrs_cluster.rst new file mode 100644 index 0000000..e9032e7 --- /dev/null +++ b/umn/source/faq/cluster_upgrade_patching/can_i_upgrade_an_mrs_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1089.html + +.. _mrs_03_1089: + +Can I Upgrade an MRS Cluster? +============================= + +You cannot upgrade an MRS cluster. However, you can create a cluster of the target version and migrate data from the old cluster to the new cluster. diff --git a/umn/source/faq/cluster_upgrade_patching/index.rst b/umn/source/faq/cluster_upgrade_patching/index.rst new file mode 100644 index 0000000..3bf9159 --- /dev/null +++ b/umn/source/faq/cluster_upgrade_patching/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_03_2010.html + +.. _mrs_03_2010: + +Cluster Upgrade/Patching +======================== + +- :ref:`Can I Upgrade an MRS Cluster? ` +- :ref:`Can I Change the MRS Cluster Version? ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + can_i_upgrade_an_mrs_cluster + can_i_change_the_mrs_cluster_version diff --git a/umn/source/faq/index.rst b/umn/source/faq/index.rst new file mode 100644 index 0000000..573b167 --- /dev/null +++ b/umn/source/faq/index.rst @@ -0,0 +1,42 @@ +:original_name: en-us_topic_0000001349287889.html + +.. 
_en-us_topic_0000001349287889: + +FAQ +=== + +- :ref:`MRS Overview ` +- :ref:`Account and Password ` +- :ref:`Accounts and Permissions ` +- :ref:`Client Usage ` +- :ref:`Web Page Access ` +- :ref:`Alarm Monitoring ` +- :ref:`Performance Tuning ` +- :ref:`Job Development ` +- :ref:`Cluster Upgrade/Patching ` +- :ref:`Cluster Access ` +- :ref:`Big Data Service Development ` +- :ref:`API ` +- :ref:`Cluster Management ` +- :ref:`Kerberos Usage ` +- :ref:`Metadata Management ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + mrs_overview/index + account_and_password/index + accounts_and_permissions/index + client_usage/index + web_page_access/index + alarm_monitoring/index + performance_tuning/index + job_development/index + cluster_upgrade_patching/index + cluster_access/index + big_data_service_development/index + api/index + cluster_management/index + kerberos_usage/index + metadata_management/index diff --git a/umn/source/faq/job_development/can_i_run_multiple_spark_tasks_at_the_same_time_after_the_minimum_tenant_resources_of_an_mrs_cluster_is_changed_to_0.rst b/umn/source/faq/job_development/can_i_run_multiple_spark_tasks_at_the_same_time_after_the_minimum_tenant_resources_of_an_mrs_cluster_is_changed_to_0.rst new file mode 100644 index 0000000..17b3b20 --- /dev/null +++ b/umn/source/faq/job_development/can_i_run_multiple_spark_tasks_at_the_same_time_after_the_minimum_tenant_resources_of_an_mrs_cluster_is_changed_to_0.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1052.html + +.. _mrs_03_1052: + +Can I Run Multiple Spark Tasks at the Same Time After the Minimum Tenant Resources of an MRS Cluster Is Changed to 0? +===================================================================================================================== + +You can run only one Spark task at a time after the minimum tenant resources of an MRS cluster is changed to 0. diff --git a/umn/source/faq/job_development/data_import_and_export_of_distcp_jobs.rst b/umn/source/faq/job_development/data_import_and_export_of_distcp_jobs.rst new file mode 100644 index 0000000..36fdd00 --- /dev/null +++ b/umn/source/faq/job_development/data_import_and_export_of_distcp_jobs.rst @@ -0,0 +1,14 @@ +:original_name: mrs_03_1238.html + +.. _mrs_03_1238: + +Data Import and Export of DistCP Jobs +===================================== + +- Does a DistCP job compare data consistency during data import and export? + + No. DistCP jobs only copy data but do not modify it. + +- When data is exported from a DistCP job, if some files already exist in OBS, how will the job process the files? + + DistCP jobs will overwrite the files in OBS. diff --git a/umn/source/faq/job_development/how_do_i_do_if_a_flink_job_fails_to_execute_and_the_error_message_java.lang.nosuchfielderror_security_ssl_encrypt_enabled_is_displayed.rst b/umn/source/faq/job_development/how_do_i_do_if_a_flink_job_fails_to_execute_and_the_error_message_java.lang.nosuchfielderror_security_ssl_encrypt_enabled_is_displayed.rst new file mode 100644 index 0000000..7bf10d1 --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_do_if_a_flink_job_fails_to_execute_and_the_error_message_java.lang.nosuchfielderror_security_ssl_encrypt_enabled_is_displayed.rst @@ -0,0 +1,20 @@ +:original_name: mrs_03_1215.html + +.. _mrs_03_1215: + +How Do I Do If a Flink Job Fails to Execute and the Error Message "java.lang.NoSuchFieldError: SECURITY_SSL_ENCRYPT_ENABLED" Is Displayed? 
+========================================================================================================================================== + +Symptom +------- + +A Flink job fails to be executed and the following error message is displayed: + +.. code-block:: + + Caused by: java.lang.NoSuchFieldError: SECURITY_SSL_ENCRYPT_ENABLED + +Solution +-------- + +The third-party dependency package in the customer code conflicts with the cluster package. As a result, the job fails to be submitted to the MRS cluster. You need to modify the dependency package, set the scope of the open source Hadoop package and Flink package in the POM file to **provide**, and pack and execute the job again. diff --git a/umn/source/faq/job_development/how_do_i_do_if_a_sparkstreaming_job_fails_after_being_executed_dozens_of_hours_and_the_obs_access_403_error_is_reported.rst b/umn/source/faq/job_development/how_do_i_do_if_a_sparkstreaming_job_fails_after_being_executed_dozens_of_hours_and_the_obs_access_403_error_is_reported.rst new file mode 100644 index 0000000..81f4c44 --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_do_if_a_sparkstreaming_job_fails_after_being_executed_dozens_of_hours_and_the_obs_access_403_error_is_reported.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1176.html + +.. _mrs_03_1176: + +How Do I Do If a SparkStreaming Job Fails After Being Executed Dozens of Hours and the OBS Access 403 Error is Reported? +======================================================================================================================== + +When a user submits a job that needs to read and write OBS, the job submission program adds the temporary access key (AK) and secret key (SK) for accessing OBS by default. However, the temporary AK and SK have expiration time. + +If you want to run long-term jobs such as Flink and SparkStreaming, you can enter the AK and SK in **Service Parameter** to ensure that the jobs will not fail to be executed due to key expiration. diff --git a/umn/source/faq/job_development/how_do_i_do_if_an_alarm_is_reported_indicating_that_the_memory_is_insufficient_when_i_execute_a_sql_statement_on_the_clickhouse_client.rst b/umn/source/faq/job_development/how_do_i_do_if_an_alarm_is_reported_indicating_that_the_memory_is_insufficient_when_i_execute_a_sql_statement_on_the_clickhouse_client.rst new file mode 100644 index 0000000..d918b99 --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_do_if_an_alarm_is_reported_indicating_that_the_memory_is_insufficient_when_i_execute_a_sql_statement_on_the_clickhouse_client.rst @@ -0,0 +1,34 @@ +:original_name: mrs_03_1201.html + +.. _mrs_03_1201: + +How Do I Do If an Alarm Is Reported Indicating that the Memory Is Insufficient When I Execute a SQL Statement on the ClickHouse Client? +======================================================================================================================================= + +Symptom +------- + +The ClickHouse client restricts the memory used by GROUP BY statements. When a SQL statement is executed on the ClickHouse client, the following error information is displayed: + +.. code-block:: + + Progress: 1.83 billion rows, 85.31 GB (68.80 million rows/s., 3.21 GB/s.) 6%Received exception from server: + Code: 241. DB::Exception: Received from localhost:9000, 127.0.0.1. 
+ DB::Exception: Memory limit (for query) exceeded: would use 9.31 GiB (attempt to allocate chunk of 1048576 bytes), maximum: 9.31 GiB: + (while reading column hits): + +Solution +-------- + +- Run the following command before executing an SQL statement on condition that the cluster has sufficient memory: + + .. code-block:: + + SET max_memory_usage = 128000000000; #128G + +- If no sufficient memory is available, ClickHouse enables you to overflow data to disk to free up the memory: You are advised to set the value of **max_memory_usage** to twice the size of **max_bytes_before_external_group_by**. + + .. code-block:: + + set max_bytes_before_external_group_by=20000000000; #20G + set max_memory_usage=40000000000; #40G diff --git a/umn/source/faq/job_development/how_do_i_do_if_error_message_java.io.ioexception_connection_reset_by_peer_is_displayed_during_the_execution_of_a_spark_job.rst b/umn/source/faq/job_development/how_do_i_do_if_error_message_java.io.ioexception_connection_reset_by_peer_is_displayed_during_the_execution_of_a_spark_job.rst new file mode 100644 index 0000000..6f1967a --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_do_if_error_message_java.io.ioexception_connection_reset_by_peer_is_displayed_during_the_execution_of_a_spark_job.rst @@ -0,0 +1,16 @@ +:original_name: mrs_03_1205.html + +.. _mrs_03_1205: + +How Do I Do If Error Message "java.io.IOException: Connection reset by peer" Is Displayed During the Execution of a Spark Job? +============================================================================================================================== + +Symptom +------- + +The Spark job keeps running and error message "java.io.IOException: Connection reset by peer" is displayed. + +Solution +-------- + +Add the **executor.memory Overhead** parameter to the parameters for submitting a job. diff --git a/umn/source/faq/job_development/how_do_i_do_if_error_message_requestid=4971883851071737250_is_displayed_when_a_spark_job_accesses_obs.rst b/umn/source/faq/job_development/how_do_i_do_if_error_message_requestid=4971883851071737250_is_displayed_when_a_spark_job_accesses_obs.rst new file mode 100644 index 0000000..5a8e904 --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_do_if_error_message_requestid=4971883851071737250_is_displayed_when_a_spark_job_accesses_obs.rst @@ -0,0 +1,16 @@ +:original_name: mrs_03_1207.html + +.. _mrs_03_1207: + +How Do I Do If Error Message "requestId=4971883851071737250" Is Displayed When a Spark Job Accesses OBS? +======================================================================================================== + +Symptom +------- + +Error message "requestId=4971883851071737250" is displayed when a Spark job accesses OBS. + +Solution +-------- + +Log in to the node where the Spark client is located, go to the **conf** directory, and change the value of the **fs.obs.metrics.switch** parameter in the **core-site.xml** configuration file to **false**. diff --git a/umn/source/faq/job_development/how_do_i_do_if_the_error_message_slot_request_timeout_is_displayed_when_i_submit_a_flink_job.rst b/umn/source/faq/job_development/how_do_i_do_if_the_error_message_slot_request_timeout_is_displayed_when_i_submit_a_flink_job.rst new file mode 100644 index 0000000..6b04b76 --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_do_if_the_error_message_slot_request_timeout_is_displayed_when_i_submit_a_flink_job.rst @@ -0,0 +1,29 @@ +:original_name: mrs_03_1237.html + +.. 
_mrs_03_1237: + +How Do I Do If the Error Message "slot request timeout" Is Displayed When I Submit a Flink Job? +=============================================================================================== + +Symptom +------- + +When a Flink job is submitted, JobManager is started successfully. However, TaskManager remains in the starting state until timeout. The following error information is displayed: + +.. code-block:: + + org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources + +Possible Causes +--------------- + +#. The resources in the YARN queue are insufficient. As a result, TaskManager fails to start. +#. Your JAR files conflict with those in the environment. You can execute the WordCount program to determine whether the issue occurs. +#. If the cluster is in security mode, the SSL certificate of Flink may be incorrectly configured or has expired. + +Solution +-------- + +#. Add resources to the YARN queue. +#. Exclude the Flink and Hadoop dependencies in your JAR files so that Flink and Hadoop can depend only on the JAR files in the environment. +#. Reconfigure the SSL certificate of Flink.. diff --git a/umn/source/faq/job_development/how_do_i_do_if_the_flink_job_status_on_the_mrs_console_is_inconsistent_with_that_on_yarn.rst b/umn/source/faq/job_development/how_do_i_do_if_the_flink_job_status_on_the_mrs_console_is_inconsistent_with_that_on_yarn.rst new file mode 100644 index 0000000..4a9afab --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_do_if_the_flink_job_status_on_the_mrs_console_is_inconsistent_with_that_on_yarn.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1175.html + +.. _mrs_03_1175: + +How Do I Do If the Flink Job Status on the MRS Console Is Inconsistent with That on Yarn? +========================================================================================= + +To save storage space, the Yarn configuration item **yarn.resourcemanager.max-completed-applications** is modified to reduce the number of historical job records stored on Yarn. Flink jobs are long-term jobs. The realJob is still running on Yarn, but the launcherJob has been deleted. As a result, the launcherJob cannot be found on Yarn, and the job status fails to be updated. This problem is fixed in the 2.1.0.6 patch. + +Workaround: Terminate the job whose launcherJob cannot be found. The status of the job submitted later will be updated. diff --git a/umn/source/faq/job_development/how_do_i_do_if_the_launcher-job_queue_is_stopped_by_yarn_due_to_insufficient_heap_size_when_i_submit_a_flink_job_on_the_management_plane.rst b/umn/source/faq/job_development/how_do_i_do_if_the_launcher-job_queue_is_stopped_by_yarn_due_to_insufficient_heap_size_when_i_submit_a_flink_job_on_the_management_plane.rst new file mode 100644 index 0000000..30823a4 --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_do_if_the_launcher-job_queue_is_stopped_by_yarn_due_to_insufficient_heap_size_when_i_submit_a_flink_job_on_the_management_plane.rst @@ -0,0 +1,20 @@ +:original_name: mrs_03_1229.html + +.. _mrs_03_1229: + +How Do I Do If the launcher-job Queue Is Stopped by YARN due to Insufficient Heap Size When I Submit a Flink Job on the Management Plane? 
+========================================================================================================================================= + +Symptom +------- + +The launcher-job queue is stopped by YARN when a Flink job is submitted on the management plane. + +Solution +-------- + +Increase the heap size of the launcher-job queue. + +#. Log in to the active OMS node as user **omm**. +#. Change the value of **job.launcher.resource.memory.mb** in **/opt/executor/webapps/executor/WEB-INF/classes/servicebroker.xml** to **2048**. +#. Run the **sh /opt/executor/bin/restart-executor.sh** command to restart the executor process. diff --git a/umn/source/faq/job_development/how_do_i_do_if_the_message_the_current_user_does_not_exist_on_mrs_manager._grant_the_user_sufficient_permissions_on_iam_and_then_perform_iam_user_synchronization_on_the_dashboard_tab_page._is_displayed.rst b/umn/source/faq/job_development/how_do_i_do_if_the_message_the_current_user_does_not_exist_on_mrs_manager._grant_the_user_sufficient_permissions_on_iam_and_then_perform_iam_user_synchronization_on_the_dashboard_tab_page._is_displayed.rst new file mode 100644 index 0000000..88e97a1 --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_do_if_the_message_the_current_user_does_not_exist_on_mrs_manager._grant_the_user_sufficient_permissions_on_iam_and_then_perform_iam_user_synchronization_on_the_dashboard_tab_page._is_displayed.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1173.html + +.. _mrs_03_1173: + +How Do I Do If the Message "The current user does not exist on MRS Manager. Grant the user sufficient permissions on IAM and then perform IAM user synchronization on the Dashboard tab page." Is Displayed? +============================================================================================================================================================================================================ + +If IAM synchronization is not performed when a job is submitted in a security cluster, the error message "The current user does not exist on MRS Manager. Grant the user sufficient permissions on IAM and then perform IAM user synchronization on the Dashboard tab page." is displayed. + +Before submitting a job, on the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users. diff --git a/umn/source/faq/job_development/how_do_i_do_if_the_spark_job_error_unknownscannerexeception_is_reported.rst b/umn/source/faq/job_development/how_do_i_do_if_the_spark_job_error_unknownscannerexeception_is_reported.rst new file mode 100644 index 0000000..862c882 --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_do_if_the_spark_job_error_unknownscannerexeception_is_reported.rst @@ -0,0 +1,18 @@ +:original_name: mrs_03_1257.html + +.. _mrs_03_1257: + +How Do I Do If the Spark Job Error "UnknownScannerExeception" Is Reported? +========================================================================== + +Symptom +------- + +Spark jobs run slowly. Warning information is printed in run logs, and the error cause is **UnknownScannerExeception**. + +Solution +-------- + +Before running a Spark job, adjust the value of **hbase.client.scanner.timeout.period** (for example, from 60 seconds to 120 seconds). + +Log in to FusionInsight Manager and choose **Cluster** > **Services** > **HBase**. Click **Configurations** then **All Configurations**, search for **hbase.client.scanner.timeout.period**, and change its value to **120000** (unit: ms). 
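After adjusting **hbase.client.scanner.timeout.period** on FusionInsight Manager, it can be useful to confirm the value actually used by the client that runs the Spark job. The following is a minimal check sketch: it assumes the cluster client is installed in **/opt/client** (the HBase configuration directory follows the layout listed in "How Do I View the Configuration File Directory of Each Component?") and that the client configuration has been updated after the change.

.. code-block::

   # Load the client environment variables (the client path is an assumption).
   source /opt/client/bigdata_env
   # In a Kerberos-enabled cluster, authenticate first, for example: kinit sparkuser
   # Print the scanner timeout currently configured on the client side.
   grep -A 1 "hbase.client.scanner.timeout.period" /opt/client/HBase/hbase/conf/hbase-site.xml

If the property is not present in the client's **hbase-site.xml**, the client is still using the default value.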
diff --git a/umn/source/faq/job_development/how_do_i_get_my_data_into_obs_or_hdfs.rst b/umn/source/faq/job_development/how_do_i_get_my_data_into_obs_or_hdfs.rst new file mode 100644 index 0000000..41dcf4d --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_get_my_data_into_obs_or_hdfs.rst @@ -0,0 +1,43 @@ +:original_name: mrs_03_1015.html + +.. _mrs_03_1015: + +How Do I Get My Data into OBS or HDFS? +====================================== + +MRS can process data in OBS and HDFS. You can get your data into OBS or HDFS as follows: + +#. Upload local data to OBS. + + a. Log in to the OBS console. + b. Create a parallel file system named **userdata** on OBS and create the **program**, **input**, **output**, and **log** folders in the file system. + + #. Choose **Parallel File System** > **Create Parallel File System**, and create a file system named **userdata**. + #. In the OBS file system list, click the file system name **userdata**, choose **Files** > **Create Folder**, and create the **program**, **input**, **output**, and **log** folders. + + c. Upload data to the **userdata** file system. + + #. Go to the **program** folder and click **Upload File**. + #. Click **add file** and select a user program. + #. Click **Upload**. + #. Upload the user data file to the **input** directory using the same method. + +#. Import OBS data to HDFS. + + You can import OBS data to HDFS only when **Kerberos Authentication** is disabled and the cluster is running. + + a. Log in to the MRS console. + + b. Click the name of the cluster. + + c. On the page displayed, select the **Files** tab page and click **HDFS File List**. + + d. Select a data directory, for example, **bd_app1**. + + The **bd_app1** directory is only an example. You can use any directory on the page or create a new one. + + e. Click **Import Data** and click **Browse** to select an OBS path and an HDFS path. + + f. Click **OK**. + + You can view the file upload progress on the **File Operation Records** tab page. diff --git a/umn/source/faq/job_development/how_do_i_modify_the_hdfs_namespace_fs.defaultfs_of_an_existing_cluster.rst b/umn/source/faq/job_development/how_do_i_modify_the_hdfs_namespace_fs.defaultfs_of_an_existing_cluster.rst new file mode 100644 index 0000000..16086ae --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_modify_the_hdfs_namespace_fs.defaultfs_of_an_existing_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1224.html + +.. _mrs_03_1224: + +How Do I Modify the HDFS NameSpace (fs.defaultFS) of an Existing Cluster? +========================================================================= + +You can modify or add the HDFS NameSpace (fs.defaultFS) of the cluster by modifying the **core-site.xml** and **hdfs-site.xml** files on the client. However, you are not advised to perform this operation on the server. diff --git a/umn/source/faq/job_development/how_do_i_view_mrs_job_logs.rst b/umn/source/faq/job_development/how_do_i_view_mrs_job_logs.rst new file mode 100644 index 0000000..fd2b4f0 --- /dev/null +++ b/umn/source/faq/job_development/how_do_i_view_mrs_job_logs.rst @@ -0,0 +1,25 @@ +:original_name: mrs_03_1172.html + +.. _mrs_03_1172: + +How Do I View MRS Job Logs? +=========================== + +#. .. _mrs_03_1172__li943507288: + + On the **Jobs** page of the MRS console, you can view logs of each job, including launcherJob and realJob logs. 
+ + - Generally, error logs are printed in **stderr** and **stdout** for launcherJob jobs, as shown in the following figure: + + |image1| + + - You can view realJob logs on the ResourceManager web UI provided by the Yarn service on MRS Manager. + + |image2| + +#. Log in to the Master node of the cluster to obtain the job log files in :ref:`1 `. The HDFS path is **/tmp/logs/**\ *{submit_user}*\ **/logs/**\ *{application_id}*. + +#. After the job is submitted, if the job application ID cannot be found on the Yarn web UI, the job fails to be submitted. You can log in to the active Master node of the cluster and view the job submission process log **/var/log/executor/logs/exe.log**. + +.. |image1| image:: /_static/images/en-us_image_0000001151963015.png +.. |image2| image:: /_static/images/en-us_image_0000001438033333.png diff --git a/umn/source/faq/job_development/index.rst b/umn/source/faq/job_development/index.rst new file mode 100644 index 0000000..76adfc8 --- /dev/null +++ b/umn/source/faq/job_development/index.rst @@ -0,0 +1,52 @@ +:original_name: mrs_03_2009.html + +.. _mrs_03_2009: + +Job Development +=============== + +- :ref:`How Do I Get My Data into OBS or HDFS? ` +- :ref:`What Types of Spark Jobs Can Be Submitted in a Cluster? ` +- :ref:`Can I Run Multiple Spark Tasks at the Same Time After the Minimum Tenant Resources of an MRS Cluster Is Changed to 0? ` +- :ref:`What Are the Differences Between the Client Mode and Cluster Mode of Spark Jobs? ` +- :ref:`How Do I View MRS Job Logs? ` +- :ref:`How Do I Do If the Message "The current user does not exist on MRS Manager. Grant the user sufficient permissions on IAM and then perform IAM user synchronization on the Dashboard tab page." Is Displayed? ` +- :ref:`LauncherJob Job Execution Is Failed And the Error Message "jobPropertiesMap is null." Is Displayed ` +- :ref:`How Do I Do If the Flink Job Status on the MRS Console Is Inconsistent with That on Yarn? ` +- :ref:`How Do I Do If a SparkStreaming Job Fails After Being Executed Dozens of Hours and the OBS Access 403 Error is Reported? ` +- :ref:`How Do I Do If an Alarm Is Reported Indicating that the Memory Is Insufficient When I Execute a SQL Statement on the ClickHouse Client? ` +- :ref:`How Do I Do If Error Message "java.io.IOException: Connection reset by peer" Is Displayed During the Execution of a Spark Job? ` +- :ref:`How Do I Do If Error Message "requestId=4971883851071737250" Is Displayed When a Spark Job Accesses OBS? ` +- :ref:`How Do I Do If the Spark Job Error "UnknownScannerExeception" Is Reported? ` +- :ref:`Why DataArtsStudio Occasionally Fail to Schedule Spark Jobs and the Rescheduling also Fails? ` +- :ref:`How Do I Do If a Flink Job Fails to Execute and the Error Message "java.lang.NoSuchFieldError: SECURITY_SSL_ENCRYPT_ENABLED" Is Displayed? ` +- :ref:`Why Submitted Yarn Job Cannot Be Viewed on the Web UI? ` +- :ref:`How Do I Modify the HDFS NameSpace (fs.defaultFS) of an Existing Cluster? ` +- :ref:`How Do I Do If the launcher-job Queue Is Stopped by YARN due to Insufficient Heap Size When I Submit a Flink Job on the Management Plane? ` +- :ref:`How Do I Do If the Error Message "slot request timeout" Is Displayed When I Submit a Flink Job? ` +- :ref:`Data Import and Export of DistCP Jobs ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + how_do_i_get_my_data_into_obs_or_hdfs + what_types_of_spark_jobs_can_be_submitted_in_a_cluster + can_i_run_multiple_spark_tasks_at_the_same_time_after_the_minimum_tenant_resources_of_an_mrs_cluster_is_changed_to_0 + what_are_the_differences_between_the_client_mode_and_cluster_mode_of_spark_jobs + how_do_i_view_mrs_job_logs + how_do_i_do_if_the_message_the_current_user_does_not_exist_on_mrs_manager._grant_the_user_sufficient_permissions_on_iam_and_then_perform_iam_user_synchronization_on_the_dashboard_tab_page._is_displayed + launcherjob_job_execution_is_failed_and_the_error_message_jobpropertiesmap_is_null._is_displayed + how_do_i_do_if_the_flink_job_status_on_the_mrs_console_is_inconsistent_with_that_on_yarn + how_do_i_do_if_a_sparkstreaming_job_fails_after_being_executed_dozens_of_hours_and_the_obs_access_403_error_is_reported + how_do_i_do_if_an_alarm_is_reported_indicating_that_the_memory_is_insufficient_when_i_execute_a_sql_statement_on_the_clickhouse_client + how_do_i_do_if_error_message_java.io.ioexception_connection_reset_by_peer_is_displayed_during_the_execution_of_a_spark_job + how_do_i_do_if_error_message_requestid=4971883851071737250_is_displayed_when_a_spark_job_accesses_obs + how_do_i_do_if_the_spark_job_error_unknownscannerexeception_is_reported + why_dataartsstudio_occasionally_fail_to_schedule_spark_jobs_and_the_rescheduling_also_fails + how_do_i_do_if_a_flink_job_fails_to_execute_and_the_error_message_java.lang.nosuchfielderror_security_ssl_encrypt_enabled_is_displayed + why_submitted_yarn_job_cannot_be_viewed_on_the_web_ui + how_do_i_modify_the_hdfs_namespace_fs.defaultfs_of_an_existing_cluster + how_do_i_do_if_the_launcher-job_queue_is_stopped_by_yarn_due_to_insufficient_heap_size_when_i_submit_a_flink_job_on_the_management_plane + how_do_i_do_if_the_error_message_slot_request_timeout_is_displayed_when_i_submit_a_flink_job + data_import_and_export_of_distcp_jobs diff --git a/umn/source/faq/job_development/launcherjob_job_execution_is_failed_and_the_error_message_jobpropertiesmap_is_null._is_displayed.rst b/umn/source/faq/job_development/launcherjob_job_execution_is_failed_and_the_error_message_jobpropertiesmap_is_null._is_displayed.rst new file mode 100644 index 0000000..ec754bb --- /dev/null +++ b/umn/source/faq/job_development/launcherjob_job_execution_is_failed_and_the_error_message_jobpropertiesmap_is_null._is_displayed.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1174.html + +.. _mrs_03_1174: + +LauncherJob Job Execution Is Failed And the Error Message "jobPropertiesMap is null." Is Displayed +================================================================================================== + +The cause of the launcherJob failure is that the user who submits the job does not have the write permission on the **hdfs /mrs/job-properties** directory. + +This problem is fixed in the 2.1.0.6 patch. You can also grant the write permission on the **/mrs/job-properties** directory to the synchronized user who submits the job on MRS Manager. diff --git a/umn/source/faq/job_development/what_are_the_differences_between_the_client_mode_and_cluster_mode_of_spark_jobs.rst b/umn/source/faq/job_development/what_are_the_differences_between_the_client_mode_and_cluster_mode_of_spark_jobs.rst new file mode 100644 index 0000000..6c5ede4 --- /dev/null +++ b/umn/source/faq/job_development/what_are_the_differences_between_the_client_mode_and_cluster_mode_of_spark_jobs.rst @@ -0,0 +1,14 @@ +:original_name: en-us_topic_0000001145356345.html + +.. 
_en-us_topic_0000001145356345: + +What Are the Differences Between the Client Mode and Cluster Mode of Spark Jobs? +================================================================================ + +You need to understand the concept ApplicationMaster before understanding the essential differences between Yarn-client and Yarn-cluster. + +In Yarn, each application instance has an ApplicationMaster process, which is the first container started by the application. It interacts with ResourceManager and requests resources. After obtaining resources, it instructs NodeManager to start containers. The essential difference between the Yarn-cluster and Yarn-client modes lies in the ApplicationMaster process. + +In Yarn-cluster mode, Driver runs in ApplicationMaster, which requests resources from Yarn and monitors the running status of a job. After a user submits a job, the client can be stopped and the job continues running on Yarn. Therefore, the Yarn-cluster mode is not suitable for running interactive jobs. + +In Yarn-client mode, ApplicationMaster requests only Executor from Yarn. The client communicates with the requested containers to schedule tasks. Therefore, the client cannot be stopped. diff --git a/umn/source/faq/job_development/what_types_of_spark_jobs_can_be_submitted_in_a_cluster.rst b/umn/source/faq/job_development/what_types_of_spark_jobs_can_be_submitted_in_a_cluster.rst new file mode 100644 index 0000000..8df8974 --- /dev/null +++ b/umn/source/faq/job_development/what_types_of_spark_jobs_can_be_submitted_in_a_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1050.html + +.. _mrs_03_1050: + +What Types of Spark Jobs Can Be Submitted in a Cluster? +======================================================= + +MRS clusters support Spark jobs submitted in Spark, Spark Script, or Spark SQL mode. diff --git a/umn/source/faq/job_development/why_dataartsstudio_occasionally_fail_to_schedule_spark_jobs_and_the_rescheduling_also_fails.rst b/umn/source/faq/job_development/why_dataartsstudio_occasionally_fail_to_schedule_spark_jobs_and_the_rescheduling_also_fails.rst new file mode 100644 index 0000000..cb10ca5 --- /dev/null +++ b/umn/source/faq/job_development/why_dataartsstudio_occasionally_fail_to_schedule_spark_jobs_and_the_rescheduling_also_fails.rst @@ -0,0 +1,20 @@ +:original_name: mrs_03_1208.html + +.. _mrs_03_1208: + +Why DataArtsStudio Occasionally Fail to Schedule Spark Jobs and the Rescheduling also Fails? +============================================================================================ + +Symptom +------- + +DataArtsStudio occasionally fails to schedule Spark jobs and the rescheduling also fails. The following error information is displayed: + +.. code-block:: + + Caused by: org.apache.spark.SparkException: Application application_1619511926396_2586346 finished with failed status + +Solution +-------- + +Log in to the node where the Spark client is located as user **root** and increase the value of the **spark.driver.memory** parameter in the **spark-defaults.conf** file. diff --git a/umn/source/faq/job_development/why_submitted_yarn_job_cannot_be_viewed_on_the_web_ui.rst b/umn/source/faq/job_development/why_submitted_yarn_job_cannot_be_viewed_on_the_web_ui.rst new file mode 100644 index 0000000..63a78f2 --- /dev/null +++ b/umn/source/faq/job_development/why_submitted_yarn_job_cannot_be_viewed_on_the_web_ui.rst @@ -0,0 +1,11 @@ +:original_name: mrs_03_1223.html + +.. _mrs_03_1223: + +Why Submitted Yarn Job Cannot Be Viewed on the Web UI? 
+====================================================== + +After a Yarn job is created, it cannot be viewed if you log in to the web UI as the **admin** user. + +- The **admin** user is a user on the cluster management page. Check whether the user has the **supergroup** permission. Generally, only the user with the **supergroup** permission can view jobs. +- Log in to Yarn as the user who submits jobs to view jobs on Yarn. Do not view the jobs using the **admin** user. diff --git a/umn/source/faq/kerberos_usage/how_do_i_access_hive_in_a_cluster_with_kerberos_authentication_enabled.rst b/umn/source/faq/kerberos_usage/how_do_i_access_hive_in_a_cluster_with_kerberos_authentication_enabled.rst new file mode 100644 index 0000000..f85172e --- /dev/null +++ b/umn/source/faq/kerberos_usage/how_do_i_access_hive_in_a_cluster_with_kerberos_authentication_enabled.rst @@ -0,0 +1,30 @@ +:original_name: mrs_03_1152.html + +.. _mrs_03_1152: + +How Do I Access Hive in a Cluster with Kerberos Authentication Enabled? +======================================================================= + +#. Log in to the master node in the cluster as user **root**. + +#. Run the following command to configure environment variables: + + **source /opt/client/bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user: + + **kinit** *MRS cluster user* + + Example: **kinit hiveuser** + + The current user must have the permission to create Hive tables. + +#. Run the client command of the Hive component: + + **beeline** + +#. Run the Hive command in Beeline, for example: + + **create table test_obs(a int, b string) row format delimited fields terminated by "," stored as textfile location "obs://test_obs";** + +#. Press **Ctrl+C** to exit Hive Beeline. diff --git a/umn/source/faq/kerberos_usage/how_do_i_access_presto_in_a_cluster_with_kerberos_authentication_enabled.rst b/umn/source/faq/kerberos_usage/how_do_i_access_presto_in_a_cluster_with_kerberos_authentication_enabled.rst new file mode 100644 index 0000000..f41db22 --- /dev/null +++ b/umn/source/faq/kerberos_usage/how_do_i_access_presto_in_a_cluster_with_kerberos_authentication_enabled.rst @@ -0,0 +1,65 @@ +:original_name: mrs_03_1153.html + +.. _mrs_03_1153: + +How Do I Access Presto in a Cluster with Kerberos Authentication Enabled? +========================================================================= + +#. Log in to the master node in the cluster as user **root**. + +#. Run the following command to configure environment variables: + + **source /opt/client/bigdata_env** + +#. Access Presto in a cluster with Kerberos authentication enabled. + + a. .. _mrs_03_1153__li251015403210: + + Log in to MRS Manager and create a role with the **Hive Admin Privilege** permission, for example, **prestorerole**. + + b. .. _mrs_03_1153__li55542531841: + + Create a user, for example, **presto001**, who belongs to the **Presto** and **Hive** groups, and bind the user to the role created in :ref:`3.a `. + + c. Authenticate user **presto001**. + + **kinit presto001** + + d. Download the user authentication credential. + + - Operations on MRS Manager: Log in to MRS Manager and choose **System** > **Manage User**. Locate the row containing the new user, click **More**, and select **Download authentication credential**. + + - Operations on FusionInsight Manager: + + Log in to FusionInsight Manager, choose **System** > **Permission** > **User**.
On the displayed page, locate the row that contains the user, choose **More** > **Download Authentication Credential**. + + e. .. _mrs_03_1153__li65281811161910: + + Decompress the downloaded user credential file, and save the obtained **krb5.conf** and **user.keytab** files to the client directory, for example, /opt\ **/client/Presto/**. + + f. .. _mrs_03_1153__li165280118198: + + Run the following command to obtain the user principal: + + **klist -kt /opt/client/Presto/user.keytab** + + g. Run the following command to connect to the Presto Server of the cluster: + + **presto_cli.sh --krb5-config-path** *{krb5.conf file path}* **--krb5-principal** *{User's principal}* **--krb5-keytab-path** *{user.keytab file path}* **--user** *{presto username}* + + - **krb5.conf** *file path*: file path set in :ref:`3.e `, for example, **/opt/client/Presto/krb5.conf**. + - **user.keytab** *file path*: file path set in :ref:`3.e `, for example, **/opt/client/Presto/user.keytab**. + - *User's principal*: principal obtained in :ref:`3.f `. + - *presto username*: user created in :ref:`3.b `, for example, **presto001**. + + Example: **presto_cli.sh --krb5-config-path /opt/client/Presto/krb5.conf --krb5-principal prest001@xxx_xxx_xxx_xxx.COM --krb5-keytab-path /opt/client/Presto/user.keytab --user presto001** + + h. On the Presto client, run the following statement to create a schema: + + **CREATE SCHEMA hive.demo01 WITH (location = 'obs://presto-demo002/');** + + i. Create a table in the schema. The table data is stored in the OBS bucket, as shown in the following example: + + **CREATE TABLE hive.demo01.demo_table WITH (format = 'ORC') AS SELECT \* FROM tpch.sf1.customer;** + + j. Run **exit** to exit the client. diff --git a/umn/source/faq/kerberos_usage/how_do_i_access_spark_in_a_cluster_with_kerberos_authentication_enabled.rst b/umn/source/faq/kerberos_usage/how_do_i_access_spark_in_a_cluster_with_kerberos_authentication_enabled.rst new file mode 100644 index 0000000..02f3089 --- /dev/null +++ b/umn/source/faq/kerberos_usage/how_do_i_access_spark_in_a_cluster_with_kerberos_authentication_enabled.rst @@ -0,0 +1,36 @@ +:original_name: en-us_topic_0000001152828323.html + +.. _en-us_topic_0000001152828323: + +How Do I Access Spark in a Cluster with Kerberos Authentication Enabled? +======================================================================== + +#. Log in to the master node in the cluster as user **root**. + +#. Run the following command to configure environment variables: + + **source /opt/client/bigdata_env** + +#. If the Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. + + **kinit** *MRS cluster user* + + Example: + + If the development user is a machine-machine user, run **kinit -kt user.keytab sparkuser**. + + If the development user is a human-machine user, run **kinit sparkuser**. + +#. Run the following command to connect to Spark Beeline: + + **spark-beeline** + +#. Run commands on Spark Beeline. For example, create the table **test** in the **obs://mrs-word001/table/** directory. + + **create table test(id int) location 'obs://mrs-word001/table/';** + +#. Run the following command to query all tables. If table **test** is displayed in the command output, OBS access is successful. + + **show tables;** + +#. Press **Ctrl+C** to exit Spark Beeline. 
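+
+The following is a minimal end-to-end example that puts the preceding steps together. It is a sketch only: it assumes that the client is installed in **/opt/client**, that a machine-machine user **sparkuser** is used, and that the **user.keytab** file of this user is in the current directory.
+
+.. code-block::
+
+   source /opt/client/bigdata_env
+   kinit -kt user.keytab sparkuser
+   spark-beeline
+
+After Spark Beeline starts, run the **create table** and **show tables** statements from the preceding steps to verify that OBS can be accessed.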
diff --git a/umn/source/faq/kerberos_usage/how_do_i_change_the_kerberos_authentication_status_of_a_created_mrs_cluster.rst b/umn/source/faq/kerberos_usage/how_do_i_change_the_kerberos_authentication_status_of_a_created_mrs_cluster.rst new file mode 100644 index 0000000..c52cf6a --- /dev/null +++ b/umn/source/faq/kerberos_usage/how_do_i_change_the_kerberos_authentication_status_of_a_created_mrs_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1038.html + +.. _mrs_03_1038: + +How Do I Change the Kerberos Authentication Status of a Created MRS Cluster? +============================================================================ + +You cannot change the Kerberos service after an MRS cluster is created. diff --git a/umn/source/faq/kerberos_usage/how_do_i_deploy_the_kerberos_service_in_a_running_cluster.rst b/umn/source/faq/kerberos_usage/how_do_i_deploy_the_kerberos_service_in_a_running_cluster.rst new file mode 100644 index 0000000..eb81672 --- /dev/null +++ b/umn/source/faq/kerberos_usage/how_do_i_deploy_the_kerberos_service_in_a_running_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1148.html + +.. _mrs_03_1148: + +How Do I Deploy the Kerberos Service in a Running Cluster? +========================================================== + +The MRS cluster does not support customized Kerberos installation and deployment, and the Kerberos authentication cannot be set up between components. To enable Kerberos authentication, you need to create a cluster with Kerberos enabled and migrate data. diff --git a/umn/source/faq/kerberos_usage/how_do_i_prevent_kerberos_authentication_expiration.rst b/umn/source/faq/kerberos_usage/how_do_i_prevent_kerberos_authentication_expiration.rst new file mode 100644 index 0000000..69190d1 --- /dev/null +++ b/umn/source/faq/kerberos_usage/how_do_i_prevent_kerberos_authentication_expiration.rst @@ -0,0 +1,40 @@ +:original_name: mrs_03_1167.html + +.. _mrs_03_1167: + +How Do I Prevent Kerberos Authentication Expiration? +==================================================== + +- Java applications: + + Before connecting to HBase, HDFS, or other big data components, call loginUserFromKeytab() to create a UGI. Then, start a scheduled thread to periodically check whether the Kerberos Authentication expires. Log in to the system again before the Kerberos Authentication expires. + + .. code-block:: + + private static void startCheckKeytabTgtAndReloginJob() { + //The credential is checked every 10 minutes, and updated before the expiration time. + ThreadPool.updateConfigThread.scheduleWithFixedDelay(() -> { + try { + UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab(); + logger.warn("get tgt:{}", UserGroupInformation.getLoginUser().getTGT()); + logger.warn("Check Kerberos Tgt And Relogin From Keytab Finish."); + } catch (IOException e) { + logger.error("Check Kerberos Tgt And Relogin From Keytab Error", e); + } + }, 0, 10, TimeUnit.MINUTES); + logger.warn("Start Check Keytab TGT And Relogin Job Success."); + } + +- Tasks executed in shell mode: + + #. Run the **kinit** command to authenticate the user. + #. Create a scheduled task of the operating system or any other scheduled task to run the **kinit** command to authenticate the user periodically. + #. Submit jobs to execute big data tasks. 
+ +- Spark jobs: + + If you submit jobs using spark-shell, spark-submit, or spark-sql, you can specify **Keytab** and **Principal** in the command to perform authentication and periodically update the login credential and authorization tokens to prevent authentication expiration. + + Example: + + **spark-shell --principal spark2x/hadoop.**\ <*System domain name*>@<*System domain name*>\ ** --keytab ${BIGDATA_HOME}/FusionInsight_Spark2x_8.1.0.1/install/FusionInsight-Spark2x-2.4.5/keytab/spark2x/SparkResource/spark2x.keytab --master yarn** diff --git a/umn/source/faq/kerberos_usage/index.rst b/umn/source/faq/kerberos_usage/index.rst new file mode 100644 index 0000000..e2f5582 --- /dev/null +++ b/umn/source/faq/kerberos_usage/index.rst @@ -0,0 +1,26 @@ +:original_name: mrs_03_2018.html + +.. _mrs_03_2018: + +Kerberos Usage +============== + +- :ref:`How Do I Change the Kerberos Authentication Status of a Created MRS Cluster? ` +- :ref:`What Are the Ports of the Kerberos Authentication Service? ` +- :ref:`How Do I Deploy the Kerberos Service in a Running Cluster? ` +- :ref:`How Do I Access Hive in a Cluster with Kerberos Authentication Enabled? ` +- :ref:`How Do I Access Presto in a Cluster with Kerberos Authentication Enabled? ` +- :ref:`How Do I Access Spark in a Cluster with Kerberos Authentication Enabled? ` +- :ref:`How Do I Prevent Kerberos Authentication Expiration? ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + how_do_i_change_the_kerberos_authentication_status_of_a_created_mrs_cluster + what_are_the_ports_of_the_kerberos_authentication_service + how_do_i_deploy_the_kerberos_service_in_a_running_cluster + how_do_i_access_hive_in_a_cluster_with_kerberos_authentication_enabled + how_do_i_access_presto_in_a_cluster_with_kerberos_authentication_enabled + how_do_i_access_spark_in_a_cluster_with_kerberos_authentication_enabled + how_do_i_prevent_kerberos_authentication_expiration diff --git a/umn/source/faq/kerberos_usage/what_are_the_ports_of_the_kerberos_authentication_service.rst b/umn/source/faq/kerberos_usage/what_are_the_ports_of_the_kerberos_authentication_service.rst new file mode 100644 index 0000000..e2b65ca --- /dev/null +++ b/umn/source/faq/kerberos_usage/what_are_the_ports_of_the_kerberos_authentication_service.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1131.html + +.. _mrs_03_1131: + +What Are the Ports of the Kerberos Authentication Service? +========================================================== + +The Kerberos authentication service uses ports 21730 (TCP), 21731 (TCP/UDP), and 21732 (TCP/UDP). diff --git a/umn/source/faq/metadata_management/index.rst b/umn/source/faq/metadata_management/index.rst new file mode 100644 index 0000000..7180f1e --- /dev/null +++ b/umn/source/faq/metadata_management/index.rst @@ -0,0 +1,14 @@ +:original_name: mrs_03_2019.html + +.. _mrs_03_2019: + +Metadata Management +=================== + +- :ref:`Where Can I View Hive Metadata? ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + where_can_i_view_hive_metadata diff --git a/umn/source/faq/metadata_management/where_can_i_view_hive_metadata.rst b/umn/source/faq/metadata_management/where_can_i_view_hive_metadata.rst new file mode 100644 index 0000000..c769dda --- /dev/null +++ b/umn/source/faq/metadata_management/where_can_i_view_hive_metadata.rst @@ -0,0 +1,23 @@ +:original_name: mrs_03_1119.html + +.. _mrs_03_1119: + +Where Can I View Hive Metadata? 
+=============================== + +- If Hive metadata is stored in GaussDB of an MRS cluster, log in to the master DBServer node of the cluster, switch to user **omm**, and run the **gsql -p 20051 -U {USER} -W {PASSWD} -d hivemeta** command to view the metadata. +- If Hive metadata is stored in an external relational database, perform the following steps: + + #. On the cluster **Dashboard** page, click **Manage** on the right of **Data Connection**. + + #. .. _mrs_03_1119__li1120232084010: + + On the displayed page, obtain the value of **Data Connection ID**. + + #. On the MRS console, click **Data Connections**. + + #. In the data connection list, locate the data connection based on the data connection ID obtained in :ref:`2 `. + + #. Click **Edit** in the **Operation** column of the data connection. + + The **RDS Instance** and **Database** indicate the relational database in which the Hive metadata is stored. diff --git a/umn/source/faq/mrs_overview/can_i_change_the_ip_address_of_dbservice.rst b/umn/source/faq/mrs_overview/can_i_change_the_ip_address_of_dbservice.rst new file mode 100644 index 0000000..2309567 --- /dev/null +++ b/umn/source/faq/mrs_overview/can_i_change_the_ip_address_of_dbservice.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1137.html + +.. _mrs_03_1137: + +Can I Change the IP address of DBService? +========================================= + +MRS does not support the change of the DBService IP address. diff --git a/umn/source/faq/mrs_overview/can_i_clear_mrs_sudo_logs.rst b/umn/source/faq/mrs_overview/can_i_clear_mrs_sudo_logs.rst new file mode 100644 index 0000000..a2d11ee --- /dev/null +++ b/umn/source/faq/mrs_overview/can_i_clear_mrs_sudo_logs.rst @@ -0,0 +1,14 @@ +:original_name: mrs_03_1155.html + +.. _mrs_03_1155: + +Can I Clear MRS sudo Logs? +========================== + +MRS sudo log files record operations performed by user **omm** and are helpful for fault locating. You can delete the logs of the earliest date to release storage space. + +#. If the log file is large, add the log file directory to **/etc/logrotate.d/syslog** to enable the system to periodically delete logs. + + Method: Run **sed -i '3 a/var/log/sudo/sudo.log' /etc/logrotate.d/syslog**. + +#. Set the maximum number and size of logs in **/etc/logrotate.d/syslog**. If the number or size of logs exceeds the threshold, the logs will be automatically deleted. By default, logs are aged based on the size and number of archived logs. You can use **size** and **rotate** to limit the size and number of archived logs, respectively. If required, you can also add **daily**/**weekly**/**monthly** to specify how often the logs are cleared. diff --git a/umn/source/faq/mrs_overview/can_i_configure_a_phoenix_connection_pool.rst b/umn/source/faq/mrs_overview/can_i_configure_a_phoenix_connection_pool.rst new file mode 100644 index 0000000..eac490d --- /dev/null +++ b/umn/source/faq/mrs_overview/can_i_configure_a_phoenix_connection_pool.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1105.html + +.. _mrs_03_1105: + +Can I Configure a Phoenix Connection Pool? +========================================== + +Phoenix does not support connection pool configuration. You are advised to write code to implement a tool class for managing connections and simulate a connection pool. 
diff --git a/umn/source/faq/mrs_overview/can_i_downgrade_the_specifications_of_an_mrs_cluster_node.rst b/umn/source/faq/mrs_overview/can_i_downgrade_the_specifications_of_an_mrs_cluster_node.rst new file mode 100644 index 0000000..7cd7c29 --- /dev/null +++ b/umn/source/faq/mrs_overview/can_i_downgrade_the_specifications_of_an_mrs_cluster_node.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1125.html + +.. _mrs_03_1125: + +Can I Downgrade the Specifications of an MRS Cluster Node? +========================================================== + +You cannot downgrade the specifications of an MRS cluster node by using the console. If you want to downgrade an MRS cluster node's specifications, contact technical support. diff --git a/umn/source/faq/mrs_overview/differences_and_relationships_between_the_mrs_management_console_and_cluster_manager.rst b/umn/source/faq/mrs_overview/differences_and_relationships_between_the_mrs_management_console_and_cluster_manager.rst new file mode 100644 index 0000000..82967a4 --- /dev/null +++ b/umn/source/faq/mrs_overview/differences_and_relationships_between_the_mrs_management_console_and_cluster_manager.rst @@ -0,0 +1,53 @@ +:original_name: mrs_03_1233.html + +.. _mrs_03_1233: + +Differences and Relationships Between the MRS Management Console and Cluster Manager +==================================================================================== + +You can access Manager from the MRS management console. + +Manager is available in two forms: MRS Manager and FusionInsight Manager. + +- MRS Manager is the cluster management interface for clusters of MRS 2.\ *x* or earlier. +- FusionInsight Manager is the cluster management interface for clusters of MRS 3.\ *x* or later. + +The following table lists the differences and relationships between the management console and FusionInsight Manager.
+ ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Common Operation | MRS Console | FusionInsight Manager | ++============================================================================================================================+===============+=======================+ +| Changing subnets, adding security group rules, controlling OBS permissions, managing agencies, and synchronizing IAM users | Supported | Not supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Adding node groups, scaling out, scaling in, and upgrading specifications | Supported | Not supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Isolating hosts, starting all roles, and stopping all roles | Supported | Supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Downloading the client, starting services, stopping services, and perform rolling restart of services | Supported | Supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Viewing the instance status of services, configuring parameters, and synchronizing configurations | Supported | Supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Viewing cleared alarms and events | Supported | Supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Viewing the alarm help | Not supported | Supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Setting thresholds | Not supported | Supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Adding message subscription specifications | Supported | Not supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Managing files | Supported | Not supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Managing jobs | Supported | Not supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Managing tenants | Supported | Supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Managing tags | Supported | Not supported | 
++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Managing permissions (adding and deleting users, user groups, and roles, and changing passwords) | Not supported | Supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Performing backup and restoration | Not supported | Supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Auditing | Not supported | Supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ +| Monitoring resources and logging | Supported | Supported | ++----------------------------------------------------------------------------------------------------------------------------+---------------+-----------------------+ diff --git a/umn/source/faq/mrs_overview/does_an_mrs_cluster_support_hive_on_spark.rst b/umn/source/faq/mrs_overview/does_an_mrs_cluster_support_hive_on_spark.rst new file mode 100644 index 0000000..2ba4d9d --- /dev/null +++ b/umn/source/faq/mrs_overview/does_an_mrs_cluster_support_hive_on_spark.rst @@ -0,0 +1,10 @@ +:original_name: mrs_03_1048.html + +.. _mrs_03_1048: + +Does an MRS Cluster Support Hive on Spark? +========================================== + +- Clusters of MRS 1.9.\ *x* support Hive on Spark. +- Clusters of MRS 3.\ *x* or later support Hive on Spark. +- You can use Hive on Tez for the clusters of other versions. diff --git a/umn/source/faq/mrs_overview/does_mrs_support_change_of_the_network_segment.rst b/umn/source/faq/mrs_overview/does_mrs_support_change_of_the_network_segment.rst new file mode 100644 index 0000000..a7c79d2 --- /dev/null +++ b/umn/source/faq/mrs_overview/does_mrs_support_change_of_the_network_segment.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1019.html + +.. _mrs_03_1019: + +Does MRS Support Change of the Network Segment? +=============================================== + +You can change the network segment. On the cluster **Dashboard** page of MRS console, click **Change Subnet** to the right of **Default Subnet**, and select a subnet in the VPC of the cluster to expand subnet IP addresses. Selecting a new subnet will not change the IP addresses and subnets of existing nodes. diff --git a/umn/source/faq/mrs_overview/does_mrs_support_running_hive_on_kudu.rst b/umn/source/faq/mrs_overview/does_mrs_support_running_hive_on_kudu.rst new file mode 100644 index 0000000..e378550 --- /dev/null +++ b/umn/source/faq/mrs_overview/does_mrs_support_running_hive_on_kudu.rst @@ -0,0 +1,13 @@ +:original_name: mrs_03_1068.html + +.. _mrs_03_1068: + +Does MRS Support Running Hive on Kudu? +====================================== + +MRS does not support Hive on Kudu. + +Currently, MRS supports only the following two methods to access Kudu: + +- Access Kudu through Impala tables. +- Access and operate Kudu tables using the client application. 
diff --git a/umn/source/faq/mrs_overview/how_do_i_create_an_mrs_cluster_using_a_custom_security_group.rst b/umn/source/faq/mrs_overview/how_do_i_create_an_mrs_cluster_using_a_custom_security_group.rst new file mode 100644 index 0000000..64e6908 --- /dev/null +++ b/umn/source/faq/mrs_overview/how_do_i_create_an_mrs_cluster_using_a_custom_security_group.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1012.html + +.. _mrs_03_1012: + +How Do I Create an MRS Cluster Using a Custom Security Group? +============================================================= + +If you want to use a self-defined security group when buying a cluster, you need to enable port 9022 or select **Auto create** in **Security Group** on the MRS console. diff --git a/umn/source/faq/mrs_overview/how_do_i_enable_different_service_programs_to_use_different_yarn_queues.rst b/umn/source/faq/mrs_overview/how_do_i_enable_different_service_programs_to_use_different_yarn_queues.rst new file mode 100644 index 0000000..1f11bb3 --- /dev/null +++ b/umn/source/faq/mrs_overview/how_do_i_enable_different_service_programs_to_use_different_yarn_queues.rst @@ -0,0 +1,106 @@ +:original_name: mrs_03_1221.html + +.. _mrs_03_1221: + +How Do I Enable Different Service Programs to Use Different YARN Queues? +======================================================================== + +Create a tenant on Manager. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **Tenant Resources**. + +#. In the tenant list on the left, select a parent tenant and click |image1|. On the page for adding a sub-tenant, set attributes for the sub-tenant according to :ref:`Table 1 `. + + .. _mrs_03_1221__admin_guide_000119_tc983b52ccd084798871c7fa2b49856dd: + + .. table:: **Table 1** Sub-tenant parameters + + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +========================================+==========================================================================================================================================================================================================================================================================================+ + | Cluster | Indicates the cluster to which the parent tenant belongs. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parent Tenant Resource | Indicates the name of the parent tenant. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | - Indicates the name of the current tenant. The value consists of 3 to 50 characters, including digits, letters, and underscores (_). | + | | - Plan a sub-tenant name based on service requirements. 
The name cannot be the same as that of a role, HDFS directory, or Yarn queue that exists in the current cluster. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant Type | Specifies whether the tenant is a leaf tenant. | + | | | + | | - When **Leaf Tenant** is selected, the current tenant is a leaf tenant and no sub-tenant can be added. | + | | - When **Non-leaf Tenant** is selected, the current tenant is not a leaf tenant and sub-tenants can be added to the current tenant. However, the tenant depth cannot exceed 5 levels. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Computing Resource | Specifies the dynamic computing resources for the current tenant. | + | | | + | | - When **Yarn** is selected, the system automatically creates a queue in Yarn and the queue is named the same as the sub-tenant name. | + | | | + | | - A leaf tenant can directly submit jobs to the queue. | + | | - A non-leaf tenant cannot directly submit jobs to the queue. However, Yarn adds an extra queue (hidden) named **default** for the non-leaf tenant to record the remaining resource capacity of the tenant. Actual jobs do not run in this queue. | + | | | + | | - If **Yarn** is not selected, the system does not automatically create a queue. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Capacity (%) | Indicates the percentage of computing resources used by the current tenant. The base value is the total resources of the parent tenant. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Max Capacity (%) | Indicates the maximum percentage of computing resources used by the current tenant. The base value is the total resources of the parent tenant. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Resource | Specifies storage resources for the current tenant. | + | | | + | | - When **HDFS** is selected, the system automatically creates a folder named after the sub-tenant in the HDFS parent tenant directory. | + | | - When **HDFS** is not selected, the system does not automatically allocate storage resources. 
| + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Quota | Indicates the quota for files and directories. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Space Quota | Indicates the quota for the HDFS storage space used by the current tenant. | + | | | + | | - If the unit is set to **MB**, the value ranges from **1** to **8796093022208**. If the unit is set to **GB**, the value ranges from **1** to **8589934592**. | + | | - This parameter indicates the maximum HDFS storage space that can be used by the tenant, but not the actual space used. | + | | - If its value is greater than the size of the HDFS physical disk, the maximum space available is the full space of the HDFS physical disk. | + | | - If this quota is greater than the quota of the parent tenant, the actual storage space does not exceed the quota of the parent tenant. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Path | Indicates the HDFS storage directory for the tenant. | + | | | + | | - The system automatically creates a folder named after the sub-tenant name in the directory of the parent tenant by default. For example, if the sub-tenant is **ta1s** and the parent directory is **/tenant/ta1**, the storage path for the sub-tenant is then **/tenant/ta1/ta1s**. | + | | - The storage path is customizable in the parent directory. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Indicates the description of the current tenant. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + Roles, computing resources, and storage resources are automatically created when tenants are created. + + - The new role has permissions on the computing and storage resources. This role and its permissions are automatically controlled by the system and cannot be manually managed by choosing **System** > **Permission** > **Role**. The role name is in the format of *Tenant name*\ \_\ *Cluster ID*. The ID of the first cluster is not displayed by default. + - When using this tenant, create a system user and bind the user to the role of the tenant. For details, see :ref:`Adding a User and Binding the User to a Tenant Role `. 
+ - The sub-tenant can further allocate the resources of its parent tenant. The sum of the resource percentages of direct sub-tenants under a parent tenant at each level cannot exceed 100%. The sum of the computing resource percentages of all level-1 tenants cannot exceed 100%. + +#. Check whether the current tenant needs to be associated with resources of other services. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +#. .. _mrs_03_1221__admin_guide_000119_lcdfcd36b99d84c3ba2f290f976ade15b: + + Click **Associate Service** to configure other service resources used by the current tenant. + + a. Set **Services** to **HBase**. + b. Set **Association Type** as follows: + + - **Exclusive** indicates that the service resources are used by the tenant exclusively and cannot be associated with other tenants. + - **Shared** indicates that the service resources can be shared with other tenants. + + .. note:: + + - Only HBase can be associated with a new tenant. However, HDFS, HBase, and Yarn can be associated with existing tenants. + - To associate an existing tenant with service resources, click the target tenant in the tenant list, switch to the **Service Associations** page, and click **Associate Service** to configure resources to be associated with the tenant. + - To disassociate an existing tenant from service resources, click the target tenant in the tenant list, switch to the **Service Associations** page, and click **Delete** in the **Operation** column. In the displayed dialog box, select **I have read the information and understand the impact** and click **OK**. + + c. Click **OK**. + +#. .. _mrs_03_1221__admin_guide_000119_l93b6a287f2a9444f9b34fcbcc1e595ac: + + Click **OK**. Wait until the system displays a message indicating that the tenant is successfully created. + +.. |image1| image:: /_static/images/en-us_image_0263899238.png diff --git a/umn/source/faq/mrs_overview/how_do_i_obtain_the_hadoop_pressure_test_tool.rst b/umn/source/faq/mrs_overview/how_do_i_obtain_the_hadoop_pressure_test_tool.rst new file mode 100644 index 0000000..cb13d1e --- /dev/null +++ b/umn/source/faq/mrs_overview/how_do_i_obtain_the_hadoop_pressure_test_tool.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1092.html + +.. _mrs_03_1092: + +How Do I Obtain the Hadoop Pressure Test Tool? +============================================== + +Download it from https://github.com/Intel-bigdata/HiBench. diff --git a/umn/source/faq/mrs_overview/how_do_i_unbind_an_eip_from_an_mrs_cluster_node.rst b/umn/source/faq/mrs_overview/how_do_i_unbind_an_eip_from_an_mrs_cluster_node.rst new file mode 100644 index 0000000..4375498 --- /dev/null +++ b/umn/source/faq/mrs_overview/how_do_i_unbind_an_eip_from_an_mrs_cluster_node.rst @@ -0,0 +1,20 @@ +:original_name: mrs_03_1246.html + +.. _mrs_03_1246: + +How Do I Unbind an EIP from an MRS Cluster Node? +================================================ + +Symptom +------- + +After an EIP is bound on the console, the EIP cannot be unbound in the EIP module of the VPC service. + +A dialog box is displayed, indicating that the operation cannot be performed because the EIP is being used by MapReduce. + +Procedure +--------- + +#. Log in to the VPC console and choose **Virtual Private Cloud** > **My VPCs**. Find the target VPC in the VPC list. +#. Click the VPC name to go to the **Summary** tab page and click the number next to **Subnets** in the **Networking Components** area to find the subnet to which the cluster belongs. +#. In the subnet list, click the target subnet name. 
Click the **IP Addresses** tab, locate the target public IP address and click **Unbind from EIP** in the **Operation** column. diff --git a/umn/source/faq/mrs_overview/how_do_i_use_mrs.rst b/umn/source/faq/mrs_overview/how_do_i_use_mrs.rst new file mode 100644 index 0000000..50bd61a --- /dev/null +++ b/umn/source/faq/mrs_overview/how_do_i_use_mrs.rst @@ -0,0 +1,16 @@ +:original_name: mrs_03_1013.html + +.. _mrs_03_1013: + +How Do I Use MRS? +================= + +MapReduce Service (MRS) is a service you can use to deploy and manage Hadoop-based components on the Cloud. It enables you to deploy Hadoop clusters with a few clicks. MRS provides enterprise-ready big data clusters in the cloud. Tenants can fully control the clusters and easily run big data components such as Hadoop, Spark, HBase, Kafka, and Storm in the clusters. + +MRS is easy to use. You can execute various tasks and process or store PB-scale data using computers connected in a cluster. To use MRS, do as follows: + +#. Upload local programs and data files to OBS. +#. Create a cluster. You need to specify the cluster type (for example, analysis or streaming), and set ECS instance specifications, number of instances, data disk type (common I/O, high I/O, and ultra-high I/O), and components to be installed, such as Hadoop, Spark, HBase, Hive, Kafka, and Storm, in a cluster. You can use a bootstrap action to install third-party software or modify the cluster running environment on a node before or after the cluster is started. +#. Use MRS to submit, execute, and monitor your programs. +#. Manage clusters on MRS Manager, an enterprise-level unified management platform of big data clusters. You can learn about the health status of services and hosts, obtain critical system information in a timely manner from graphical metric monitoring and customization, modify service attributes based on performance requirements, and start or stop clusters, services, and role instances. +#. Terminate any MRS cluster that you do not require after job execution is complete. diff --git a/umn/source/faq/mrs_overview/index.rst b/umn/source/faq/mrs_overview/index.rst new file mode 100644 index 0000000..d5551ef --- /dev/null +++ b/umn/source/faq/mrs_overview/index.rst @@ -0,0 +1,72 @@ +:original_name: mrs_03_0002.html + +.. _mrs_03_0002: + +MRS Overview +============ + +- :ref:`What Is MRS Used For? ` +- :ref:`What Types of Distributed Storage Does MRS Support? ` +- :ref:`How Do I Create an MRS Cluster Using a Custom Security Group? ` +- :ref:`How Do I Use MRS? ` +- :ref:`Can I Configure a Phoenix Connection Pool? ` +- :ref:`Does MRS Support Change of the Network Segment? ` +- :ref:`Can I Downgrade the Specifications of an MRS Cluster Node? ` +- :ref:`What Is the Relationship Between Hive and Other Components? ` +- :ref:`Does an MRS Cluster Support Hive on Spark? ` +- :ref:`What Are the Differences Between Hive Versions? ` +- :ref:`Which MRS Cluster Version Supports Hive Connection and User Synchronization? ` +- :ref:`What Are the Differences Between OBS and HDFS in Data Storage? ` +- :ref:`How Do I Obtain the Hadoop Pressure Test Tool? ` +- :ref:`What Is the Relationship Between Impala and Other Components? ` +- :ref:`Statement About the Public IP Addresses in the Open-Source Third-Party SDK Integrated by MRS ` +- :ref:`What Is the Relationship Between Kudu and HBase? ` +- :ref:`Does MRS Support Running Hive on Kudu? ` +- :ref:`What Are the Solutions for processing 1 Billion Data Records? ` +- :ref:`Can I Change the IP address of DBService? 
` +- :ref:`Can I Clear MRS sudo Logs? ` +- :ref:`Is the Storm Log also limited to 20 GB in MRS cluster 2.1.0? ` +- :ref:`What Is Spark ThriftServer? ` +- :ref:`What Access Protocols Are Supported by Kafka? ` +- :ref:`What Is the Compression Ratio of zstd? ` +- :ref:`Why Are the HDFS, YARN, and MapReduce Components Unavailable When an MRS Cluster Is Created? ` +- :ref:`Why Is the ZooKeeper Component Unavailable When an MRS Cluster Is Created? ` +- :ref:`Which Python Versions Are Supported by Spark Tasks in an MRS 3.1.0 Cluster? ` +- :ref:`How Do I Enable Different Service Programs to Use Different YARN Queues? ` +- :ref:`Differences and Relationships Between the MRS Management Console and Cluster Manager ` +- :ref:`How Do I Unbind an EIP from an MRS Cluster Node? ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + what_is_mrs_used_for + what_types_of_distributed_storage_does_mrs_support + how_do_i_create_an_mrs_cluster_using_a_custom_security_group + how_do_i_use_mrs + can_i_configure_a_phoenix_connection_pool + does_mrs_support_change_of_the_network_segment + can_i_downgrade_the_specifications_of_an_mrs_cluster_node + what_is_the_relationship_between_hive_and_other_components + does_an_mrs_cluster_support_hive_on_spark + what_are_the_differences_between_hive_versions + which_mrs_cluster_version_supports_hive_connection_and_user_synchronization + what_are_the_differences_between_obs_and_hdfs_in_data_storage + how_do_i_obtain_the_hadoop_pressure_test_tool + what_is_the_relationship_between_impala_and_other_components + statement_about_the_public_ip_addresses_in_the_open-source_third-party_sdk_integrated_by_mrs + what_is_the_relationship_between_kudu_and_hbase + does_mrs_support_running_hive_on_kudu + what_are_the_solutions_for_processing_1_billion_data_records + can_i_change_the_ip_address_of_dbservice + can_i_clear_mrs_sudo_logs + is_the_storm_log_also_limited_to_20_gb_in_mrs_cluster_2.1.0 + what_is_spark_thriftserver + what_access_protocols_are_supported_by_kafka + what_is_the_compression_ratio_of_zstd + why_are_the_hdfs,_yarn,_and_mapreduce_components_unavailable_when_an_mrs_cluster_is_created + why_is_the_zookeeper_component_unavailable_when_an_mrs_cluster_is_created + which_python_versions_are_supported_by_spark_tasks_in_an_mrs_3.1.0_cluster + how_do_i_enable_different_service_programs_to_use_different_yarn_queues + differences_and_relationships_between_the_mrs_management_console_and_cluster_manager + how_do_i_unbind_an_eip_from_an_mrs_cluster_node diff --git a/umn/source/faq/mrs_overview/is_the_storm_log_also_limited_to_20_gb_in_mrs_cluster_2.1.0.rst b/umn/source/faq/mrs_overview/is_the_storm_log_also_limited_to_20_gb_in_mrs_cluster_2.1.0.rst new file mode 100644 index 0000000..d172da2 --- /dev/null +++ b/umn/source/faq/mrs_overview/is_the_storm_log_also_limited_to_20_gb_in_mrs_cluster_2.1.0.rst @@ -0,0 +1,8 @@ +:original_name: en-us_topic_0000001145676237.html + +.. _en-us_topic_0000001145676237: + +Is the Storm Log also limited to 20 GB in MRS cluster 2.1.0? +============================================================ + +In MRS cluster 2.1.0, the Storm log cannot exceed 20 GB. If the Storm log exceeds 20 GB, the log files will be deleted cyclically. Logs are stored on the system disk, therefore, the log space is limited. If you want to keep the log for longer time, mount the log directory to storage media. 
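+
+For example, the following sketch relocates the Storm log directory to larger storage. It assumes that an additional data disk has already been formatted and mounted at **/mnt/storm-logs** and that the Storm logs are written to **/var/log/Bigdata/storm**; both paths are examples and need to be adapted to the actual environment.
+
+.. code-block::
+
+   # Copy the existing logs to the new storage.
+   cp -a /var/log/Bigdata/storm/. /mnt/storm-logs/
+   # Remount the log directory onto the larger disk.
+   mount --bind /mnt/storm-logs /var/log/Bigdata/storm
+   # Make the bind mount persistent across reboots.
+   echo "/mnt/storm-logs /var/log/Bigdata/storm none bind 0 0" >> /etc/fstab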
diff --git a/umn/source/faq/mrs_overview/statement_about_the_public_ip_addresses_in_the_open-source_third-party_sdk_integrated_by_mrs.rst b/umn/source/faq/mrs_overview/statement_about_the_public_ip_addresses_in_the_open-source_third-party_sdk_integrated_by_mrs.rst new file mode 100644 index 0000000..b18386f --- /dev/null +++ b/umn/source/faq/mrs_overview/statement_about_the_public_ip_addresses_in_the_open-source_third-party_sdk_integrated_by_mrs.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_2022.html + +.. _mrs_03_2022: + +Statement About the Public IP Addresses in the Open-Source Third-Party SDK Integrated by MRS +============================================================================================ + +The open-source third-party packages on which the open-source components integrated by MRS depend contain SDK usage examples. Public IP addresses such as 12.1.2.3, 54.123.4.56, 203.0.113.0, and 203.0.113.12 are example IP addresses. MRS will not initiate a connection to the public IP address or exchange data with the public IP address. diff --git a/umn/source/faq/mrs_overview/what_access_protocols_are_supported_by_kafka.rst b/umn/source/faq/mrs_overview/what_access_protocols_are_supported_by_kafka.rst new file mode 100644 index 0000000..cff0d2d --- /dev/null +++ b/umn/source/faq/mrs_overview/what_access_protocols_are_supported_by_kafka.rst @@ -0,0 +1,8 @@ +:original_name: en-us_topic_0000001098596550.html + +.. _en-us_topic_0000001098596550: + +What Access Protocols Are Supported by Kafka? +============================================= + +Kafka supports PLAINTEXT, SSL, SASL_PLAINTEXT, and SASL_SSL. diff --git a/umn/source/faq/mrs_overview/what_are_the_differences_between_hive_versions.rst b/umn/source/faq/mrs_overview/what_are_the_differences_between_hive_versions.rst new file mode 100644 index 0000000..dba79b4 --- /dev/null +++ b/umn/source/faq/mrs_overview/what_are_the_differences_between_hive_versions.rst @@ -0,0 +1,16 @@ +:original_name: mrs_03_1081.html + +.. _mrs_03_1081: + +What Are the Differences Between Hive Versions? +=============================================== + +Hive 3.1 has the following differences when compared with Hive 1.2: + +- String cannot be converted to int. +- The user-defined functions (UDFs) of the **Date** type are changed to Hive built-in UDFs. +- Hive 3.1 does not provide the index function anymore. +- Hive 3.1 uses the UTC time in time functions, while Hive 1.2 uses the local time zone. +- The JDBC drivers in Hive 3.1 and Hive 1.2 are incompatible. +- In Hive 3.1, column names in ORC files are case-sensitive and underscores-sensitive. +- Hive 3.1 does not allow columns named **time**. diff --git a/umn/source/faq/mrs_overview/what_are_the_differences_between_obs_and_hdfs_in_data_storage.rst b/umn/source/faq/mrs_overview/what_are_the_differences_between_obs_and_hdfs_in_data_storage.rst new file mode 100644 index 0000000..3bf515d --- /dev/null +++ b/umn/source/faq/mrs_overview/what_are_the_differences_between_obs_and_hdfs_in_data_storage.rst @@ -0,0 +1,11 @@ +:original_name: mrs_03_1062.html + +.. _mrs_03_1062: + +What Are the Differences Between OBS and HDFS in Data Storage? +============================================================== + +The data processed by MRS is from OBS or HDFS. OBS is an object-based storage service that provides secure, reliable, and cost-effective storage of huge amounts of data. MRS can directly process data in OBS. You can view, manage, and use data by using the OBS console or OBS client. 
In addition, you can use REST APIs independently or integrate APIs to service applications to manage and access data. + +- Data stored in OBS: Data storage is decoupled from compute. The cluster storage cost is low, and storage capacity is not limited. Clusters can be deleted at any time. However, the computing performance depends on the OBS access performance and is lower than that of HDFS. OBS is recommended for applications that do not demand a lot of computation. +- Data stored in HDFS: Data storage is not decoupled from compute. The cluster storage cost is high, and storage capacity is limited. The computing performance is high. You must export data before you delete clusters. HDFS is recommended for computing-intensive scenarios. diff --git a/umn/source/faq/mrs_overview/what_are_the_solutions_for_processing_1_billion_data_records.rst b/umn/source/faq/mrs_overview/what_are_the_solutions_for_processing_1_billion_data_records.rst new file mode 100644 index 0000000..bdf384e --- /dev/null +++ b/umn/source/faq/mrs_overview/what_are_the_solutions_for_processing_1_billion_data_records.rst @@ -0,0 +1,9 @@ +:original_name: mrs_03_1133.html + +.. _mrs_03_1133: + +What Are the Solutions for processing 1 Billion Data Records? +============================================================= + +- GaussDB (for MySQL) is recommended for scenarios, such as data updates, online transaction processing (OLTP), and complex analysis of 1 billion data records. +- Impala and Kudu in MRS also meet this requirement. Impala and Kudu can load all join tables to the memory in the join operation. diff --git a/umn/source/faq/mrs_overview/what_is_mrs_used_for.rst b/umn/source/faq/mrs_overview/what_is_mrs_used_for.rst new file mode 100644 index 0000000..d9a3886 --- /dev/null +++ b/umn/source/faq/mrs_overview/what_is_mrs_used_for.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_0001.html + +.. _mrs_03_0001: + +What Is MRS Used For? +===================== + +MapReduce Service (MRS) is an enterprise-grade big data platform that allows you to quickly build and operate economical, secure, full-stack, cloud-native big data environments on the cloud. It provides engines such as ClickHouse, Spark, Flink, Kafka, and HBase, and supports convergence of data lake, data warehouse, business intelligence (BI), and artificial intelligence (AI). Fully compatible with open-source components, MRS helps you rapidly innovate and expand service growth. diff --git a/umn/source/faq/mrs_overview/what_is_spark_thriftserver.rst b/umn/source/faq/mrs_overview/what_is_spark_thriftserver.rst new file mode 100644 index 0000000..4cf6a96 --- /dev/null +++ b/umn/source/faq/mrs_overview/what_is_spark_thriftserver.rst @@ -0,0 +1,8 @@ +:original_name: en-us_topic_0000001145596307.html + +.. _en-us_topic_0000001145596307: + +What Is Spark ThriftServer? +=========================== + +ThriftServer is a JDBC API. You can use JDBC to connect to ThriftServer to access SparkSQL data. Therefore, you can see JDBCServer in Spark components, but not ThriftServer. diff --git a/umn/source/faq/mrs_overview/what_is_the_compression_ratio_of_zstd.rst b/umn/source/faq/mrs_overview/what_is_the_compression_ratio_of_zstd.rst new file mode 100644 index 0000000..55e3abc --- /dev/null +++ b/umn/source/faq/mrs_overview/what_is_the_compression_ratio_of_zstd.rst @@ -0,0 +1,8 @@ +:original_name: en-us_topic_0000001145597621.html + +.. _en-us_topic_0000001145597621: + +What Is the Compression Ratio of zstd? 
+====================================== + +Zstandard (zstd) is an open-source fast lossless compression algorithm. The compression ratio of zstd is twice that of orc. For details, see https://github.com/L-Angel/compress-demo. CarbonData does not support lzo, and MRS has zstd integrated. diff --git a/umn/source/faq/mrs_overview/what_is_the_relationship_between_hive_and_other_components.rst b/umn/source/faq/mrs_overview/what_is_the_relationship_between_hive_and_other_components.rst new file mode 100644 index 0000000..2f9a0e3 --- /dev/null +++ b/umn/source/faq/mrs_overview/what_is_the_relationship_between_hive_and_other_components.rst @@ -0,0 +1,22 @@ +:original_name: mrs_03_1046.html + +.. _mrs_03_1046: + +What Is the Relationship Between Hive and Other Components? +=========================================================== + +- Hive and HDFS + + Hive is an Apache Hadoop project. Hive uses Hadoop Distributed File System (HDFS) as its file storage system. Hive parses and processes structured data stored on HDFS. All data files in the Hive database are stored in HDFS, and all data operations on Hive are also performed using HDFS APIs. + +- Hive and MapReduce + + All data computing of Hive depends on MapReduce. MapReduce, also an Apache Hadoop project, is a parallel computing framework based on HDFS. During data analysis, Hive parses HiveQL statements submitted by users into MapReduce tasks and submits the tasks for MapReduce to execute. + +- Hive and DBService + + MetaStore (metadata service) of Hive processes the structure and attribute information about Hive databases, tables, and partitions that are stored in a relational database. In MRS, the relational database is maintained by DBService. + +- Hive and Spark + + Hive data computing can also be implemented on Spark. Spark, also an Apache project, is an in-memory distributed computing framework. During data analysis, Hive parses HiveQL statements submitted by users into Spark tasks and submits the tasks for Spark to execute. diff --git a/umn/source/faq/mrs_overview/what_is_the_relationship_between_impala_and_other_components.rst b/umn/source/faq/mrs_overview/what_is_the_relationship_between_impala_and_other_components.rst new file mode 100644 index 0000000..3565350 --- /dev/null +++ b/umn/source/faq/mrs_overview/what_is_the_relationship_between_impala_and_other_components.rst @@ -0,0 +1,30 @@ +:original_name: mrs_03_1065.html + +.. _mrs_03_1065: + +What Is the Relationship Between Impala and Other Components? +============================================================= + +- Impala and HDFS + + Impala uses HDFS as its file storage system. Impala parses and processes structured data, while HDFS provides reliable underlying storage. Impala provides fast data access without moving data in HDFS. + +- Impala and Hive + + Impala uses Hive metadata, Open Database Connectivity (ODBC) driver, and SQL syntax. Unlike Hive, which is over MapReduce, Impala implements a distributed architecture based on daemon and handles all query executions on the same node. Therefore, Impala is faster than Hive by reducing the latency caused by MapReduce. + +- Impala and MapReduce + + None + +- Impala and Spark + + None + +- Impala and Kudu + + Kudu can be closely integrated with Impala to replace the combination of Impala, HDFS, and Parquet. You can insert, query, update, and delete data in Kudu tablets using Impala's SQL syntax. In addition, you can use JDBC or ODBC to connect to Kudu for data operations, using Impala as the broker. 
+ +- Impala and HBase + + The default Impala tables use data files stored in HDFS, which is ideal for batch loading and query of full table scanning. However, HBase provides convenient and efficient query of OLTP-style organization data. diff --git a/umn/source/faq/mrs_overview/what_is_the_relationship_between_kudu_and_hbase.rst b/umn/source/faq/mrs_overview/what_is_the_relationship_between_kudu_and_hbase.rst new file mode 100644 index 0000000..56f4ba7 --- /dev/null +++ b/umn/source/faq/mrs_overview/what_is_the_relationship_between_kudu_and_hbase.rst @@ -0,0 +1,11 @@ +:original_name: mrs_03_1066.html + +.. _mrs_03_1066: + +What Is the Relationship Between Kudu and HBase? +================================================ + +Kudu is designed based on the HBase structure and can implement fast random read/write and update functions that HBase is good at. Kudu and HBase are similar in architecture. The differences are as follows: + +- HBase uses ZooKeeper to ensure data consistency, whereas Kudu uses the Raft consensus algorithm to ensure consistency. +- HBase uses HDFS for resilient data storage, whereas Kudu uses TServer to ensure strong data consistency and reliability. diff --git a/umn/source/faq/mrs_overview/what_types_of_distributed_storage_does_mrs_support.rst b/umn/source/faq/mrs_overview/what_types_of_distributed_storage_does_mrs_support.rst new file mode 100644 index 0000000..d0fa63f --- /dev/null +++ b/umn/source/faq/mrs_overview/what_types_of_distributed_storage_does_mrs_support.rst @@ -0,0 +1,80 @@ +:original_name: mrs_03_1005.html + +.. _mrs_03_1005: + +What Types of Distributed Storage Does MRS Support? +=================================================== + +MRS supports Hadoop 3.1.\ *x* and will soon support other mainstream Hadoop versions released by the community. :ref:`Table 1 ` lists the component versions supported by MRS. + +.. _mrs_03_1005__table568330975: + +.. 
table:: **Table 1** MRS component versions + + +---------------------------------+-----------------------------------------+------------+ + | Component | MRS 1.9.2 (Applicable to MRS 1.9.\ *x*) | MRS 3.1.0 | + +=================================+=========================================+============+ + | Alluxio | 2.0.1 | N/A | + +---------------------------------+-----------------------------------------+------------+ + | CarbonData | 1.6.1 | 2.0.1 | + +---------------------------------+-----------------------------------------+------------+ + | DBService | 1.0.0 | 2.7.0 | + +---------------------------------+-----------------------------------------+------------+ + | Flink | 1.7.0 | 1.12.0 | + +---------------------------------+-----------------------------------------+------------+ + | Flume | 1.6.0 | 1.9.0 | + +---------------------------------+-----------------------------------------+------------+ + | HBase | 1.3.1 | 2.2.3 | + +---------------------------------+-----------------------------------------+------------+ + | HDFS | 2.8.3 | 3.1.1 | + +---------------------------------+-----------------------------------------+------------+ + | Hive | 2.3.3 | 3.1.0 | + +---------------------------------+-----------------------------------------+------------+ + | Hudi | N/A | 0.7.0 | + +---------------------------------+-----------------------------------------+------------+ + | Hue | 3.11.0 | 4.7.0 | + +---------------------------------+-----------------------------------------+------------+ + | Impala | N/A | 3.4.0 | + +---------------------------------+-----------------------------------------+------------+ + | Kafka | 1.1.0 | 2.11-2.4.0 | + +---------------------------------+-----------------------------------------+------------+ + | KafkaManager | 1.3.3.1 | N/A | + +---------------------------------+-----------------------------------------+------------+ + | KrbServer | 1.15.2 | 1.17 | + +---------------------------------+-----------------------------------------+------------+ + | Kudu | N/A | 1.12.1 | + +---------------------------------+-----------------------------------------+------------+ + | LdapServer | 1.0.0 | 2.7.0 | + +---------------------------------+-----------------------------------------+------------+ + | Loader | 2.0.0 | N/A | + +---------------------------------+-----------------------------------------+------------+ + | MapReduce | 2.8.3 | 3.1.1 | + +---------------------------------+-----------------------------------------+------------+ + | Oozie | N/A | 5.1.0 | + +---------------------------------+-----------------------------------------+------------+ + | Opentsdb | 2.3.0 | N/A | + +---------------------------------+-----------------------------------------+------------+ + | Presto | 0.216 | 333 | + +---------------------------------+-----------------------------------------+------------+ + | Phoenix (integrated with HBase) | N/A | 5.0.0 | + +---------------------------------+-----------------------------------------+------------+ + | Ranger | 1.0.1 | 2.0.0 | + +---------------------------------+-----------------------------------------+------------+ + | Spark | 2.2.2 | N/A | + +---------------------------------+-----------------------------------------+------------+ + | Spark2x | N/A | 2.4.5 | + +---------------------------------+-----------------------------------------+------------+ + | Sqoop | N/A | 1.4.7 | + +---------------------------------+-----------------------------------------+------------+ + | Storm | 1.2.1 | N/A | + 
+---------------------------------+-----------------------------------------+------------+ + | Tez | 0.9.1 | 0.9.2 | + +---------------------------------+-----------------------------------------+------------+ + | YARN | 2.8.3 | 3.1.1 | + +---------------------------------+-----------------------------------------+------------+ + | ZooKeeper | 3.5.1 | 3.5.6 | + +---------------------------------+-----------------------------------------+------------+ + | MRS Manager | 1.9.2 | N/A | + +---------------------------------+-----------------------------------------+------------+ + | FusionInsight Manager | N/A | 8.1.0 | + +---------------------------------+-----------------------------------------+------------+ diff --git a/umn/source/faq/mrs_overview/which_mrs_cluster_version_supports_hive_connection_and_user_synchronization.rst b/umn/source/faq/mrs_overview/which_mrs_cluster_version_supports_hive_connection_and_user_synchronization.rst new file mode 100644 index 0000000..c02c32c --- /dev/null +++ b/umn/source/faq/mrs_overview/which_mrs_cluster_version_supports_hive_connection_and_user_synchronization.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1095.html + +.. _mrs_03_1095: + +Which MRS Cluster Version Supports Hive Connection and User Synchronization? +============================================================================= + +MRS 2.0.5 or later supports Hive connections on DataArts Studio and provides the IAM user synchronization function. diff --git a/umn/source/faq/mrs_overview/which_python_versions_are_supported_by_spark_tasks_in_an_mrs_3.1.0_cluster.rst b/umn/source/faq/mrs_overview/which_python_versions_are_supported_by_spark_tasks_in_an_mrs_3.1.0_cluster.rst new file mode 100644 index 0000000..cc00d17 --- /dev/null +++ b/umn/source/faq/mrs_overview/which_python_versions_are_supported_by_spark_tasks_in_an_mrs_3.1.0_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1216.html + +.. _mrs_03_1216: + +Which Python Versions Are Supported by Spark Tasks in an MRS 3.1.0 Cluster? +============================================================================ + +For MRS 3.1.0 clusters, Python 2.7 or 3.\ *x* is recommended for Spark tasks. diff --git a/umn/source/faq/mrs_overview/why_are_the_hdfs,_yarn,_and_mapreduce_components_unavailable_when_an_mrs_cluster_is_created.rst b/umn/source/faq/mrs_overview/why_are_the_hdfs,_yarn,_and_mapreduce_components_unavailable_when_an_mrs_cluster_is_created.rst new file mode 100644 index 0000000..ef6d11e --- /dev/null +++ b/umn/source/faq/mrs_overview/why_are_the_hdfs,_yarn,_and_mapreduce_components_unavailable_when_an_mrs_cluster_is_created.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1202.html + +.. _mrs_03_1202: + +Why Are the HDFS, YARN, and MapReduce Components Unavailable When an MRS Cluster Is Created? +============================================================================================= + +The HDFS, YARN, and MapReduce components are integrated in Hadoop. If the three components are unavailable when an MRS cluster is created, select Hadoop instead. After an MRS cluster is created, HDFS, YARN, and MapReduce are available on the **Components** page.
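+
+If you want to double-check the three services from a cluster node after the cluster is created, a quick sanity check such as the following can be used. This is only an illustrative sketch: it assumes you have shell access to a node with the Hadoop client installed, that the client environment variables are loaded (the **/opt/client/bigdata_env** path below is an example and may differ in your environment), and that you have authenticated with **kinit** first in a security-mode cluster. The commands themselves are standard Hadoop CLI calls.
+
+.. code-block::
+
+   source /opt/client/bigdata_env   # example client path; adjust to your installation
+   hdfs dfs -ls /                   # HDFS responds if the NameNode is running
+   yarn node -list                  # YARN lists the running NodeManagers
+   mapred job -list                 # MapReduce lists submitted jobs (an empty list is normal)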
diff --git a/umn/source/faq/mrs_overview/why_is_the_zookeeper_component_unavailable_when_an_mrs_cluster_is_created.rst b/umn/source/faq/mrs_overview/why_is_the_zookeeper_component_unavailable_when_an_mrs_cluster_is_created.rst new file mode 100644 index 0000000..a3297da --- /dev/null +++ b/umn/source/faq/mrs_overview/why_is_the_zookeeper_component_unavailable_when_an_mrs_cluster_is_created.rst @@ -0,0 +1,12 @@ +:original_name: mrs_03_1204.html + +.. _mrs_03_1204: + +Why Is the ZooKeeper Component Unavailable When an MRS Cluster Is Created? +========================================================================== + +If you create a cluster of a version earlier than MRS 3.\ *x*, ZooKeeper is installed by default and is not displayed on the GUI. + +If you create a cluster of MRS 3.\ *x* or later, ZooKeeper is available on the GUI and is selected by default. + +After the cluster is created, the ZooKeeper component is available on the **Components** page. diff --git a/umn/source/faq/performance_tuning/can_i_change_the_os_of_an_mrs_cluster.rst b/umn/source/faq/performance_tuning/can_i_change_the_os_of_an_mrs_cluster.rst new file mode 100644 index 0000000..968cb8f --- /dev/null +++ b/umn/source/faq/performance_tuning/can_i_change_the_os_of_an_mrs_cluster.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1203.html + +.. _mrs_03_1203: + +Can I Change the OS of an MRS Cluster? +====================================== + +The OS of an MRS cluster cannot be changed. diff --git a/umn/source/faq/performance_tuning/does_an_mrs_cluster_support_system_reinstallation.rst b/umn/source/faq/performance_tuning/does_an_mrs_cluster_support_system_reinstallation.rst new file mode 100644 index 0000000..292920a --- /dev/null +++ b/umn/source/faq/performance_tuning/does_an_mrs_cluster_support_system_reinstallation.rst @@ -0,0 +1,8 @@ +:original_name: mrs_03_1017.html + +.. _mrs_03_1017: + +Does an MRS Cluster Support System Reinstallation? +================================================== + +An MRS cluster does not support system reinstallation. diff --git a/umn/source/faq/performance_tuning/how_do_i_improve_the_resource_utilization_of_core_nodes_in_a_cluster.rst b/umn/source/faq/performance_tuning/how_do_i_improve_the_resource_utilization_of_core_nodes_in_a_cluster.rst new file mode 100644 index 0000000..6291a21 --- /dev/null +++ b/umn/source/faq/performance_tuning/how_do_i_improve_the_resource_utilization_of_core_nodes_in_a_cluster.rst @@ -0,0 +1,23 @@ +:original_name: mrs_03_1090.html + +.. _mrs_03_1090: + +How Do I Improve the Resource Utilization of Core Nodes in a Cluster? +===================================================================== + +#. Go to the Yarn service configuration page. + + - For versions earlier than 1.9.2, + + log in to MRS Manager, choose **Services** > **Yarn** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + + - For MRS 1.9.2 or later, click the cluster name on the MRS console, choose **Components** > **Yarn** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - MRS 3.\ *x* or later: Log in to FusionInsight Manager. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn** > **Configurations** > **All Configurations**. + +#. 
Search for **yarn.nodemanager.resource.memory-mb**, and increase the value based on the actual memory of the cluster nodes. +#. Save the change and restart the affected services or instances. diff --git a/umn/source/faq/performance_tuning/how_do_i_stop_the_firewall_service.rst b/umn/source/faq/performance_tuning/how_do_i_stop_the_firewall_service.rst new file mode 100644 index 0000000..beacd69 --- /dev/null +++ b/umn/source/faq/performance_tuning/how_do_i_stop_the_firewall_service.rst @@ -0,0 +1,16 @@ +:original_name: mrs_03_1072.html + +.. _mrs_03_1072: + +How Do I Stop the Firewall Service? +=================================== + +#. Log in to each node of a cluster as user **root**. + +#. Check whether the firewall service is started. + + For example, to check the firewall status on EulerOS, run the **systemctl status firewalld.service** command. + +#. Stop the firewall service. + + For example, to stop the firewall service on EulerOS, run the **systemctl stop firewalld.service** command. diff --git a/umn/source/faq/performance_tuning/index.rst b/umn/source/faq/performance_tuning/index.rst new file mode 100644 index 0000000..675136d --- /dev/null +++ b/umn/source/faq/performance_tuning/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_03_2008.html + +.. _mrs_03_2008: + +Performance Tuning +================== + +- :ref:`Does an MRS Cluster Support System Reinstallation? ` +- :ref:`Can I Change the OS of an MRS Cluster? ` +- :ref:`How Do I Improve the Resource Utilization of Core Nodes in a Cluster? ` +- :ref:`How Do I Stop the Firewall Service? ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + does_an_mrs_cluster_support_system_reinstallation + can_i_change_the_os_of_an_mrs_cluster + how_do_i_improve_the_resource_utilization_of_core_nodes_in_a_cluster + how_do_i_stop_the_firewall_service diff --git a/umn/source/faq/web_page_access/how_do_i_change_the_session_timeout_duration_for_an_open_source_component_web_ui.rst b/umn/source/faq/web_page_access/how_do_i_change_the_session_timeout_duration_for_an_open_source_component_web_ui.rst new file mode 100644 index 0000000..2cf749a --- /dev/null +++ b/umn/source/faq/web_page_access/how_do_i_change_the_session_timeout_duration_for_an_open_source_component_web_ui.rst @@ -0,0 +1,101 @@ +:original_name: mrs_03_1151.html + +.. _mrs_03_1151: + +How Do I Change the Session Timeout Duration for an Open Source Component Web UI? +================================================================================= + +You need to set a proper web session timeout duration for security purposes. To change the session timeout duration, do as follows: + +Checking Whether the Cluster Supports Session Timeout Duration Adjustment +------------------------------------------------------------------------- + +- For MRS cluster versions earlier than 3.x: + + #. On the cluster details page, choose **Components** > **meta** > **Service Configuration**. + + #. Switch **Basic** to **All**, and search for the **http.server.session.timeout.secs**. + + If **http.server.session.timeout.secs** does not exist, the cluster does not support change of the session timeout duration. If the parameter exists, perform the following steps to modify it. + +- MRS 3.x and later: Log in to FusionInsight Manager and choose **Cluster** > **Services** > **meta**. On the displayed page, click **Configurations** and select **All Configurations**. Search for the **http.server.session.timeout.secs** configuration item. If this configuration item exists, perform the following steps to modify it. 
If the configuration item does not exist, the version does not support dynamic adjustment of the session duration. + +You are advised to set all session timeout durations to the same value. Otherwise, the settings of some parameters may not take effect due to value conflict. + +Modifying the Timeout Duration on Manager and the Authentication Center Page +---------------------------------------------------------------------------- + +**For clusters of versions earlier than MRS 3.\ x:** + +#. Log in to each master node in the cluster and perform :ref:`2 ` to :ref:`4 `. + +#. .. _mrs_03_1151__li11295622161919: + + Change the value of **20** in the **/opt/Bigdata/apache-tomcat-7.0.78/webapps/cas/WEB-INF/web.xml** file. **20** indicates the session timeout duration, in minutes. Change it based on service requirements. The maximum value is 480 minutes. + +#. Change the value of **20** in the **/opt/Bigdata/apache-tomcat-7.0.78/webapps/web/WEB-INF/web.xml** file. **20** indicates the session timeout duration, in minutes. Change it based on service requirements. The maximum value is 480 minutes. + +#. .. _mrs_03_1151__li629512210191: + + Change the values of **p:maxTimeToLiveInSeconds="${tgt.maxTimeToLiveInSeconds:1200}"** and **p:timeToKillInSeconds="${tgt.timeToKillInSeconds:1200}"** in the **/opt/Bigdata/apache-tomcat-7.0.78/webapps/cas/WEB-INF/spring-configuration/ticketExpirationPolicies.xml** file. The maximum value is 28,800 seconds. + +#. Restart the Tomcat node on the active master node. + + a. .. _mrs_03_1151__li11295192217194: + + On the active master node, run the **netstat -anp \|grep 28443 \|grep LISTEN \| awk '{print $7}'** command as user **omm** to query the Tomcat process ID. + + b. Run the **kill -9** *{pid}* command, in which *{pid}* indicates the Tomcat process ID obtained in :ref:`5.a `. + + c. Wait until the process automatically restarts. You can run the **netstat -anp \|grep 28443 \|grep LISTEN** command to check whether the process is successfully restarted. If the process can be queried, the process is successfully restarted. If the process cannot be queried, query the process again later. + +**For clusters of MRS 3.\ x** **or later** + +#. Log in to each master node in the cluster and perform :ref:`2 ` to :ref:`3 ` on each master node. + +#. .. _mrs_03_1151__li388340175410: + + Change the value of **20** in the **/opt/Bigdata/om-server_xxx/apache-tomcat-xxx/webapps/web/WEB-INF/web.xml** file. **20** indicates the session timeout duration, in minutes. Change it based on service requirements. The maximum value is 480 minutes. + +#. .. _mrs_03_1151__li10492102775910: + + Add **ticket.tgt.timeToKillInSeconds=28800** to the **/opt/Bigdata/om-server_xxx/apache-tomcat-8.5.63/webapps/cas/WEB-INF/classes/config/application.properties** file. **ticket.tgt.timeToKillInSeconds** indicates the validity period of the authentication center, in seconds. Change it based on service requirements. The maximum value is 28,800 seconds. + +#. Restart the Tomcat node on the active master node. + + a. .. _mrs_03_1151__li145761437910: + + On the active master node, run the **netstat -anp \|grep 28443 \|grep LISTEN \| awk '{print $7}'** command as user **omm** to query the Tomcat process ID. + + b. Run the **kill -9** *{pid}* command, in which *{pid}* indicates the Tomcat process ID obtained in :ref:`4.a `. + + c. Wait until the process automatically restarts. + + You can run the **netstat -anp \|grep 28443 \|grep LISTEN** command to check whether the process is successfully restarted. 
If the process is displayed, the process is successfully restarted. If the process is not displayed, query the process again later. + +Modifying the Timeout Duration for an Open-Source Component Web UI +------------------------------------------------------------------ + +#. Access the **All Configurations** page. + + - For MRS cluster versions earlier than MRS 3.x: + + On the cluster details page, choose **Components > Meta > Service Configuration**. + + - For MRS cluster version 3.\ *x* or later: + + Log in to FusionInsight Manager and choose **Cluster** > **Services** > **meta**. On the displayed page, click **Configurations** and select **All Configurations**. + +#. Change the value of **http.server.session.timeout.secs** under **meta** as required. The unit is second. + +#. Save the settings, deselect **Restart the affected services or instances**, and click **OK**. + + You are advised to perform the restart during off-peak hours. + +#. (Optional) If you need to use the Spark web UI, search for **spark.session.maxAge** on the **All Configurations** page of Spark and change the value (in seconds). + + Save the settings, deselect **Restart the affected services or instances**, and click **OK**. + +#. Restart the meta service and components on web UI, or restart the cluster during off-peak hours. + + To prevent service interruption, restart the service during off-peak hours or perform a rolling restart. diff --git a/umn/source/faq/web_page_access/index.rst b/umn/source/faq/web_page_access/index.rst new file mode 100644 index 0000000..90cbddb --- /dev/null +++ b/umn/source/faq/web_page_access/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_03_2006.html + +.. _mrs_03_2006: + +Web Page Access +=============== + +- :ref:`How Do I Change the Session Timeout Duration for an Open Source Component Web UI? ` +- :ref:`Why Cannot I Refresh the Dynamic Resource Plan Page on MRS Tenant Tab? ` +- :ref:`What Do I Do If the Kafka Topic Monitoring Tab Is Unavailable on Manager? ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + how_do_i_change_the_session_timeout_duration_for_an_open_source_component_web_ui + why_cannot_i_refresh_the_dynamic_resource_plan_page_on_mrs_tenant_tab + what_do_i_do_if_the_kafka_topic_monitoring_tab_is_unavailable_on_manager diff --git a/umn/source/faq/web_page_access/what_do_i_do_if_the_kafka_topic_monitoring_tab_is_unavailable_on_manager.rst b/umn/source/faq/web_page_access/what_do_i_do_if_the_kafka_topic_monitoring_tab_is_unavailable_on_manager.rst new file mode 100644 index 0000000..d659751 --- /dev/null +++ b/umn/source/faq/web_page_access/what_do_i_do_if_the_kafka_topic_monitoring_tab_is_unavailable_on_manager.rst @@ -0,0 +1,18 @@ +:original_name: mrs_03_1166.html + +.. _mrs_03_1166: + +What Do I Do If the Kafka Topic Monitoring Tab Is Unavailable on Manager? +========================================================================= + +#. Log in to each Master node of the cluster and switch to user **omm**. + +#. Go to the **/opt/Bigdata/apache-tomcat-7.0.78/webapps/web/WEB-INF/lib/components/Kafka/** directory. + +#. Run the **cp /opt/share/zookeeper-3.5.1-mrs-2.0/zookeeper-3.5.1-mrs-2.0.jar ./** command to copy the ZooKeeper package. + +#. Restart the Tomcat process. 
+ + **sh /opt/Bigdata/apache-tomcat-7.0.78/bin/shutdown.sh** + + **sh /opt/Bigdata/apache-tomcat-7.0.78/bin/startup.sh** diff --git a/umn/source/faq/web_page_access/why_cannot_i_refresh_the_dynamic_resource_plan_page_on_mrs_tenant_tab.rst b/umn/source/faq/web_page_access/why_cannot_i_refresh_the_dynamic_resource_plan_page_on_mrs_tenant_tab.rst new file mode 100644 index 0000000..1583e3c --- /dev/null +++ b/umn/source/faq/web_page_access/why_cannot_i_refresh_the_dynamic_resource_plan_page_on_mrs_tenant_tab.rst @@ -0,0 +1,16 @@ +:original_name: mrs_03_1156.html + +.. _mrs_03_1156: + +Why Cannot I Refresh the Dynamic Resource Plan Page on MRS Tenant Tab? +====================================================================== + +#. Log in to the Master1 and Master2 nodes as user **root**. + +#. Run the **ps -ef \|grep aos** command to check the AOS process ID. + +#. Run the **kill -9** *AOS process ID* command to end the AOS process. + +#. Wait until the AOS process is automatically restarted. + + You can run the **ps -ef \|grep aos** command to check whether the AOS process restarts successfully. If the process exists, the restart is successful and the **Dynamic Resource Plan** page will be refreshed. If the process does not exist, retry later. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12001_audit_log_dumping_failure.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12001_audit_log_dumping_failure.rst new file mode 100644 index 0000000..bb605df --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12001_audit_log_dumping_failure.rst @@ -0,0 +1,143 @@ +:original_name: ALM-12001.html + +.. _ALM-12001: + +ALM-12001 Audit Log Dumping Failure +=================================== + +Description +----------- + +Cluster audit logs need to be dumped on a third-party server due to the local historical data backup policy. The system starts to check the dump server at 3 a.m. every day. If the dump server meets the configuration conditions, audit logs can be successfully dumped. This alarm is generated when the audit log dump fails if the disk space of the dump directory on the third-party server is insufficient or a user changes the username, password, or dump directory of the dump server. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12001 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. 
| ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The system can store a maximum of 50 dump files locally. If the fault persists on the dump server, the local audit logs may be lost. + +Possible Causes +--------------- + +- The network connection is abnormal. +- The username, password, or dump directory of the dump server does not meet the configuration conditions. +- The disk space of the dump directory is insufficient. + +Procedure +--------- + +**Check whether the network connection is normal.** + +#. On the FusionInsight Manager home page, choose **Audit > Configurations**. + +#. Check whether the SFTP IP on the dump configuration page is valid. + + Log in to the node where Manager is located as user **root** and run the **ping** command to check whether the network connection between the SFTP server and the cluster is normal. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`3 `. + +#. .. _alm-12001__li64797305153659: + + Repair the network connection, reset the SFTP password, and click **OK**. + +#. Wait for 2 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check whether the username, password, and dump directory are correct.** + +5. .. _alm-12001__li33093593154533: + + On the dump configuration page, check whether the username, password, and dump directory of the third-party server are correct. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`6 `. + +6. .. _alm-12001__li63335387154533: + + Change the username, password, or dump directory, reset the SFTP password, and click **OK**. + +7. Wait for 2 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Check whether the disk space of the dump directory is sufficient.** + +8. .. _alm-12001__li56273719154547: + + Log in to the third-party server as user **root** and run the **df** command to check whether the available disk space of the dump directory on the third-party server exceeds 100 MB. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`9 `. + +9. .. _alm-12001__li61877356154547: + + Expand the disk space of the third-party server, reset the SFTP password, and click **OK**. + +10. Wait for 2 minutes, view real-time alarms, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Reset the dump rule.** + +11. .. _alm-12001__li37575023154554: + + On the FusionInsight Manager home page, choose **Audit > Configurations**. + +12. Reset dump rules, set the parameters properly, and click **OK**. + +13. Wait for 2 minutes, view real-time alarms, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`14 `. + +**Collect fault information.** + +14. .. _alm-12001__li5991045915463: + + On the FusionInsight Manager, choose **O&M** > **Log > Download**. + +15. Select **OmmServer** from the **Service** and click **OK**. + +16. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +17. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269383808.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12004_oldap_resource_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12004_oldap_resource_abnormal.rst new file mode 100644 index 0000000..cd933f5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12004_oldap_resource_abnormal.rst @@ -0,0 +1,99 @@ +:original_name: ALM-12004.html + +.. _ALM-12004: + +ALM-12004 OLdap Resource Abnormal +================================= + +Description +----------- + +The system checks LDAP resources every 60 seconds. This alarm is generated when the system detects that the LDAP resources in Manager are abnormal for six consecutive times. + +This alarm is cleared when the Ldap resource in the Manager recovers and the alarm handling is complete. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12004 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The Manager and component WebUI authentication services are unavailable and cannot provide security authentication and user management functions for web upper-layer services. Users may be unable to log in to the WebUIs of Manager and components. + +Possible Causes +--------------- + +The LdapServer process in the Manager is abnormal. + +Procedure +--------- + +**Check whether the LdapServer process in the Manager is normal.** + +#. Log in the Manager node in the cluster as user **omm**. + + Log in to FusionInsight Manager using the floating IP address, and run the **sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh** command to check the information about the current Manager two-node cluster. + +#. Run **ps -ef \| grep slapd** command to check whether the LdapServer resource process in the **${BIGDATA_HOME}/om-server/om/** in the process configuration file is running properly. + + .. note:: + + You can determine that the resource is normal by checking the following information: + + a. After the **sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh** command runs, **ResHAStatus** of the OLdap is **Normal**. + b. After the **ps -ef \| grep slapd** command runs, the slapd process of port 21750 can be viewed. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +#. .. _alm-12004__l6ef892f9c8f749aa9e6871e1a63797b1: + + Run the **kill -2** *ldap pid* command to restart the LdapServer process and wait for 20 seconds. 
The HA starts the OLdap process automatically. Check whether the current OLdap resource is in normal state. + + - If yes, the operation is complete. + - If no, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-12004__l4b1abbc809ee41c28ade2b2c4cfa6fde: + + On the FusionInsight Manager home page, choose **O&M** > **Log > Download**. + +5. Select **OmsLdapServer** and **OmmServer** from the **Service** and click **OK**. + +6. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +7. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383809.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12005_okerberos_resource_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12005_okerberos_resource_abnormal.rst new file mode 100644 index 0000000..ed9dd7f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12005_okerberos_resource_abnormal.rst @@ -0,0 +1,92 @@ +:original_name: ALM-12005.html + +.. _ALM-12005: + +ALM-12005 OKerberos Resource Abnormal +===================================== + +Description +----------- + +The alarm module checks the status of the Kerberos resource in Manager every 80 seconds. This alarm is generated when the alarm module detects that the Kerberos resources are abnormal for six consecutive times. + +This alarm is cleared when the Kerberos resource recovers and the alarm handling is complete. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12005 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The component WebUI authentication services are unavailable and cannot provide security authentication functions for web upper-layer services. Users may be unable to log in to FusionInsight Manager and the WebUIs of components. + +Possible Causes +--------------- + +The OLdap resource on which the Okerberos depends is abnormal. + +Procedure +--------- + +**Check whether the OLdap resource on which the Okerberos depends is abnormal in the Manager.** + +#. Log in the Manager node in the cluster as user **omm**. 
+ + Log in to FusionInsight Manager using the floating IP address, and run the **sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh** command to check the information about the current Manager two-node cluster. + +#. Run the **sh ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/module/hacom/script/status_ha.sh** command to check whether the OLdap resource status managed by HA is normal. (In single-node mode, the OLdap resource is in the Active_normal state; in the two-node mode, the OLdap resource is in the Active_normal state on the active node and in the Standby_normal state on the standby node.) + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`3 `. + +#. .. _alm-12005__li4031832916486: + + See the procedure in :ref:`ALM-12004 OLdap Resource Abnormal ` to resolve the problem. After the OLdap resource status recovers, check whether the OKerberos resource status is normal. + + - If yes, the operation is complete. + - If no, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-12005__li34421516164820: + + On the FusionInsight Manager home page, choose **O&M** > **Log > Download**. + +5. Select **OmsKerberos** and **OmmServer** from the **Service** and click **OK**. + +6. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +7. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383810.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12006_node_fault.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12006_node_fault.rst new file mode 100644 index 0000000..3dce09b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12006_node_fault.rst @@ -0,0 +1,178 @@ +:original_name: ALM-12006.html + +.. _ALM-12006: + +ALM-12006 Node Fault +==================== + +Description +----------- + +Controller checks the NodeAgent heartbeat every 30 seconds. If Controller does not receive heartbeat messages from a NodeAgent, it attempts to restart the NodeAgent process. This alarm is generated if the NodeAgent fails to be restarted for three consecutive times. + +This alarm is cleared when Controller can properly receive the status report of the NodeAgent. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12006 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Services on the node are unavailable. + +Possible Causes +--------------- + +The network is disconnected, the hardware is faulty, or the operating system runs slowly. + +Procedure +--------- + +**Check whether the network is disconnected, whether the hardware is faulty, or whether the operating system runs commands slowly.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. On the page that is displayed, click |image1| in the row containing the alarm, click the host name, and view the IP address of the host for which the alarm is generated. + +#. Log in to the active management node as user **root**. + +#. Run the **ping** *IP address of the faulty host* command to check whether the faulty node is reachable. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`4 `. + +#. .. _alm-12006__li61437024165028: + + Contact the network administrator to check whether the network is faulty. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-12006__li23885090165028: + + Rectify the network fault and check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +#. .. _alm-12006__li9040006165028: + + Contact the hardware administrator to check whether the hardware (CPU or memory) of the node is faulty. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`12 `. + +#. .. _alm-12006__li15590464165028: + + Repair or replace faulty components and restart the node. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +#. .. _alm-12006__li4828856593250: + + If a large number of node faults are reported in the cluster, the floating IP addresses may be abnormal. As a result, Controller cannot detect the NodeAgent heartbeat. + + Log in to any management node and view the **/var/log/Bigdata/omm/oms/ha/scriptlog/floatip.log** log to check whether the logs generated one to two minutes before and after the faults occur are complete. + + For example, a complete log is in the following format: + + .. code-block:: + + 2017-12-09 04:10:51,000 INFO (floatip) Read from ${BIGDATA_HOME}/om-server_8.1.0.1/om/etc/om/routeSetConf.ini,value is : yes + 2017-12-09 04:10:51,000 INFO (floatip) check wsNetExport : eth0 is up. + 2017-12-09 04:10:51,000 INFO (floatip) check omNetExport : eth0 is up. + 2017-12-09 04:10:51,000 INFO (floatip) check wsInterface : eRth0:oms, wsFloatIp: XXX.XXX.XXX.XXX. + 2017-12-09 04:10:51,000 INFO (floatip) check omInterface : eth0:oms, omFloatIp: XXX.XXX.XXX.XXX. + 2017-12-09 04:10:51,000 INFO (floatip) check wsFloatIp : XXX.XXX.XXX.XXX is reachable. + 2017-12-09 04:10:52,000 INFO (floatip) check omFloatIp : XXX.XXX.XXX.XXX is reachable. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`9 `. + +#. .. _alm-12006__li3216108493510: + + Check whether the omNetExport log is printed after the wsNetExport is detected or whether the interval for printing two logs exceeds 10 seconds or longer. + + - If yes, go to :ref:`10 `. + - If no, go to :ref:`12 `. + +#. .. _alm-12006__li1419227193519: + + View the **/var/log/message** file of the OS to check whether sssd frequently restarts or nscd exception information is displayed when the fault occurs. 
For Red Hat, check sssd information. For SUSE, check nscd information. + + sssd restart example + + .. code-block:: + + Feb 7 11:38:16 10-132-190-105 sssd[pam]: Shutting down + Feb 7 11:38:16 10-132-190-105 sssd[nss]: Shutting down + Feb 7 11:38:16 10-132-190-105 sssd[nss]: Shutting down + Feb 7 11:38:16 10-132-190-105 sssd[be[default]]: Shutting down + Feb 7 11:38:16 10-132-190-105 sssd: Starting up + Feb 7 11:38:16 10-132-190-105 sssd[be[default]]: Starting up + Feb 7 11:38:16 10-132-190-105 sssd[nss]: Starting up + Feb 7 11:38:16 10-132-190-105 sssd[pam]: Starting up + + Example nscd exception information + + .. code-block:: + + Feb 11 11:44:42 10-120-205-33 nscd: nss_ldap: failed to bind to LDAP server ldaps://10.120.205.55:21780: Can't contact LDAP server + Feb 11 11:44:43 10-120-205-33 ntpq: nss_ldap: failed to bind to LDAP server ldaps://10.120.205.55:21780: Can't contact LDAP server + Feb 11 11:44:44 10-120-205-33 ntpq: nss_ldap: failed to bind to LDAP server ldaps://10.120.205.92:21780: Can't contact LDAP server + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`12 `. + +#. .. _alm-12006__li5998962193529: + + Check whether the LdapServer node is faulty, for example, the service IP address is unreachable or the network latency is too high. If the fault occurs periodically, locate and eliminate it and run the **top** command to check whether abnormal software exists. + +**Collect the fault information.** + +12. .. _alm-12006__li6096449165028: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +13. Select the following nodes from **Services** and click **OK**. + + - NodeAgent + - Controller + - OS + +14. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +15. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895827.png +.. |image2| image:: /_static/images/en-us_image_0263895607.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12007_process_fault.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12007_process_fault.rst new file mode 100644 index 0000000..a436f52 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12007_process_fault.rst @@ -0,0 +1,144 @@ +:original_name: ALM-12007.html + +.. _ALM-12007: + +ALM-12007 Process Fault +======================= + +Description +----------- + +This alarm is generated when the process health check module detects that the process connection status is **Bad** for three consecutive times. The process health check module checks the process status every 5 seconds. + +This alarm is cleared when the process can be connected. 
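+
+As a rough manual equivalent of this check, the following sketch polls a local service port every 5 seconds and reports the status as **Bad** after three consecutive failures. It only illustrates the behavior described above and is not the health check module itself; the port number is a placeholder that you would replace with the port of the affected role instance.
+
+.. code-block::
+
+   port=21750      # placeholder: port of the role instance to check
+   fails=0
+   while true; do
+       if ss -lnt | grep -q ":${port} "; then
+           fails=0
+       else
+           fails=$((fails + 1))
+       fi
+       if [ "${fails}" -ge 3 ]; then
+           echo "Process connection status: Bad"
+           break
+       fi
+       sleep 5
+   done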
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12007 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The service provided by the process is unavailable. + +Possible Causes +--------------- + +- The instance process is abnormal. +- The disk space is insufficient. + +.. note:: + + If a large number of process fault alarms exist in a time segment, files in the installation directory may be deleted mistakenly or permission on the directory may be modified. + +Procedure +--------- + +**Check whether the instance process is abnormal.** + +#. .. _alm-12007__li42005517036: + + In the FusionInsight Manager portal, click **O&M > Alarm > Alarms**, click |image1| in the row where the alarm is located , and click the host name to view the host address for which the alarm is generated + +#. On the **Alarms** page, check whether the :ref:`ALM-12006 Node Fault ` is generated. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +#. .. _alm-12007__li20006517036: + + Handle the alarm according to :ref:`ALM-12006 Node Fault `. + +#. .. _alm-12007__li195150317036: + + Log in to the host for which the alarm is generated as user **root**. Check whether the installation directory user, user group, and permission of the alarm role are correct. The user, user group, and the permission must be **omm:ficommon 750**. + + For example, the NameNode installation directory is *${BIGDATA_HOME}*\ **/FusionInsight_Current/**\ *1_8_NameNode*\ **/etc**. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`5 `. + +#. .. _alm-12007__li3247692317036: + + Run the following command to set the permission to **750** and **User:Group** to **omm:ficommon**: + + **chmod 750** ** + + **chown omm:ficommon** ** + +#. .. _alm-12007__li3396349817036: + + Wait for 5 minutes. In the alarm list, check whether **ALM-12007 Process Fault** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Check whether disk space is sufficient.** + +7. .. _alm-12007__li2657388817036: + + On the FusionInsight Manager, check whether the alarm list contains **ALM-12017 Insufficient Disk Capacity**. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`11 `. + +8. .. _alm-12007__li500135217036: + + Rectify the fault by following the steps provided in :ref:`ALM-12017 Insufficient Disk Capacity `. + +9. Wait for 5 minutes. In the alarm list, check whether **ALM-12017 Insufficient Disk Capacity** is cleared. + + - If yes, go to :ref:`10 `. + - If no, go to :ref:`11 `. + +10. .. _alm-12007__li1723673717036: + + Wait for 5 minutes. 
In the alarm list, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Collect fault information.** + +11. .. _alm-12007__li1622379717036: + + On the FusionInsight Manager, choose **O&M** > **Log > Download**. + +12. According to the service name obtained in :ref:`1 `, select the component and **NodeAgent** from the **Service** and click **OK**. + +13. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +14. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001080201158.png +.. |image2| image:: /_static/images/en-us_image_0269383814.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12010_manager_heartbeat_interruption_between_the_active_and_standby_nodes.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12010_manager_heartbeat_interruption_between_the_active_and_standby_nodes.rst new file mode 100644 index 0000000..9e00e6a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12010_manager_heartbeat_interruption_between_the_active_and_standby_nodes.rst @@ -0,0 +1,156 @@ +:original_name: ALM-12010.html + +.. _ALM-12010: + +ALM-12010 Manager Heartbeat Interruption Between the Active and Standby Nodes +============================================================================== + +Description +----------- + +This alarm is generated when the active Manager does not receive the heartbeat signal from the standby Manager within 7 seconds. + +This alarm is cleared when the active Manager receives heartbeat signals from the standby Manager. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12010 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When the active Manager process is abnormal, an active/standby failover cannot be performed, and services are affected. + +Possible Causes +--------------- + +- The link between the active and standby Manager is abnormal. +- The node name configuration is incorrect. +- The port is disabled by the firewall.
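+
+Before working through the procedure below, you can quickly check which of these causes applies. The following commands are only a convenience sketch run on the active Manager node as user **root**; they gather the same information that the procedure checks. The heartbeat IP address placeholder must be replaced with the value shown in the alarm details, and if **${BIGDATA_HOME}** is not set in your shell, replace it with the actual installation path.
+
+.. code-block::
+
+   # Link between the active and standby Manager (ping check in the procedure)
+   ping -c 3 <standby Manager heartbeat IP address>
+
+   # Node name configuration of the local HA module (hacom_local.xml check in the procedure)
+   cat ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/local/hacom/conf/hacom_local.xml
+
+   # Heartbeat port state; no output usually means the port is not listening or is blocked by the firewall
+   lsof -i :20012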
+ +Procedure +--------- + +**Check whether the network between the active and standby Manager server is normal.** + +#. In the FusionInsight Manager portal, click **O&M > Alarm > Alarms**, click |image1| in the row containing the alarm and view the IP address of the standby Manager (Peer Manager) server in the alarm details. + +#. Log in to the active Manager server as user **root**. + +#. Run the **ping** *standby Manager heartbeat IP address* command to check whether the standby Manager server is reachable. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`4 `. + +#. .. _alm-12010__li18651103915205: + + Contact the network administrator to check whether the network is faulty. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-12010__li166511739102017: + + Rectify the network fault and check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +#. .. _alm-12010__li206521339172011: + + Run the following command to go to the software installation directory: + + **cd /opt** + +#. Run the following command to find the configuration file directory of the active and standby nodes: + + **find -name hacom_local.xml** + +#. Run the following command to go to the **workspace** directory: + + **cd ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/local/hacom/conf/** + +#. Run the **vim** command to open the **hacom_local.xml** file. Check whether the local and peer nodes are correctly configured. The local node is configured as the active node, and the peer node is configured as the standby node. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`10 `. + +#. .. _alm-12010__li18655163992011: + + Modify the configuration of the active and standby nodes in the **hacom_local.xml** file and press **Esc** to return to the command mode. Run the **:wq** command to save the modification and exit. + +#. Check whether the alarm is cleared automatically. + + - If yes, no further action is required. + - If no, go to :ref:`12 `. + +**Check whether the port is disabled by the firewall.** + +12. .. _alm-12010__li56481639112012: + + Run the **lsof -i :20012** command to check whether the heartbeat ports of the active and standby nodes are enabled. If the command output is displayed, the ports are enabled. Otherwise, the ports are disabled by the firewall. + + - If yes, go to :ref:`13 `. + - If no, go to :ref:`16 `. + +13. .. _alm-12010__li8648153982010: + + Run the **iptables -P INPUT ACCEPT** command to prevent the server from being disconnected. + +14. Run the following command to clear the firewall rules: + + **iptables -F** + +15. Check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`16 `. + +**Collect fault information.** + +16. .. _alm-12010__li41244883171443: + + On the FusionInsight Manager, choose **O&M** > **Log > Download**. + +17. Select the following nodes from the **Service** and click **OK**: + + - OmmServer + - Controller + - NodeAgent + +18. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +19. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383815.png +.. 
|image2| image:: /_static/images/en-us_image_0269383816.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12011_manager_data_synchronization_exception_between_the_active_and_standby_nodes.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12011_manager_data_synchronization_exception_between_the_active_and_standby_nodes.rst new file mode 100644 index 0000000..27dd1cd --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12011_manager_data_synchronization_exception_between_the_active_and_standby_nodes.rst @@ -0,0 +1,196 @@ +:original_name: ALM-12011.html + +.. _ALM-12011: + +ALM-12011 Manager Data Synchronization Exception Between the Active and Standby Nodes +====================================================================================== + +Description +----------- + +The system checks data synchronization between the active and standby Manager nodes every 60 seconds. This alarm is generated when the standby Manager fails to synchronize files with the active Manager. + +This alarm is cleared when the standby Manager synchronizes files with the active Manager. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12011 Critical Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Some configurations will be lost after an active/standby switchover because the configuration files on the standby Manager are not updated. As a result, Manager and some components may not run properly. + +Possible Causes +--------------- + +- The link between the active and standby Managers is interrupted, or the storage space of the **/srv/BigData/LocalBackup** directory is full. +- The synchronization file does not exist or the file permission is incorrect. + +Procedure +--------- + +**Check whether the network between the active Manager server and the standby Manager server is normal.** + +#. In the FusionInsight Manager portal, click **O&M > Alarm > Alarms**, click |image1| in the row where the alarm is located and obtain the standby Manager server IP address (Peer Manager IP address) in the alarm details. + +#. Log in to the active Manager server as user **root**. + +#. Run the **ping** *standby Manager IP address* command to check whether the standby Manager server is reachable. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`4 `. + +#. .. _alm-12011__li3033024171750: + + Contact the network administrator to check whether the network is faulty.
+ + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-12011__li52745930171750: + + Rectify the network fault and check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Check whether the storage space of the** **/srv/BigData/LocalBackup** **directory is full.** + +6. .. _alm-12011__li983315367129: + + Run the following command to check whether the storage space of the **/srv/BigData/LocalBackup** directory is full: + + **df -hl /srv/BigData/LocalBackup** + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`10 `. + +7. .. _alm-12011__li11402194014150: + + Run the following command to clear unnecessary backup files: + + **rm -rf** *Directory to be cleared* + + Example: + + **rm -rf** **/srv/BigData/LocalBackup/0/default-oms_20191211143443** + +8. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + + In the **Operation** column of the backup task to be performed, click **Configure** and change the value of **Maximum Number of Backup Copies** to reduce the number of backup file sets. + +9. Wait about 1 minute and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`10 `. + +**Check whether the synchronization file exists and whether the file permission is normal.** + +10. .. _alm-12011__li1826164817917: + + Run the following command to check whether the synchronization file exists. + + **find /srv/BigData/ -name "sed*"** + + **find /opt -name "sed*"** + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`12 `. + +11. .. _alm-12011__li1926214814915: + + Run the following command to view the synchronization file information and permission obtained in :ref:`10 `. + + **ll** *path of the file to be found* + + - If the size of the file is 0 and the permission column is **-**, the file is a junk file. Run the following command to delete it. + + **rm -rf** *files to be deleted* + + Wait for several minutes and check whether the alarm is cleared. If the alarm persists, go to :ref:`12 `. + + - If the file size is not 0, go to :ref:`12 `. + +12. .. _alm-12011__li192637482095: + + View the log files generated when the alarm is generated. + + a. Run the following command to switch to the HA run log file path. + + **cd /var/log/Bigdata/omm/oms/ha/runlog**/ + + b. Decompress and view the log files generated when the alarm is generated. + + For example, if the name of the file to be viewed is **ha.log.2021-03-22_12-00-07.gz**, run the following command: + + **gunzip** *ha.log.2021-03-22_12-00-07.gz* + + **vi** *ha.log.2021-03-22_12-00-07* + + Check whether error information is reported before and after the alarm generation time. + + - If yes, rectify the fault based on the error information. Then go to :ref:`13 `. + + For example, if the following error information is displayed, the directory permission is insufficient. In this case, change the directory permission to be the same as that on the normal node. + + |image2| + + - If no, go to :ref:`14 `. + +13. .. _alm-12011__li985632952514: + + Wait about 10 minute and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`14 `. + +**Collect fault information.** + +14. .. _alm-12011__li65512922171750: + + On the FusionInsight Manager, choose **O&M** > **Log > Download**. + +15. Select the following nodes from the **Service** and click **OK**: + + - OmmServer + - Controller + - NodeAgent + +16. 
Click |image3| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +17. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383817.png +.. |image2| image:: /_static/images/en-us_image_0000001271157721.png +.. |image3| image:: /_static/images/en-us_image_0269383818.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12014_partition_lost.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12014_partition_lost.rst new file mode 100644 index 0000000..6147693 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12014_partition_lost.rst @@ -0,0 +1,112 @@ +:original_name: ALM-12014.html + +.. _ALM-12014: + +ALM-12014 Partition Lost +======================== + +Description +----------- + +The system checks the partition status every 60 seconds. This alarm is generated when the system detects that a partition to which service directories are mounted is lost (because the device is removed or goes offline, or the partition is deleted). The system checks the partition status periodically. + +This alarm must be manually cleared. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12014 Major No +======== ============== ========== + +Parameters +---------- + ++---------------+-------------------------------------------------------------------+ +| Name | Meaning | ++===============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ +| DirName | Specifies the directory for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ +| PartitionName | Specifies the device partition for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Service data fails to be written into the partition, and the service system runs abnormally. + +Possible Causes +--------------- + +- The hard disk is removed. +- The hard disk is offline, or a bad sector exists on the hard disk. + +Procedure +--------- + +#. On FusionInsight Manager, click **O&M > Alarm > Alarms**, and click |image1| in the row where the alarm is located. + +#. Obtain **HostName**, **PartitionName** and **DirName** from **Location**. + +#. 
Check whether the disk of **PartitionName** on **HostName** is inserted into the correct server slot.
+
+   - If yes, go to :ref:`4 `.
+   - If no, go to :ref:`5 `.
+
+#. .. _alm-12014__li9631929173421:
+
+   Contact hardware engineers to remove the faulty disk.
+
+#. .. _alm-12014__li18162941173421:
+
+   Log in to the **HostName** node for which the alarm is reported as user **root** and check whether the **/etc/fstab** file contains a line with **DirName**.
+
+   - If yes, go to :ref:`6 `.
+   - If no, go to :ref:`7 `.
+
+#. .. _alm-12014__li20338192173421:
+
+   Run the **vi /etc/fstab** command to edit the file and delete the line containing **DirName**.
+
+#. .. _alm-12014__li48826004173421:
+
+   Contact hardware engineers to insert a new disk. For details, see the hardware product document of the relevant model. If the faulty disk is in a RAID group, configure the RAID group. For details, see the configuration methods of the relevant RAID controller card.
+
+#. Wait 20 to 30 minutes (the waiting time depends on the disk size), and then run the **mount** command to check whether the disk has been mounted to the **DirName** directory.
+
+   - If yes, manually clear the alarm. No further operation is required.
+   - If no, go to :ref:`9 `.
+
+**Collect fault information.**
+
+9. .. _alm-12014__li1607193817587:
+
+   On FusionInsight Manager, choose **O&M** > **Log > Download**.
+
+10. Select **OmmServer** from the **Service** drop-down list and click **OK**.
+
+11. Set **Start Date** for log collection to 10 minutes ahead of the alarm generation time and **End Date** to 10 minutes after the alarm generation time, and then click **Download**.
+
+12. Contact the O&M personnel and send the collected log information.
+
+Alarm Clearing
+--------------
+
+After the fault is rectified, the system does not automatically clear this alarm, and you need to clear it manually.
+
+Related Information
+-------------------
+
+None
+
+.. |image1| image:: /_static/images/en-us_image_0269383822.png
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12015_partition_filesystem_readonly.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12015_partition_filesystem_readonly.rst
new file mode 100644
index 0000000..646f348
--- /dev/null
+++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12015_partition_filesystem_readonly.rst
@@ -0,0 +1,71 @@
+:original_name: ALM-12015.html
+
+.. _ALM-12015:
+
+ALM-12015 Partition Filesystem Readonly
+=======================================
+
+Description
+-----------
+
+The system checks the partition status every 60 seconds. This alarm is generated when the system detects that a partition to which service directories are mounted enters the read-only mode (due to a bad sector or a faulty file system).
+
+This alarm is cleared when the system detects that the partition to which service directories are mounted exits from the read-only mode (because the file system is restored to read/write mode, the device is removed, or the device is formatted). 
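+
+For reference, whether a specific partition is currently mounted read-only can also be confirmed manually on the affected node, as shown in the following sketch. The mount point **/srv/BigData/data1** is only a placeholder for the directory reported in **DirName** in the alarm details, not a fixed value.
+
+.. code-block::
+
+   # List every partition that is currently mounted read-only.
+   grep " ro[ ,]" /proc/mounts
+
+   # Check the mount options of one specific mount point (placeholder path).
+   findmnt -no OPTIONS /srv/BigData/data1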
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12015 Major Yes +======== ============== ========== + +Parameters +---------- + ++---------------+-------------------------------------------------------------------+ +| Name | Meaning | ++===============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ +| DirName | Specifies the directory for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ +| PartitionName | Specifies the device partition for which the alarm is generated. | ++---------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Service data fails to be written into the partition, and the service system runs abnormally. + +Possible Causes +--------------- + +The hard disk is faulty, for example, a bad sector exists. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Alarm > Alarms**, click |image1| in the row where the alarm is located. +#. Obtain **HostName** and **PartitionName** from **Location**. **HostName** is the node where the alarm is reported, and **PartitionName** is the partition of the faulty disk. +#. Contact hardware engineers to check whether the disk is faulty. If the disk is faulty, remove it from the server. +#. After the disk is removed, alarm **ALM-12014 Partition Lost** is reported. Handle the alarm. For details, see :ref:`ALM-12014 Partition Lost `. After the alarm **ALM-12014 Partition Lost** is cleared, alarm **ALM-12015 Partition Filesystem Readonly** is automatically cleared. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383823.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12016_cpu_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12016_cpu_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..93f1c45 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12016_cpu_usage_exceeds_the_threshold.rst @@ -0,0 +1,123 @@ +:original_name: ALM-12016.html + +.. _ALM-12016: + +ALM-12016 CPU Usage Exceeds the Threshold +========================================= + +Description +----------- + +The system checks the CPU usage every 30 seconds and compares the actual CPU usage with the threshold. The CPU usage has a default threshold. 
This alarm is generated when the CPU usage exceeds the threshold for several times (configurable, 10 times by default) consecutively. + +The alarm is cleared in the following two scenarios: The value of **Trigger Count** is 1 and the CPU usage is smaller than or equal to the threshold; the value of **Trigger Count** is greater than 1 and the CPU usage is smaller than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12016 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Service processes respond slowly or become unavailable. + +Possible Causes +--------------- + +- The alarm threshold or alarm smoothing times are incorrect. +- CPU configuration cannot meet service requirements. The CPU usage reaches the upper limit. + +Procedure +--------- + +**Check whether the alarm threshold or alarm Trigger Count are correct.** + +#. Change the alarm threshold and alarm **Trigger Count** based on CPU usage. + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Thresholds >** *Name of the desired cluster* > **Host** > **CPU** > **Host CPU Usage** and change the alarm smoothing times based on CPU usage, as shown in :ref:`Figure 1 `. + + .. note:: + + This option defines the alarm check phase. **Trigger Count** indicates the alarm check threshold. An alarm is generated when the number of check times exceeds the threshold. + + .. _alm-12016__fig42676420173938: + + .. figure:: /_static/images/en-us_image_0269383824.png + :alt: **Figure 1** Setting alarm smoothing times + + **Figure 1** Setting alarm smoothing times + + On **Host CPU Usage** page and click **Modify** in the **Operation** column to change the alarm threshold, as shown in :ref:`Figure 2 `. + + .. _alm-12016__fig30961038173938: + + .. 
figure:: /_static/images/en-us_image_0000001440977805.png + :alt: **Figure 2** Setting an alarm threshold + + **Figure 2** Setting an alarm threshold + +#. After 2 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +**Check whether the CPU usage reaches the upper limit.** + +3. .. _alm-12016__li65266749173938: + + In the alarm list on FusionInsight Manager, click |image1| in the row where the alarm is located to view the alarm host address in the alarm details. + +4. On the **Hosts** page, click the node on which the alarm is reported. + +5. View the CPU usage for 5 minutes. If the CPU usage exceeds the threshold for multiple times, contact the system administrator to add more CPUs. + +6. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-12016__li35735451173938: + + On the FusionInsight Manager in the active cluster, choose **O&M** > **Log > Download**. + +8. Select **OmmServer** from the **Service** and click **OK**. + +9. Set **Start Date** for log collection to 10 minutes ahead of the alarm generation time and **End Date** to 10 minutes behind the alarm generation time in **Time Range** and click **Download**. + +10. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383826.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12017_insufficient_disk_capacity.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12017_insufficient_disk_capacity.rst new file mode 100644 index 0000000..c95adf4 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12017_insufficient_disk_capacity.rst @@ -0,0 +1,157 @@ +:original_name: ALM-12017.html + +.. _ALM-12017: + +ALM-12017 Insufficient Disk Capacity +==================================== + +Description +----------- + +The system checks the host disk usage of the system every 30 seconds and compares the actual disk usage with the threshold. The disk usage has a default threshold, this alarm is generated when the host disk usage exceeds the specified threshold. + +When the **Trigger Count** is 1, this alarm is cleared when the usage of a host disk partition is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the usage of a host disk partition is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12017 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| PartitionName | Specifies the device partition for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Service processes become unavailable. + +Possible Causes +--------------- + +- The alarm threshold is incorrect. +- Disk configuration of the server cannot meet service requirements. + +Procedure +--------- + +**Check whether the alarm threshold is appropriate.** + +#. Log in to FusionInsight Manager, choose **O&M** > **Alarm >** **Thresholds** **>** *Name of the desired cluster* > **Host** > **Disk** > **Disk Usage** and check whether the threshold (configurable, 90% by default) is appropriate. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`4 `. + +#. .. _alm-12017__li1280611085745: + + Choose **O&M** > **Alarm >** **Thresholds** **>** *Name of the desired cluster* > **Host** > **Disk** > **Disk Usage** and click **Modify** in the **Operation** column to change the alarm threshold based on site requirements. As shown in :ref:`Figure 1 `: + + .. _alm-12017__fig6063892885745: + + .. figure:: /_static/images/en-us_image_0000001440977873.png + :alt: **Figure 1** Setting an alarm threshold + + **Figure 1** Setting an alarm threshold + +#. After 2 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the disk usage reaches the upper limit.** + +4. .. _alm-12017__li2782670585745: + + In the alarm list on FusionInsight Manager, click |image1| in the row where the alarm is located to view the alarm host name and disk partition information in the alarm details. + +5. Log in to the node where the alarm is generated as user **root**. + +6. Run the **df -lmPT \| awk '$2 != "iso9660"' \| grep '^/dev/' \| awk '{"readlink -m "$1 \| getline real }{$1=real; print $0}' \| sort -u -k 1,1** command to check the system disk partition usage. Check whether the disk is mounted to the following directories based on the disk partition name obtained in :ref:`4 `: **/**, **/opt**, **/tmp**, **/var**, **/var/log**, and **/srv/BigData**\ (can be customized). + + - If yes, the disk is a system disk. Then go to :ref:`10 `. + - If no, the disk is not a system disk. Then go to :ref:`7 `. + +7. .. 
_alm-12017__li1190839985745:
+
+   Run the **df -lmPT \| awk '$2 != "iso9660"' \| grep '^/dev/' \| awk '{"readlink -m "$1 \| getline real }{$1=real; print $0}' \| sort -u -k 1,1** command to check the system disk partition usage. Determine the role of the disk based on the disk partition name obtained in :ref:`4 `.
+
+8. Check the service that uses the disk.
+
+   In MRS, check whether the service that uses the disk is HDFS, Yarn, Kafka, or Supervisor.
+
+   - If yes, adjust the capacity. Then go to :ref:`9 `.
+   - If no, go to :ref:`12 `.
+
+9. .. _alm-12017__li1354951085745:
+
+   After 2 minutes, check whether the alarm is cleared.
+
+   - If yes, no further action is required.
+   - If no, go to :ref:`12 `.
+
+10. .. _alm-12017__li6170195385745:
+
+    Run the **find / -xdev -size +500M -exec ls -l {} \\;** command to check whether a file larger than 500 MB exists on the node and disk.
+
+    - If yes, go to :ref:`11 `.
+    - If no, go to :ref:`12 `.
+
+11. .. _alm-12017__li3133628885745:
+
+    Handle the large file and check whether the alarm is cleared 2 minutes later.
+
+    - If yes, no further action is required.
+    - If no, go to :ref:`12 `.
+
+12. .. _alm-12017__li1359113885745:
+
+    Contact the system administrator to expand the disk capacity.
+
+13. After 2 minutes, check whether the alarm is cleared.
+
+    - If yes, no further action is required.
+    - If no, go to :ref:`14 `.
+
+**Collect fault information.**
+
+14. .. _alm-12017__li5603307085745:
+
+    On FusionInsight Manager, choose **O&M** > **Log > Download**.
+
+15. Select **OMS** from **Service** and click **OK**.
+
+16. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**.
+
+17. Contact the O&M personnel and send the collected log information.
+
+Alarm Clearing
+--------------
+
+After the fault is rectified, the system automatically clears this alarm.
+
+Related Information
+-------------------
+
+None
+
+.. |image1| image:: /_static/images/en-us_image_0269383828.png
+.. |image2| image:: /_static/images/en-us_image_0269383829.png
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12018_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12018_memory_usage_exceeds_the_threshold.rst
new file mode 100644
index 0000000..daa5a9f
--- /dev/null
+++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12018_memory_usage_exceeds_the_threshold.rst
@@ -0,0 +1,117 @@
+:original_name: ALM-12018.html
+
+.. _ALM-12018:
+
+ALM-12018 Memory Usage Exceeds the Threshold
+============================================
+
+Description
+-----------
+
+The system checks the memory usage every 30 seconds and compares the actual memory usage with the threshold. The memory usage has a default threshold. This alarm is generated when the memory usage exceeds the threshold.
+
+When the **Trigger Count** is 1, this alarm is cleared when the host memory usage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the host memory usage is less than or equal to 90% of the threshold. 
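+
+The clearing rule can be illustrated with the following sketch. The threshold and **Trigger Count** values below are assumptions chosen for illustration only; the actual check is performed internally by FusionInsight Manager.
+
+.. code-block::
+
+   threshold=90        # assumed alarm threshold, in percent
+   trigger_count=3     # assumed Trigger Count setting
+   # Current memory usage in percent (rounded down).
+   usage=$(free | awk '/^Mem:/ {printf "%d", $3 * 100 / $2}')
+   if [ "$trigger_count" -eq 1 ]; then limit=$threshold; else limit=$((threshold * 90 / 100)); fi
+   if [ "$usage" -le "$limit" ]; then echo "usage ${usage}%: alarm cleared"; else echo "usage ${usage}%: alarm remains"; fi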
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12018 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Service processes respond slowly or become unavailable. + +Possible Causes +--------------- + +- Memory configuration cannot meet service requirements. The memory usage reaches the upper limit. +- The SUSE 12.X OS has an earlier **free** command. The calculated memory usage cannot reflect the real-world memory usage. + +Procedure +--------- + +**Perform the following operations if SUSE 12.X is used.** + +#. Log in to any node in the cluster as user **root**, and run the **cat /etc/*-release** command to check whether the OS is SUSE 12.X as user **root**. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`4 `. + +#. .. _alm-12018__li348492949252: + + Run the **cat /proc/meminfo \| grep Mem** command to check the real-world memory usage of the OS. + + .. code-block:: + + MemTotal: 263576192 kB + MemFree: 198283116 kB + MemAvailable: 227641452 kB + +#. Calculate the real-world memory usage: Memory usage = 1 - (Memory available/Memory total) + + - If the memory usage is lower than 90%, manually disable transferring from monitoring indicators to alarms. + - If the memory usage is higher than 90%, go to :ref:`4 `. + +**Expand the system.** + +4. .. _alm-12018__li5861159252: + + In the alarm list on FusionInsight Manager, click |image1| in the row where the alarm is located to view the alarm host address in the alarm details. + +5. Log in to the host where the alarm is generated as user **root**. + +6. If the memory usage exceeds the threshold, perform memory capacity expansion. + +7. Run the command **free -m \| grep Mem\\: \| awk '{printf("%s,", $3 \* 100 / $2)}'** to check the system memory usage. + +8. Wait for 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. 
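+
+For reference, the calculation described in step 3 can be scripted as follows. This is only a sketch based on the **/proc/meminfo** fields shown above and is not part of the official handling procedure.
+
+.. code-block::
+
+   # Memory usage = 1 - (MemAvailable / MemTotal), expressed as a percentage.
+   awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "Memory usage: %.1f%%\n", (1 - a/t) * 100}' /proc/meminfo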
+ +**Collect fault information.** + +9. .. _alm-12018__li372014939252: + + On the FusionInsight Manager in the active cluster, choose **O&M** > **Log > Download**. + +10. Select **OmmServer** from the **Servic**\ e and click **OK**. + +11. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383830.png +.. |image2| image:: /_static/images/en-us_image_0269383831.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12027_host_pid_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12027_host_pid_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..ea8570a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12027_host_pid_usage_exceeds_the_threshold.rst @@ -0,0 +1,101 @@ +:original_name: ALM-12027.html + +.. _ALM-12027: + +ALM-12027 Host PID Usage Exceeds the Threshold +============================================== + +Description +----------- + +The system checks the PID usage every 30 seconds and compares the actual PID usage with the default PID usage threshold. This alarm is generated when the system detects that the PID usage exceeds the threshold. + +When the **Trigger Count** is 1, this alarm is cleared when the PID usage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the PID usage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12027 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. 
If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +No PID is available for new processes and service processes are unavailable. + +Possible Causes +--------------- + +Too many processes are running on the node. You need to increase the value of **pid_max**. + +Procedure +--------- + +**Increase the value of pid_max.** + +#. In the alarm list on FusionInsight Manager, click |image1| in the row where the alarm is located to view the alarm host address in the alarm details. + +#. Log in to the host where the alarm is generated as user **root**. + +#. Run the **cat /proc/sys/kernel/pid_max**\ command to check the value of **pid_max**. + +#. If the PID usage exceeds the threshold, run the command **echo** *new value* **> /proc/sys/kernel/pid_max** to enlarge the value of **pid_max**. + + Example: **echo 65536 > /proc/sys/kernel/pid_max** + + .. note:: + + The maximum value of **pid_max** is as follows: + + - On 32-bit systems: 32768 + - On 64-bit systems: 4194304 (2^22) + +#. Wait for 5 minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-12027__li377225729750: + + On the FusionInsight Manager home page of the active cluster, choose **O&M** > **Log > Download**. + +7. Select all services from the **Service** and click **OK**. + +8. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383832.png +.. |image2| image:: /_static/images/en-us_image_0269383834.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12028_number_of_processes_in_the_d_state_on_a_host_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12028_number_of_processes_in_the_d_state_on_a_host_exceeds_the_threshold.rst new file mode 100644 index 0000000..57b8add --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12028_number_of_processes_in_the_d_state_on_a_host_exceeds_the_threshold.rst @@ -0,0 +1,103 @@ +:original_name: ALM-12028.html + +.. _ALM-12028: + +ALM-12028 Number of Processes in the D State on a Host Exceeds the Threshold +============================================================================ + +Description +----------- + +The system checks the number of processes in the D state of user **omm** on the host every 30 seconds and compares the actual number with the threshold. The number of processes in the D state on the host has a default threshold range. This alarm is generated when the number of processes exceeds the threshold. + +This alarm is cleared when the **Trigger Count** is **1** and the total number of processes in the D state of user **omm** on the host does not exceed the threshold. 
This alarm is cleared when the **Trigger Count** is greater than **1** and the total number of processes in the D state of user **omm** on the host is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12028 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------+ +| Name | Meaning | ++===================+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Excessive system resources are used and service processes respond slowly. + +Possible Causes +--------------- + +The host responds slowly to I/O (disk I/O and network I/O) requests and some processes are in the D state and Z state. + +Procedure +--------- + +**Check the processes in the D state.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and click |image1| to view the IP address of the host for which the alarm is generated. + +#. Log in to the host for which the alarm is generated as user **root**. () Then run the **su - omm** command to switch to user **omm**. + +#. Run the following command as user **omm** to view the PID of the process that is in the D state: + + **ps -elf \| grep -v "\\[thread_checkio\\]" \| awk 'NR!=1 {print $2, $3, $4}' \| grep omm \| awk -F' ' '{print $1, $3}' \| grep -E "Z|D" \| awk '{print $2}'** + +#. Check whether the command output is empty. + + - If yes, the service process is running properly. Then go to :ref:`6 `. + - If no, go to :ref:`5 `. + +#. .. _alm-12028__li573000391049: + + Switch to user **root** and run the **reboot** command to restart the host for which the alarm is generated. (Restarting a host is risky. Ensure that the service process is normal after the restart.) + +#. .. _alm-12028__li2701143291049: + + Check whether the alarm is cleared 5 minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect the fault information.** + +7. .. _alm-12028__li4177630091049: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +8. Select **OMS** for **Service** and click **OK**. + +9. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. 
+ +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895749.png +.. |image2| image:: /_static/images/en-us_image_0263895796.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12033_slow_disk_fault.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12033_slow_disk_fault.rst new file mode 100644 index 0000000..b2fd158 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12033_slow_disk_fault.rst @@ -0,0 +1,305 @@ +:original_name: ALM-12033.html + +.. _ALM-12033: + +ALM-12033 Slow Disk Fault +========================= + +Description +----------- + +- For HDDs, the alarm is triggered when any of the following conditions is met: + + - The system runs the **iostat** command every 3 seconds, and detects that the **svctm** value exceeds 1000 ms for 10 consecutive periods within 30 seconds. + - The system runs the **iostat** command every 3 seconds, and detects that more than 60% of I/O exceeds 150 ms within 300 seconds. + +- For SSDs, the alarm is triggered when any of the following conditions is met: + + - The system runs the **iostat** command every 3 seconds, and detects that the **svctm** value exceeds 1000 ms for 10 consecutive periods within 30 seconds. + - The system runs the **iostat** command every 3 seconds, and detects that more than 60% of I/O exceeds 20 ms within 300 seconds. + +This alarm is automatically cleared when the preceding conditions have not been met for 15 minutes. + +.. note:: + + The formula for calculating **svctm** is as follows: + + svctm = (tot_ticks_new - tot_ticks_old)/(rd_ios_new + wr_ios_new - rd_ios_old - wr_ios_old) + + If **rd_ios_new + wr_ios_new - rd_ios_old - wr_ios_old** is **0**, then **svctm** is **0**. + + The parameters can be obtained as follows: + + The system runs the **cat /proc/diskstats** command every 3 seconds to collect data. For example: + + |image1| + + In these two commands: + + In the data collected for the first time, the number in the fourth column is the **rd_ios_old** value, the number in the eighth column is the **wr_ios_old** value, and the number in the thirteenth column is the **tot_ticks_old** value. + + In the data collected for the second time, the number in the fourth column is the **rd_ios_new** value, the number in the eighth column is the **wr_ios_new** value, and the number in the thirteenth column is the **tot_ticks_new** value. + + In this case, the value of **svctm** is as follows: + + (19571460 - 19569526)/(1101553 + 28747977 - 1101553 - 28744856) = 0.6197 + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12033 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| DiskName | Specifies the disk for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Service performance deteriorates, service processing capabilities become poor, and services may be unavailable. + +Possible Causes +--------------- + +The disk is aged or has bad sectors. + +Procedure +--------- + +**Check the disk status.** + +#. On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. + +#. .. _alm-12033__li3788291791458: + + View the detailed information about the alarm. Check the values of **HostName** and **DiskName** in the location information to obtain the information about the faulty disk for which the alarm is generated. + +#. Check whether the node for which the alarm is generated is in a virtualization environment. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-12033__li2831628891458: + + Check whether the storage performance provided by the virtualization environment meets the hardware requirements. Then, go to :ref:`5 `. + +#. .. _alm-12033__li1205527419227: + + Log in to the alarm node as user **root**, run the **df -h** command, and check whether the command output contains the value of the **DiskName** field. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`6 `. + +#. .. _alm-12033__li2325719119312: + + Run the **lsblk** command to check whether the mapping between the value of **DiskName** and the disk has been created. + + |image2| + + - If yes, go to :ref:`7 `. . + - If no, go to :ref:`22 `. + +#. .. _alm-12033__li2583597491458: + + Log in to the alarm node as user **root**, run the **lsscsi \| grep "/dev/sd[x]"** command to view the disk information, and check whether RAID has been set up. + + .. note:: + + In the command, **/dev/sd[x]** indicates the disk name obtained in :ref:`2 `. + + Example: + + **lsscsi \| grep "/dev/sda"** + + In the command output, if **ATA**, **SATA**, or **SAS** is displayed in the third line, the disk has not been organized into a RAID group. If other information is displayed, RAID has been set up. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`8 `. + +#. .. _alm-12033__li523387391458: + + Run the **smartctl -i /dev/sd[x]** command to check whether the hardware supports the SMART tool. + + Example: + + **smartctl -i /dev/sda** + + In the command output, if "SMART support is: Enabled" is displayed, the hardware supports SMART. If "Device does not support SMART" or other information is displayed, the hardware does not support SMART. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`17 `. + +#. .. _alm-12033__li3483730991458: + + Run the **smartctl -H --all /dev/sd[x]** command to check basic SMART information and determine whether the disk is working properly. + + Example: + + **smartctl -H --all /dev/sda** + + Check the value of **SMART overall-health self-assessment test result** in the command output. If the value is **FAILED**, the disk is faulty and needs to be replaced. If the value is **PASSED**, check the value of **Reallocated_Sector_Ct** or **Elements in grown defect list**. If the value is greater than 100, the disk is faulty and needs to be replaced. + + - If yes, go to :ref:`10 `. 
+ - If no, go to :ref:`18 `. + +#. .. _alm-12033__li1145378391458: + + Run the **smartctl -l error -H /dev/sd[x]** command to check the Glist of the disk and determine whether the disk is normal. + + Example: + + **smartctl -l error -H /dev/sda** + + Check the **Command/Feature_name** column in the command output. If **READ SECTOR(S)** or **WRITE SECTOR(S)** is displayed, the disk has bad sectors. If other errors occur, the disk circuit board is faulty. Both errors indicate that the disk is abnormal and needs to be replaced. + + If "No Errors Logged" is displayed, no error log exists. You can perform step 9 to trigger the disk SMART self-check. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`18 `. + +#. .. _alm-12033__li2167780691458: + + Run the **smartctl -t long /dev/sd[x]** command to trigger the disk SMART self-check. After the command is executed, the time when the self-check is to be completed is displayed. After the self-check is completed, repeat :ref:`9 ` and :ref:`10 ` to check whether the disk is working properly. + + Example: + + **smartctl -t long /dev/sda** + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`18 `. + +#. .. _alm-12033__li1471607091458: + + Run the **smartctl -d [sat|scsi]+megaraid,[DID] -H --all /dev/sd[x]** command to check whether the hardware supports SMART. + + .. note:: + + - In the command, **[sat|scsi]** indicates the disk type. Both types need to be used. + - **[DID]** indicates the slot information. Slots 0 to 15 need to be used. + + For example, run the following commands in sequence: + + **smartctl -d sat+megaraid,0 -H --all /dev/sda** + + **smartctl -d sat+megaraid,1 -H --all /dev/sda** + + **smartctl -d sat+megaraid,2 -H --all /dev/sda** + + ... + + Try the command combinations of different disk types and slot information. If "SMART support is: Enabled" is displayed in the command output, the disk supports SMART. Record the parameters of the disk type and slot information when a command is successfully executed. If "SMART support is: Enabled" is not displayed in the command output, the disk does not support SMART. + + - If yes, go to :ref:`13 `. + - If no, go to :ref:`16 `. + +#. .. _alm-12033__li4568369291458: + + Run the **smartctl -d [sat|scsi]+megaraid,[DID] -H --all /dev/sd[x]** command recorded in :ref:`12 ` to check basic SMART information and determine whether the disk is normal. + + Example: + + **smartctl -d sat+megaraid,2 -H --all /dev/sda** + + Check the value of **SMART overall-health self-assessment test result** in the command output. If the value is **FAILED**, the disk is faulty and needs to be replaced. If the value is **PASSED**, check the value of **Reallocated_Sector_Ct** or **Elements in grown defect list**. If the value is greater than 100, the disk is faulty and needs to be replaced. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`18 `. + +#. .. _alm-12033__li5027541391458: + + Run the **smartctl -d [sat|scsi]+megaraid,[DID] -l error -H /dev/sd[x]** command to check the Glist of the disk and determine whether the hard disk is working properly. + + Example: + + **smartctl -d sat+megaraid,2 -l error -H /dev/sda** + + Check the **Command/Featrue_name** column in the command output. If **READ SECTOR(S)** or **WRITE SECTOR(S)** is displayed, the disk has bad sectors. If other errors occur, the disk circuit board is faulty. Both errors indicate that the disk is abnormal and needs to be replaced. + + If "No Errors Logged" is displayed, no error log exists. 
You can perform step 9 to trigger the disk SMART self-check. + + - If yes, go to :ref:`15 `. + - If no, go to :ref:`18 `. + +#. .. _alm-12033__li1119862391458: + + Run the **smartctl -d [sat|scsi]+megaraid,[DID] -t long /dev/sd[x]** command to trigger the disk SMART self-check. After the command is executed, the time when the self-check is to be completed is displayed. After the self-check is completed, repeat :ref:`13 ` and :ref:`14 ` to check whether the disk is working properly. + + Example: + + **smartctl -d sat+megaraid,2 -t long /dev/sda** + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`18 `. + +#. .. _alm-12033__li1606413991458: + + If the configured RAID controller card does not support SMART, the disk does not support SMART. In this case, use the check tool provided by the corresponding RAID controller card vendor to rectify the fault. Then go to :ref:`17 `. + + For example, LSI is a MegaCLI tool. + +#. .. _alm-12033__li3381567991458: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**, click **Clear** in the **Operation** column of the alarm, and check whether the alarm is reported on the same disk again. + + If the alarm is reported for three times, replace the disk. + + - If yes, go to :ref:`18 `. + - If no, no further action is required. + +**Replace the disk.** + +18. .. _alm-12033__li6235920691458: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. + +19. View the detailed information about the alarm. Check the values of **HostName** and **DiskName** in the location information to obtain the information about the faulty disk for which the alarm is reported. + +20. Replace the disk. + +21. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`22 `. + +**Collect the fault information.** + +22. .. _alm-12033__li4518231891458: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +23. Select **OMS** for **Service** and click **OK**. + +24. Click |image3| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +25. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001410107141.png +.. |image2| image:: /_static/images/en-us_image_0263895818.jpg +.. |image3| image:: /_static/images/en-us_image_0263895453.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12034_periodical_backup_failure.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12034_periodical_backup_failure.rst new file mode 100644 index 0000000..26b126e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12034_periodical_backup_failure.rst @@ -0,0 +1,127 @@ +:original_name: ALM-12034.html + +.. _ALM-12034: + +ALM-12034 Periodical Backup Failure +=================================== + +Description +----------- + +The system executes the periodic backup task every 60 minutes. This alarm is generated when a periodical backup task fails to be executed. 
This alarm is cleared when the next backup task is executed successfully. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12034 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| TaskName | Specifies the task. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +There are not available backup packages for a long time, so the system cannot be restored in case of exceptions. + +Possible Causes +--------------- + +The alarm cause depends on the task details. Handle the alarm according to the logs and alarm details. + +Procedure +--------- + +**Check whether the disk space is sufficient.** + +#. In the FusionInsight Manager portal, click **O&M > Alarm > Alarms**. + +#. In the alarm list, click |image1| in the row where the alarm is located and obtain **TaskName** from **Location**. + +#. Choose **O&M** > **Backup and Restoration > Backup Management**. + +#. Search for the backup task based on **TaskName** and click **More** in the **Operation** column. In the displayed dialog box, click **View History** and view the task details. + +#. In the displayed dialog box and click |image2| to check whether the following message is displayed: Failed to backup xx due to insufficient disk space, move the data in the xx directory to other directories. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`13 `. + +#. .. _alm-12034__li8265923133114: + + Choose **Backup Path** > **View** and obtain the **Backup Path**. + +#. Log in to the node as user **root** and run the following command to check the node mounting details: + + **df -h** + +#. Check whether the available space of the node to which the backup path is mounted is less than 20 GB. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`13 `. + +#. .. _alm-12034__li181154133220: + + Check whether there are many backup packages in the backup directory. + + - If yes, go to :ref:`10 `. + - If no, go to :ref:`13 `. + +#. .. _alm-12034__li3795101373317: + + Enable the available space of the node to which the backup directory is mounted to be greater than 20 GB by moving backup packages out of the backup directory or delete the backup packages. + +#. After the problem is resolved, perform the backup task again and check whether the backup task execution is successful. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`13 `. + +#. .. _alm-12034__li5916521794522: + + After 2 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`13 `. + +**Collect fault information.** + +13. .. 
_alm-12034__li115006411351: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +14. Select **Controller** from the **Service** and click **OK**. + +15. Click |image3| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +16. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383843.png +.. |image2| image:: /_static/images/en-us_image_0000001127057881.png +.. |image3| image:: /_static/images/en-us_image_0269383844.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12035_unknown_data_status_after_recovery_task_failure.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12035_unknown_data_status_after_recovery_task_failure.rst new file mode 100644 index 0000000..795ad1d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12035_unknown_data_status_after_recovery_task_failure.rst @@ -0,0 +1,106 @@ +:original_name: ALM-12035.html + +.. _ALM-12035: + +ALM-12035 Unknown Data Status After Recovery Task Failure +========================================================= + +Description +----------- + +After the recovery task fails, the system automatically rolls back every 60 minutes. If the rollback fails, data may be lost. If this occurs, an alarm is reported. This alarm is cleared when the next recovery task execution is successful. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12035 Critical Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| TaskName | Specifies the task. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +After the recovery task fails, the system automatically rolls back. If the rollback fails, data may be lost or the data status may be unknown, which may affect services. + +Possible Causes +--------------- + +The alarm cause depends on the task details. Handle the alarm according to the logs and alarm details. + +Procedure +--------- + +**Collect fault information.** + +#. 
In the FusionInsight Manager, choose **Cluster >** *Name of the desired cluster* **> Services**, and check whether the running status of the component meets the requirements. (The OMS and DBService must be in the normal state, and other components must be stopped.) + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`2 `. + +#. .. _alm-12035__li16912228111613: + + Restore the component status as required and start the recovery task again. + +#. Log in to the FusionInsight Manager portal and click **O&M > Alarm > Alarms**. + +#. In the alarm list, click |image1| in the row where the alarm is located to obtain **TaskName** from **Location**. + +#. Choose **O&M** > **Backup and Restoration > Restoration Management**. + +#. Find the restoration task by **Task Name** and view the task details. + +#. Perform the recovery task again and check whether the recovery task execution is successful. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +#. .. _alm-12035__li691272812168: + + After 2 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-12035__li18912172820165: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +10. Select **Controller** from the **Service** and click **OK**. + +11. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383845.png +.. |image2| image:: /_static/images/en-us_image_0269383846.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12038_monitoring_indicator_dumping_failure.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12038_monitoring_indicator_dumping_failure.rst new file mode 100644 index 0000000..894d56b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12038_monitoring_indicator_dumping_failure.rst @@ -0,0 +1,150 @@ +:original_name: ALM-12038.html + +.. _ALM-12038: + +ALM-12038 Monitoring Indicator Dumping Failure +============================================== + +Description +----------- + +After monitoring indicator dumping is configured on FusionInsight Manager, the system checks the monitoring indicator dumping result at the dumping interval (60 seconds by default). This alarm is generated when the dumping fails. + +This alarm is cleared when dumping is successful. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12038 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. 
| ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The upper-layer management system cannot obtain monitoring indicators from the FusionInsight Manager system. + +Possible Causes +--------------- + +- The server cannot be connected. +- The save path on the server cannot be accessed. +- The monitoring indicator file fails to be uploaded. + +Procedure +--------- + +**Check whether the server connection is normal.** + +#. Check whether the network between the FusionInsight Manager system and the server is normal. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`2 `. + +#. .. _alm-12038__li59131350103617: + + Contact the network administrator to recover the network and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm-12038__li44378490103617: + + Choose **System** > **Interconnection > Upload Performance Data** and check whether the FTP username, password, port, dump mode, and public key configured on the upload performance data page are consistent with the configuration on the server. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`4 `. + +#. .. _alm-12038__li38260071103617: + + Enter the correct configuration information, click **OK**, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check the permission of the save path on the server is correct.** + +5. .. _alm-12038__li31439394103617: + + Choose **System** > **Interconnection > Upload Performance Data** and check the configuration items **FTP Username**, **Save Path**, and **Dump Mode**. + + - If the dump mode is FTP, go to :ref:`6 `. + - If the dump mode is SFTP, go to :ref:`7 `. + +6. .. _alm-12038__li58736977103617: + + Log in to the server in FTP mode. In the default path, check whether **FTP Username** has the read and write permission of the relative path **Save Path**. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`8 `. + +7. .. _alm-12038__li38059143103617: + + Log in to the server in SFTP mode and check whether **FTP Username** has the read and write permission of the absolute path **Save Path**. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`8 `. + +8. .. _alm-12038__li47558825103617: + + Add the read and write permission and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Check whether the save path on the server has sufficient disk space.** + +9. .. _alm-12038__li35446984103617: + + Log in to the server and check whether the save path has sufficient disk space. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`10 `. + +10. .. _alm-12038__li53095195103617: + + Delete unnecessary files or go to the monitoring indicator dumping configuration page to change the save path. Then, check whether the save path has sufficient disk space. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Collect fault information.** + +11. .. 
_alm-12038__li51692141103617: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +12. Select **OMS** from the **Service** and click **OK**. + +13. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +14. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383850.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12039_active_standby_oms_databases_not_synchronized.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12039_active_standby_oms_databases_not_synchronized.rst new file mode 100644 index 0000000..667a497 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12039_active_standby_oms_databases_not_synchronized.rst @@ -0,0 +1,153 @@ +:original_name: ALM-12039.html + +.. _ALM-12039: + +ALM-12039 Active/Standby OMS Databases Not Synchronized +======================================================= + +Description +----------- + +The system checks the data synchronization status between the active and standby OMS Databases every 10 seconds. This alarm is generated when the synchronization status cannot be queried for 30 consecutive times or when the synchronization status is abnormal. + +This alarm is cleared when the data synchronization status becomes normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12039 Critical Yes +======== ============== ========== + +Parameters +---------- + ++---------------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=====================+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++---------------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++---------------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++---------------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++---------------------+-------------------------------------------------------------------+ +| Local GaussDB HA IP | Specifies the HA IP address of the local GaussDB. | ++---------------------+-------------------------------------------------------------------+ +| Peer GaussDB HA IP | Specifies the HA IP address of the peer GaussDB. | ++---------------------+-------------------------------------------------------------------+ +| SYNC_PERCENT | Specifies the synchronization percentage. 
| ++---------------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When data is not synchronized between the active and standby OMS Databases, data may be lost or abnormal if the active instance becomes abnormal. + +Possible Causes +--------------- + +- The network between the active and standby nodes is unstable. +- The standby OMS Database is abnormal. +- The standby node disk space is full. + +Procedure +--------- + +**Check whether the network between the active and standby nodes is normal.** + +#. Log in to FusionInsight Manager, click **O&M > Alarm > Alarms**, click |image1| in the row where the alarm is located, and query the standby OMS Database IP address. + +#. Log in to the active OMS Database node as user **root**. + +#. Run the **ping** *Standby OMS Database heartbeat IP address* command to check whether the standby OMS Database node is reachable. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`4 `. + +#. .. _alm-12039__li36080609104950: + + Contact the network administrator to check whether the network is faulty. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-12039__li35036231104950: + + Rectify the network fault and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Check whether the standby OMS Database is normal.** + +6. .. _alm-12039__li19362442104950: + + Log in to the standby OMS Database node as user **root**. + +7. Run the **su - omm** command to switch to user **omm**. + +8. Go to the **${BIGDATA_HOME}/om-server/om/sbin/** directory and run the **./status-oms.sh** command to check whether the OMS Database resource status of the standby DBService is normal. In the command output, check whether the following information is displayed in the row where **ResName** is **gaussDB**: + + For example: + + .. code-block:: + + 10_10_10_231 gaussDB Standby_normal Normal Active_standby + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`16 `. + +**Check whether the standby node disk space is full.** + +9. .. _alm-12039__li14387074104950: + + Log in to the standby OMS Database node as user **root**. + +10. Run the **su - omm** command to switch to user **omm**. + +11. Run the **echo ${BIGDATA_DATA_HOME}/dbdata_om** command to obtain the OMS Database data directory. + +12. Run the **df -h** command to view the system disk partition usage information. + +13. Check whether the disk where the OMS Database data directory is mounted is full. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`16 `. + +14. .. _alm-12039__li27597409104950: + + Expand the disk capacity. + +15. After the disk capacity is expanded, wait 2 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`16 `. + +**Collect fault information.** + +16. .. _alm-12039__li64121842104950: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +17. Select **OMMServer** from the **Service** and click **OK**. + +18. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +19. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269383851.png +.. |image2| image:: /_static/images/en-us_image_0269383852.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12040_insufficient_system_entropy.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12040_insufficient_system_entropy.rst new file mode 100644 index 0000000..054c525 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12040_insufficient_system_entropy.rst @@ -0,0 +1,169 @@ +:original_name: ALM-12040.html + +.. _ALM-12040: + +ALM-12040 Insufficient System Entropy +===================================== + +Description +----------- + +The system checks the entropy for five consecutive times at 00:00 every day. Specifically, the system checks whether rng-tools or haveged has been enabled and correctly configured. If neither is configured, the system continues to check the entropy. If the entropy is less than 100 for five consecutive times, this alarm is reported. + +This alarm is cleared when the system detects that the true random number mode has been configured, the random number parameters have been configured in the pseudo-random number mode, or neither mode is configured but the entropy of the OS is greater than or equal to 100 in at least one of five entropy checks. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12040 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The system is not running properly. + +Possible Causes +--------------- + +- rng-tools or haveged has not been installed or started. +- The entropy of the OS is smaller than 100 for multiple consecutive times. + +Procedure +--------- + +**Check whether haveged or rng-tools has been installed or started.** + +#. Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Alarms**. + +#. Check the value of **HostName** in the **Location** area to obtain the name of the host for which the alarm is generated. + +#. Log in to the node for which the alarm is generated as user **root**. + +#. Run the **/bin/rpm -qa \| grep -w "haveged"** command to check the haveged installation status and check whether the command output is empty. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`5 `. + +#. .. _alm-12040__li35057727105655: + + Run the **/sbin/service haveged status \|grep "running"** command and check the command output. 
+ + - If the command is executed successfully, haveged has been installed and configured correctly and is running properly. Go to :ref:`8 `. + + - If the command fails to execute, haveged is not running properly. Run the following command to manually restart haveged and go to :ref:`9 `: + + **systemctl restart haveged.service** + +#. .. _alm-12040__li978924652119: + + Run the **/bin/rpm -qa \| grep -w "rng-tools"** command to check the rng-tools installation and check whether the command output is empty. + + - If yes, contact the OS vendor to install and start haveged or rng-tools. Then go to :ref:`9 `. + - If no, go to :ref:`7 `. + +#. .. _alm-12040__li34867421105655: + + Run the **ps -ef \| grep -v "grep" \| grep rngd \| tr -d " " \| grep "\\-r/dev/urandom"** command and check the command output. + + - If the command is executed successfully, rngd has been installed and configured correctly and is running properly. Go to :ref:`8 `. + + - If the command fails to execute, rngd is not running properly. Run the following command to manually restart rngd and go to :ref:`9 `: + + **systemctl restart rngd.service** + +**Check the entropy of the OS.** + +8. .. _alm-12040__li22912175218: + + Manually check the entropy of the OS. + + Log in to the target node as user **root** and run the **cat /proc/sys/kernel/random/entropy_avail** command to check whether the entropy of the OS meets cluster installation requirements (no less than 100). + + - If yes, the entropy of the OS is not less than 100. Go to :ref:`9 `. + - If no, the entropy of the OS is less than 100. Use either of the following methods and go to :ref:`9 `. + + - Method 1: Use haveged (true random number mode). Contact the OS vendor to install and start haveged. + + In Kylin, run the following command: + + **vi /usr/lib/systemd/system/haveged.service** + + Configure **Type**, **ExecStart**, **SuccessExitStatus**, and **Restart** in **[Service]** as follows: + + .. code-block:: + + Type=simple + ExecStart=/usr/sbin/haveged -w 1024 -v 1 -Foreground + SuccessExitStatus=137 143 + Restart=always + + - Method 2: Use rng-tools (pseudo-random number mode). Contact the OS vendor to install and start rng-tools and configure it based on the OS type. + + - In Red Hat Linux or CentOS, run the following commands: + + **echo 'EXTRAOPTIONS="-r /dev/urandom -o /dev/random -t 1 -i"' >> /etc/sysconfig/rngd** + + **service rngd start** + + **chkconfig rngd on** + + - In SUSE, run the following commands: + + **rngd -r /dev/urandom -o /dev/random** + + **echo "rngd -r /dev/urandom -o /dev/random" >> /etc/rc.d/after.local** + + - In Kylin, run the following command as user **root** on the node where the alarm is reported: + + **vi /usr/lib/systemd/system/rngd.service** + + Change the value of **ExecStart** in **[Service]** as follows: + + .. code-block:: + + ExecStart=/sbin/rngd -f -r /dev/urandom -s 2048 + +9. .. _alm-12040__li20231214524: + + Wait until the system checks the entropy at 00:00 on the following day and then check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`10 `. + +**Collect fault information.** + +10. .. _alm-12040__li5962839105655: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +11. Select **NodeAgent** for **Service** and click **OK**. + +12. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. 
Then, click **Download**. + +13. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +.. |image1| image:: /_static/images/en-us_image_0263895382.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12041_incorrect_permission_on_key_files.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12041_incorrect_permission_on_key_files.rst new file mode 100644 index 0000000..813ba8d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12041_incorrect_permission_on_key_files.rst @@ -0,0 +1,108 @@ +:original_name: ALM-12041.html + +.. _ALM-12041: + +ALM-12041 Incorrect Permission on Key Files +=========================================== + +Description +----------- + +The system checks whether the permission, user, and user group information about critical directories or files is normal every 5 minutes. This alarm is generated when the information is abnormal. + +This alarm is cleared when the information becomes normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12041 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| PathName | Specifies the path or name of the abnormal file. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +System functions are unavailable. + +Possible Causes +--------------- + +The file permission is abnormal or the file is lost due to a user manually modified information such as the file permission, user, and user group, or the system is powered off unexpectedly. + +Procedure +--------- + +**Check whether the abnormal file exists and whether the permission on the abnormal file is correct.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms**. + +#. Check the value of **HostName** to obtain the host name involved in this alarm. Check the value of **PathName** to obtain the path or name of the abnormal file. + +#. Log in to the node for which the alarm is generated as user **root**. + +#. Run the **ll** *pathName* command, where *pathName* indicates the name of the abnormal file to obtain the user, permission, and user group information about the file or directory. + +#. .. _alm-12041__li1834285111014: + + Go to **${BIGDATA_HOME}/om-agent/nodeagent/etc/agent/autocheck** directory. 
Then run the **vi keyfile** command and search for the name of the abnormal file and check the due permission of the file. + + .. note:: + + To ensure proper configuration synchronization between the active and standby OMS servers, files, directories, and files and sub-directories in the directories configured in **$OMS_RUN_PATH/workspace/ha/module/hasync/plugin/conf/filesync.xml** will also be monitored except files and directories in **keyfile**. User **omm** must have read and write permissions of files and read and execute permissions of directories. + +#. Compare the real-world permission of the file with the due permission obtained in :ref:`5 ` and correct the permission, user, and user group information for the file. + +#. Wait a hour and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + + .. note:: + + If the disk partition where the cluster installation directory resides is used up, some temporary files will be generated in the program installation directory when running the **sed** command fails. Users do not have the read, write, and execute permissions of these temporary files. The system reports an alarm indicating that permissions of temporary files are abnormal if these files are within the monitoring range of the alarm. Perform the preceding alarm handling processes to clear the alarm. Alternatively, you can directly delete the temporary files after confirming that files with abnormal permissions are temporary. The temporary file generated after a **sed** command execution failure is similar to the following. + + |image1| + +**Collect fault information.** + +8. .. _alm-12041__li1068683211014: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +9. Select **NodeAgent** from the **Service** and click **OK**. + +10. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383855.jpg +.. |image2| image:: /_static/images/en-us_image_0269383856.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12042_incorrect_configuration_of_key_files.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12042_incorrect_configuration_of_key_files.rst new file mode 100644 index 0000000..22335c3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12042_incorrect_configuration_of_key_files.rst @@ -0,0 +1,123 @@ +:original_name: ALM-12042.html + +.. _ALM-12042: + +ALM-12042 Incorrect Configuration of Key Files +============================================== + +Description +----------- + +The system checks whether critical configurations are correct every 5 minutes. This alarm is generated when the configurations are abnormal. + +This alarm is cleared when the configurations become normal. 
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12042 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| PathName | Specifies the path or name of the abnormal file. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Functions related to the file are abnormal. + +Possible Causes +--------------- + +The file configuration is modified manually or the system is powered off unexpectedly. + +Procedure +--------- + +**Check abnormal file configuration.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms**. + +#. Check the value of **HostName** to obtain the host name involved in this alarm. Check the value of **PathName** to obtain the path or name of the abnormal file. + +#. Log in to the node for which the alarm is generated as user **root**. + +#. View the $BIGDATA_LOG_HOME/nodeagent/scriptlog/checkfileconfig.log file and analyze the cause based on the error log. Locate the check standards of the file in the :ref:`Related Information ` and manually check and modify the file based on the standards. + + Run the **vi** *file name* command to enter the editing mode, and then press **Insert** to start editing. + + After the modification is complete, press **Esc** to exit the editing mode and enter **:wq** to save the settings and exit. + + For example: + + **vi /etc/ssh/sshd_config** + +#. Wait an hour and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-12042__li1843685711310: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +7. Select **NodeAgent** from the **Service** and click **OK**. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +.. _alm-12042__en-us_topic_0070543617_cab: + +Related Information +------------------- + +- **Check standards of /etc/fstab** + + Check whether the partitions configured in the **/etc/fstab** file can be found in **/proc/mounts**. + + Check whether the swap partitions configured in fstab correspond to those in /proc/swaps. + +- **Check the /etc/hosts configuration file.** + + Run the **cat /etc/hosts** command. 
If any of the following situations occurs, the **/etc/hosts** configuration file is abnormal: + + #. The **/etc/hosts** file does not exist. + #. The host name is not configured in the file. + #. The host name maps to multiple IP addresses in the file. + #. The IP address corresponding to the host name does not exist in the command output of the **ifconfig** command. + #. One IP address maps to multiple host names in the file. + +- **Check standards of /etc/ssh/sshd_config** + + Run the **vi /etc/ssh/sshd_config** command to check whether configuration items are configured as follows: + + #. The value of **UseDNS** must be set to **no**. + #. The value of **MaxStartups** must be greater than or equal to 1000. + #. At least one of the **PasswordAuthentication** and **ChallengeResponseAuthentication** parameters must be left blank or at least one of the parameters be set to **yes**. + +.. |image1| image:: /_static/images/en-us_image_0269383857.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12045_read_packet_dropped_rate_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12045_read_packet_dropped_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..31db5a2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12045_read_packet_dropped_rate_exceeds_the_threshold.rst @@ -0,0 +1,295 @@ +:original_name: ALM-12045.html + +.. _ALM-12045: + +ALM-12045 Read Packet Dropped Rate Exceeds the Threshold +======================================================== + +Description +----------- + +The system checks the read packet dropped rate every 30 seconds. This alarm is generated when the read packet dropped rate exceeds the threshold (the default threshold is 0.5%) for multiple times (the default value is **5**). + +To change the threshold, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Reading** > **Read Packet Dropped Rate**. + +This alarm is cleared when **Trigger Count** is 1 and the read packet dropped rate is less than or equal to the threshold. This alarm is cleared when **Trigger Count** is greater than 1 and the read packet dropped rate is less than or equal to 90% of the threshold. + +The alarm detection is disabled by default. If you want to enable this function, check whether this function can be enabled based on Checking System Environments. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12045 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------+ +| Name | Meaning | ++===================+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| PortName | Specifies the network port for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The service performance deteriorates or some services time out. + +Risk warning: In SUSE kernel 3.0 or later or Red Hat 7.2, the system kernel modifies the mechanism for counting the number of dropped read packets. In this case, this alarm may be generated even if the network is running properly, but services are not affected. You are advised to check the system environment first. + +Possible Causes +--------------- + +- An OS exception occurs. +- The NICs are bonded in active/standby mode. +- The alarm threshold is improperly configured. +- The network quality is poor. + +Procedure +--------- + +**View the network packet dropped rate.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. On the page that is displayed, click |image1| in the row containing the alarm, and view the name of the host for which the alarm is generated and the NIC name. + +#. Log in to the alarm node as user **omm**, and run the **/sbin/ifconfig** *NIC name* command to check whether packet loss occurs on the network. + + |image2| + + .. note:: + + - *IP address of the node for which the alarm is generated*: Query the IP address of the node for which the alarm is generated on the **Hosts** page of FusionInsight Manager based on the value of **HostName** in the alarm location information. Check both the IP addresses of the management plane and service plane. + - Packet loss rate = (Number of dropped packets/Total number of received packets) x 100%. If the packet loss rate is greater than the system threshold (0.5% by default), read packets are dropped. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`3 `. + +**Check the system environment.** + +3. .. _alm-12045__li6542838717657: + + Log in to the active OMS node or the alarm node as user **omm**. + +4. Run the **cat /etc/*-release** command to check the OS type. + + - For Red Hat Enterprise Linux, go to :ref:`5 `. + + .. code-block:: + + # cat /etc/*-release + Red Hat Enterprise Linux Server release 7.2 (Santiago) + + - For SUSE Linux, go to :ref:`6 `. + + .. code-block:: + + # cat /etc/*-release + SUSE Linux Enterprise Server 11 (x86_64) + VERSION = 11 + PATCHLEVEL = 3 + + - For other OS types, go to :ref:`11 `. + +5. .. _alm-12045__li5563721171656: + + Run the **cat /etc/redhat-release** command to check whether the OS version is **Red Hat 7.2 (x86)** or **Red Hat 7.4 (TaiShan)**. + + .. code-block:: + + # cat /etc/redhat-release + Red Hat Enterprise Linux Server release 7.2 (Santiago) + + - If yes, the alarm sending function cannot be enabled. Go to :ref:`7 `. + - If no, go to :ref:`11 `. + +6. .. _alm-12045__li42309040172040: + + Run the **cat /proc/version** command to check whether the SUSE kernel version is 3.0 or later. + + .. 
code-block:: + + # cat /proc/version + Linux version 3.0.101-63-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Tue Jun 23 16:02:31 UTC 2015 (4b89d0c) + + - If yes, the alarm sending function cannot be enabled. Go to :ref:`7 `. + - If no, go to :ref:`11 `. + +7. .. _alm-12045__li43950618195120: + + Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Threshold Configuration**. + +8. In the navigation tree of the **Thresholds** page, choose *Name of the desired cluster* > **Host** > **Network Reading** > **Read Packet Dropped Rate**. In the area on the right, check whether the **Switch** is toggled on. + + - If yes, the alarm sending function is enabled. Go to :ref:`9 `. + - If no, the alarm sending function is disabled. Go to :ref:`10 `. + +9. .. _alm-12045__li38517503111027: + + In the area on the right, toggle **Switch** off to disable the checking of **Network Read Packet Dropped Rate Exceeds the Threshold**. + + |image3| + +10. .. _alm-12045__li16613085112024: + + On the **Alarm** page of FusionInsight Manager, search for alarm **12045** and manually clear the alarm if it is not automatically cleared. No further action is required. + + |image4| + + .. note:: + + ID of the Network Read Packet Dropped Rate Exceeds the Threshold alarm is **12045**. + +**Check whether the NICs are bonded in active/standby mode.** + +11. .. _alm-12045__li4196511811134: + + Log in to the alarm node as user **omm** and run the **ls -l /proc/net/bonding** command to check whether the **/proc/net/bonding** directory exists on the node. + + - If yes, the bond mode is configured for the node. Go to :ref:`12 `. + + .. code-block:: + + # ls -l /proc/net/bonding/ + total 0 + -r--r--r-- 1 root root 0 Oct 11 17:35 bond0 + + - If no, the bond mode is not configured for the node. Go to :ref:`14 `. + + .. code-block:: + + # ls -l /proc/net/bonding/ + ls: cannot access /proc/net/bonding/: No such file or directory + +12. .. _alm-12045__li56651960171744: + + Run the **cat /proc/net/bonding/**\ *bond0* command to check whether the value of **Bonding Mode** in the configuration file is **fault-tolerance**. + + .. note:: + + In the command, **bond0** indicates the name of the bond configuration file. Use the file name obtained in :ref:`11 `. + + .. code-block:: + + # cat /proc/net/bonding/bond0 + Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011) + + Bonding Mode: fault-tolerance (active-backup) + Primary Slave: eth1 (primary_reselect always) + Currently Active Slave: eth1 + MII Status: up + MII Polling Interval (ms): 100 + Up Delay (ms): 0 + Down Delay (ms): 0 + + Slave Interface: eth0 + MII Status: up + Speed: 1000 Mbps + Duplex: full + Link Failure Count: 1 + Slave queue ID: 0 + + Slave Interface: eth1 + MII Status: up + Speed: 1000 Mbps + Duplex: full + Link Failure Count: 1 + Slave queue ID: 0 + + - If yes, the NICs are bonded in active/standby mode. Go to :ref:`13 `. + - If no, go to :ref:`14 `. + +13. .. _alm-12045__li44376005172456: + + Check whether the NIC specified by **NetworkCardName** in the alarm is the standby NIC. + + - If yes, the alarm of the standby NIC cannot be automatically cleared. Manually clear the alarm on the alarm management page. No further action is required. + - If no, go to :ref:`14 `. + + .. note:: + + To determine the standby NIC, check the **/proc/net/bonding/bond0** configuration file. 
If the NIC name corresponding to **NetworkCardName** is **Slave Interface** but not **Currently Active Slave** (the current active NIC), the NIC is the standby one. + +**Check whether the threshold is set properly.** + +14. .. _alm-12045__li61276131112834: + + Log in to FusionInsight Manager, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Reading** > **Read Packet Dropped Rate**, and check whether the alarm threshold is configured properly. The default value is **0.5%**. You can adjust the threshold as needed. + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`15 `. + +15. .. _alm-12045__li47653126112834: + + Choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Reading** > **Read Packet Dropped Rate**. Click **Modify** in the **Operation** column to change the threshold. See :ref:`Figure 1 `. + + .. _alm-12045__fig52784093112834: + + .. figure:: /_static/images/en-us_image_0000001390618884.png + :alt: **Figure 1** Configuring the alarm threshold + + **Figure 1** Configuring the alarm threshold + +16. After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`17 `. + +**Check whether the network connection is normal.** + +17. .. _alm-12045__li56023883112834: + + Contact the network administrator to check whether the network is normal. + + - If yes, rectify the fault and go to :ref:`18 `. + - If no, go to :ref:`19 `. + +18. .. _alm-12045__li4503547112834: + + After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`19 `. + +**Collect the fault information.** + +19. .. _alm-12045__li40531926112834: + + On FusionInsight Manager of the active cluster, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +20. Select **OMS** for **Service** and click **OK**. + +21. Expand the **Hosts** dialog box and select the alarm node and the active OMS node. + +22. Click |image5| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +23. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895776.png +.. |image2| image:: /_static/images/en-us_image_0000001390459688.png +.. |image3| image:: /_static/images/en-us_image_0263895526.png +.. |image4| image:: /_static/images/en-us_image_0263895376.png +.. |image5| image:: /_static/images/en-us_image_0263895382.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12046_write_packet_dropped_rate_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12046_write_packet_dropped_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..f9d5649 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12046_write_packet_dropped_rate_exceeds_the_threshold.rst @@ -0,0 +1,124 @@ +:original_name: ALM-12046.html + +.. 
_ALM-12046: + +ALM-12046 Write Packet Dropped Rate Exceeds the Threshold +========================================================= + +Description +----------- + +The system checks the write packet dropped rate every 30 seconds. This alarm is generated when the write packet dropped rate exceeds the threshold (the default threshold is 0.5%) for multiple times (the default value is **5**). + +To change the threshold, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Writing** > **Write Packet Dropped Rate**. + +If **Trigger Count** is **1**, this alarm is cleared when the network write packet dropped rate is less than or equal to the threshold. If **Trigger Count** is greater than **1**, this alarm is cleared when the network write packet dropped rate is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12046 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------+ +| Name | Meaning | ++===================+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| Port Name | Specifies the network port for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The service performance deteriorates or some services time out. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The network quality is poor. + +Procedure +--------- + +**Check whether the threshold is set properly.** + +#. Log in to FusionInsight Manager, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Writing** > **Write Packet Dropped Rate**, and check whether the alarm threshold is configured properly. The default value is **0.5%**. You can adjust the threshold as needed. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 `. + +#. .. _alm-12046__li5699560811450: + + Choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Writing** > **Write Packet Dropped Rate**. Click **Modify** in the **Operation** column to change the threshold. + + See :ref:`Figure 1 `. + + .. _alm-12046__fig153215311450: + + .. figure:: /_static/images/en-us_image_0000001390459444.png + :alt: **Figure 1** Configuring the alarm threshold + + **Figure 1** Configuring the alarm threshold + +#. After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. 
+ - If no, go to :ref:`4 `. + +**Check whether the network connection is normal.** + +4. .. _alm-12046__li4369794811450: + + Contact the network administrator to check whether the network is normal. + + - If yes, rectify the fault and go to :ref:`5 `. + - If no, go to :ref:`6 `. + +5. .. _alm-12046__li6056359711450: + + After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-12046__li820146511450: + + On FusionInsight Manager of the active cluster, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Select **OMS** for **Service** and click **OK**. + +8. Expand the **Hosts** dialog box and select the alarm node and the active OMS node. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +10. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895382.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12047_read_packet_error_rate_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12047_read_packet_error_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..19061f3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12047_read_packet_error_rate_exceeds_the_threshold.rst @@ -0,0 +1,124 @@ +:original_name: ALM-12047.html + +.. _ALM-12047: + +ALM-12047 Read Packet Error Rate Exceeds the Threshold +====================================================== + +Description +----------- + +The system checks the read packet error rate every 30 seconds. This alarm is generated when the read packet error rate exceeds the threshold (the default threshold is **0.5%**) for multiple times (the default value is **5**). + +To change the threshold, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Reading** > **Read Packet Error Rate**. + +If **Trigger Count** is **1**, this alarm is cleared when the read packet error rate is less than or equal to the threshold. If **Trigger Count** is greater than **1**, this alarm is cleared when the read packet error rate is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12047 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------+ +| Name | Meaning | ++===================+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| Port Name | Specifies the network port for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The communication is intermittently interrupted, and services time out. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The network quality is poor. + +Procedure +--------- + +**Check whether the threshold is set properly.** + +#. Log in to FusionInsight Manager, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Reading** > **Read Packet Error Rate**, and check whether the alarm threshold is configured properly. The default value is **0.5%**. You can adjust the threshold as needed. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 `. + +#. .. _alm-12047__li18938060144325: + + Choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Reading** > **Read Packet Error Rate**. Click **Modify** in the **Operation** column to change the threshold. + + See :ref:`Figure 1 `. + + .. _alm-12047__fig35859496144325: + + .. figure:: /_static/images/en-us_image_0000001441218249.png + :alt: **Figure 1** Configuring the alarm threshold + + **Figure 1** Configuring the alarm threshold + +#. After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the network connection is normal.** + +4. .. _alm-12047__li47122569144325: + + Contact the network administrator to check whether the network is normal. + + - If yes, rectify the fault and go to :ref:`5 `. + - If no, go to :ref:`6 `. + +5. .. _alm-12047__li52164171144325: + + After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-12047__li66824355144325: + + On FusionInsight Manager of the active cluster, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Select **OMS** for **Service** and click **OK**. + +8. Expand the **Hosts** dialog box and select the alarm node and the active OMS node. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +10. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0263895382.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12048_write_packet_error_rate_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12048_write_packet_error_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..2b6b206 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12048_write_packet_error_rate_exceeds_the_threshold.rst @@ -0,0 +1,124 @@ +:original_name: ALM-12048.html + +.. _ALM-12048: + +ALM-12048 Write Packet Error Rate Exceeds the Threshold +======================================================= + +Description +----------- + +The system checks the write packet error rate every 30 seconds. This alarm is generated when the write packet error rate exceeds the threshold (the default threshold is **0.5%**) for multiple times (the default value is **5**). + +To change the threshold, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Writing** > **Write Packet Error Rate**. + +If **Trigger Count** is **1**, this alarm is cleared when the write packet error rate is less than or equal to the threshold. If **Trigger Count** is greater than **1**, this alarm is cleared when the write packet error rate is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12048 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------+ +| Name | Meaning | ++===================+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| Port Name | Specifies the network port for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The communication is intermittently interrupted, and services time out. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The network quality is poor. + +Procedure +--------- + +**Check whether the threshold is set properly.** + +#. Log in to FusionInsight Manager, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Writing** > **Write Packet Error Rate**, and check whether the alarm threshold is configured properly. The default value is **0.5%**. 
You can adjust the threshold as needed. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 `. + +#. .. _alm-12048__li15963175145357: + + Choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Writing** > **Write Packet Error Rate**. Click **Modify** in the **Operation** column to change the threshold. + + See :ref:`Figure 1 `. + + .. _alm-12048__fig53221363145357: + + .. figure:: /_static/images/en-us_image_0000001390619040.png + :alt: **Figure 1** Configuring the alarm threshold + + **Figure 1** Configuring the alarm threshold + +#. After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the network connection is normal.** + +4. .. _alm-12048__li12888339145357: + + Contact the network administrator to check whether the network is normal. + + - If yes, rectify the fault and go to :ref:`5 `. + - If no, go to :ref:`6 `. + +5. .. _alm-12048__li60279330145357: + + After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-12048__li5643066145357: + + On FusionInsight Manager of the active cluster, choose **O&M** > **Log** > **Download**. + +7. Select **OMS** for **Service** and click **OK**. + +8. Expand the **Hosts** dialog box and select the alarm node and the active OMS node. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +10. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895382.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12049_network_read_throughput_rate_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12049_network_read_throughput_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..2149848 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12049_network_read_throughput_rate_exceeds_the_threshold.rst @@ -0,0 +1,130 @@ +:original_name: ALM-12049.html + +.. _ALM-12049: + +ALM-12049 Network Read Throughput Rate Exceeds the Threshold +============================================================ + +Description +----------- + +The system checks the network read throughput rate every 30 seconds and compares the actual throughput rate with the threshold (the default threshold is 80%). This alarm is generated when the system detects that the network read throughput rate exceeds the threshold for several times (5 times by default) consecutively. + +To change the threshold, choose **O&M > Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Reading** > **Read Throughput Rate**. + +When the **Trigger Count** is 1, this alarm is cleared when the network read throughput rate is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the network read throughput rate is less than or equal to 90% of the threshold. 
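+
+A rough way to reproduce this check by hand is sketched below: it samples the received byte counter of one network port twice, 30 seconds apart (the same interval the system uses), and expresses the resulting rate as a percentage of the port speed. This is an illustration only; the port name **eth0** is a placeholder, and reading the speed with **ethtool** may not work in VM environments (see the note in the procedure).
+
+.. code-block::
+
+   DEV=eth0
+   RX1=$(cat /sys/class/net/$DEV/statistics/rx_bytes)    # received bytes, first sample
+   sleep 30
+   RX2=$(cat /sys/class/net/$DEV/statistics/rx_bytes)    # received bytes, second sample
+   SPEED=$(ethtool "$DEV" | awk -F': ' '/Speed/ {print $2}' | tr -d 'Mb/s')    # port speed in Mbit/s
+   echo "Read throughput rate: $(( (RX2 - RX1) * 8 * 100 / 30 / (SPEED * 1000000) ))%"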
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12049 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| NetworkCardName | Specifies the network port for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The service system runs improperly or is unavailable. + +Possible Causes +--------------- + +- The alarm threshold is set improperly. +- The network port rate cannot meet the current service requirements. + +Procedure +--------- + +**Check whether the threshold is set properly.** + +#. On the FusionInsight Manager, choose **O&M > Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Reading** > **Read Throughput Rate** and check whether the alarm threshold is set properly. (By default, 80% is a proper value. However, users can configure the value as required.) + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`4 `. + +#. .. _alm-12049__li5611086815131: + + Based on actual usage condition, choose **O&M > Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Reading** > **Read Throughput Rate** and click **Modify** in the **Operation** column to modify the alarm threshold. + + For details, see :ref:`Figure 1 `. + + .. _alm-12049__fig566375315131: + + .. figure:: /_static/images/en-us_image_0000001440858201.png + :alt: **Figure 1** Setting alarm thresholds + + **Figure 1** Setting alarm thresholds + +#. Wait for 5 minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the network port rate can meet the service requirements.** + +4. .. 
_alm-12049__li3065917315131: + + On FusionInsight Manager, click |image1| in the row where the alarm is located in the real-time alarm list and obtain the IP address of the host and the network port name for which the alarm is generated. + +5. Log in to the host for which the alarm is generated as user **root**. + +6. Run the **ethtool** *network port name* command to check the maximum speed of the current network port. + + .. note:: + + In the VM environment, you cannot run a command to query the network port rate. It is recommended that you contact the system administrator to confirm whether the network port rate meets the requirements. + +7. If the network read throughput rate exceeds the threshold, contact the system administrator to increase the network port rate. + +8. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-12049__li4699944215131: + + On the FusionInsight Manager home page of the active cluster, choose **O&M** > **Log > Download**. + +10. Select **OMS** from the **Service** and click **OK**. + +11. Set **Host** to the node for which the alarm is generated and the active OMS node. + +12. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +13. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383872.png +.. |image2| image:: /_static/images/en-us_image_0269383873.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12050_network_write_throughput_rate_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12050_network_write_throughput_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..be268a6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12050_network_write_throughput_rate_exceeds_the_threshold.rst @@ -0,0 +1,130 @@ +:original_name: ALM-12050.html + +.. _ALM-12050: + +ALM-12050 Network Write Throughput Rate Exceeds the Threshold +============================================================= + +Description +----------- + +The system checks the network write throughput rate every 30 seconds and compares the actual throughput rate with the threshold (the default threshold is 80%). This alarm is generated when the system detects that the network write throughput rate exceeds the threshold for several times (5 times by default) consecutively. + +To change the threshold, choose **O&M > Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Writing** > **Write Throughput Rate**. + +When the **Trigger Count** is 1, this alarm is cleared when the network write throughput rate is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the network write throughput rate is less than or equal to 90% of the threshold. 
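+
+As an illustration only, the sketch below estimates the write (transmit) throughput rate of one port over a 30-second window and compares it with the port speed. The port name **eth0** is a placeholder, and **/sys/class/net/<port>/speed** may be unavailable or meaningless on virtual network ports (see the note in the procedure).
+
+.. code-block::
+
+   DEV=eth0
+   TX1=$(cat /sys/class/net/$DEV/statistics/tx_bytes)    # transmitted bytes, first sample
+   sleep 30
+   TX2=$(cat /sys/class/net/$DEV/statistics/tx_bytes)    # transmitted bytes, second sample
+   SPEED=$(cat /sys/class/net/$DEV/speed)                # port speed in Mbit/s
+   echo "Write throughput rate: $(( (TX2 - TX1) * 8 * 100 / 30 / (SPEED * 1000000) ))%"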
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12050 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| NetworkCardName | Specifies the network port for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The service system runs improperly or is unavailable. + +Possible Causes +--------------- + +- The alarm threshold is set improperly. +- The network port rate cannot meet the current service requirements. + +Procedure +--------- + +**Check whether the threshold is set properly.** + +#. On the FusionInsight Manager, choose **O&M > Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Writing** > **Write Throughput Rate** and check whether the alarm threshold is set properly. (By default, 80% is a proper value. However, users can configure the value as required.) + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 `. + +#. .. _alm-12050__li2386220215653: + + Based on actual usage condition, choose **O&M > Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Writing** > **Write Throughput Rate** and click **Modify** in the **Operation** column to modify the alarm threshold. + + For details, see :ref:`Figure 1 `. + + .. _alm-12050__fig2514972915653: + + .. figure:: /_static/images/en-us_image_0000001440978021.png + :alt: **Figure 1** Setting alarm thresholds + + **Figure 1** Setting alarm thresholds + +#. Wait for 5 minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the network port rate can meet the service requirements.** + +4. .. 
_alm-12050__li3034361015653: + + On FusionInsight Manager, click |image1| in the row where the alarm is located in the real-time alarm list and obtain the IP address of the host and the network port name for which the alarm is generated. + +5. Log in to the host for which the alarm is generated as user **root**. + +6. Run the **ethtool**\ *network port name* command to check the maximum speed of the current network port. + + .. note:: + + In the VM environment, you cannot run a command to query the network port rate. It is recommended that you contact the system administrator to confirm whether the network port rate meets the requirements. + +7. If the network write throughput rate exceeds the threshold, contact the system administrator to increase the network port rate. + +8. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-12050__li1329206015653: + + On the FusionInsight Manager home page of the active cluster, choose **O&M** > **Log > Download**. + +10. Select **OMS** from the **Service** and click **OK**. + +11. Set **Host** to the node for which the alarm is generated and the active OMS node. + +12. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +13. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383875.png +.. |image2| image:: /_static/images/en-us_image_0269383876.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12051_disk_inode_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12051_disk_inode_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..45258d8 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12051_disk_inode_usage_exceeds_the_threshold.rst @@ -0,0 +1,128 @@ +:original_name: ALM-12051.html + +.. _ALM-12051: + +ALM-12051 Disk Inode Usage Exceeds the Threshold +================================================ + +Description +----------- + +The system checks the disk Inode usage every 30 seconds and compares the actual Inode usage with the threshold (the default threshold is 80%). This alarm is generated when the Inode usage exceeds the threshold for several times (5 times by default) consecutively. + +To change the threshold, choose **O&M > Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Disk** > **Disk Inode Usage**. + +When the **Trigger Count** is 1, this alarm is cleared when the disk Inode usage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the disk Inode usage is less than or equal to 90% of the threshold. 
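+
+When investigating this alarm, it usually helps to find out which directories are consuming the inodes. The one-liner below is only a sketch: it assumes the affected partition is mounted at **/srv** (replace this with the partition reported by the alarm) and lists its sub-directories with the largest file counts first, complementing the unsorted loop shown in the procedure.
+
+.. code-block::
+
+   # Rank the sub-directories of /srv by the number of files (inodes) they contain.
+   for i in /srv/*; do printf '%10d  %s\n' "$(find "$i" 2>/dev/null | wc -l)" "$i"; done | sort -rn | head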
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12051 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| PartitionName | Specifies the disk partition for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Data cannot be properly written to the file system. + +Possible Causes +--------------- + +Massive small files are stored in the disk. + +Procedure +--------- + +**Massive small files are stored in the disk.** + +#. On FusionInsight Manager, choose **O&M > Alarm > Alarms** and click |image1| in the row where the alarm is located in the real-time alarm list and obtain the IP address of the host and the disk partition for which the alarm is generated. + +#. Log in to the host for which the alarm is generated as user **root**. + +#. Run the **df -i \| grep -iE "**\ *partition name\|*\ Filesystem" command to check the current disk Inode usage. + + .. code-block:: + + # df -i | grep -iE "xvda2|Filesystem" + Filesystem Inodes IUsed IFree IUse% Mounted on + /dev/xvda2 2359296 207420 2151876 9% / + +#. If the Inode usage exceeds the threshold, manually check small files stored in the disk partition and confirm whether these small files can be deleted. + + .. note:: + + Run the **for i in /*; do echo $i; find $i|wc -l;** **done** command to query the number of files in a partition. Replace **/\*** with the specified partition. + + .. code-block:: + + # for i in /srv/*; do echo $i; find $i|wc -l; done + /srv/BigData + 4284 + /srv/ftp + 1 + /srv/www + 13 + + - If yes, run the **rm -rf** *Path of the file or folder* to be deleted command to delete the file or folder and go to :ref:`5 `. + + .. note:: + + Deleting a file or folder is a high-risk operation. 
Ensure that the file or folder is no longer required before performing this operation. + + - If no, expand the capacity. Then, perform :ref:`5 `. + +#. .. _alm-12051__li52275864151050: + + Wait for 5 minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-12051__li1819875814203: + + On the FusionInsight Manager home page of the active cluster, choose **O&M** > **Log > Download**. + +7. Select **OMS** from the **Service** and click **OK**. + +8. Set **Host** to the node for which the alarm is generated and the active OMS node. + +9. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383877.png +.. |image2| image:: /_static/images/en-us_image_0269383878.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12052_tcp_temporary_port_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12052_tcp_temporary_port_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..45ba715 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12052_tcp_temporary_port_usage_exceeds_the_threshold.rst @@ -0,0 +1,140 @@ +:original_name: ALM-12052.html + +.. _ALM-12052: + +ALM-12052 TCP Temporary Port Usage Exceeds the Threshold +======================================================== + +Description +----------- + +The system checks the TCP temporary port usage every 30 seconds and compares the actual usage with the threshold (the default threshold is 80%). This alarm is generated when the TCP temporary port usage exceeds the threshold for several times (5 times by default) consecutively. + +To change the threshold, choose **O&M > Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Network Status** > **TCP Ephemeral Port Usage**. + +When the **Trigger Count** is 1, this alarm is cleared when the TCP temporary port usage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the TCP temporary port usage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12052 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Services on the host cannot establish external connections, and therefore they are interrupted. + +Possible Causes +--------------- + +- The temporary port cannot meet the current service requirements. +- The system is abnormal. + +Procedure +--------- + +**Expand the temporary port number range.** + +#. On FusionInsight Manager, click |image1| in the row where the alarm is located in the real-time alarm list and obtain the IP address of the host for which the alarm is generated. + +#. Log in to the host for which the alarm is generated as user **omm**. + +#. Run the **cat /proc/sys/net/ipv4/ip_local_port_range \|cut -f 1** command to obtain the value of the start port and run the **cat /proc/sys/net/ipv4/ip_local_port_range \|cut -f 2** command to obtain the value of the end port. The total number of temporary ports is the value of the end port minus the value of the start port. If the total number of temporary ports is smaller than 28,232, the random port range of the OS is narrow. Contact the system administrator to increase the port range. + +#. Run the **ss -ant 2>/dev/null \| grep -v LISTEN \| awk 'NR > 2 {print $4}'|cut -d ':' -f 2 \| awk '$1 >"**\ *Value of the start port*\ **" {print $1}' \| sort -u \| wc -l** command to calculate the number of used temporary ports. + +#. The formula for calculating the usage of the temporary ports is: Usage of the temporary ports = (Number of used temporary ports/Total number of temporary ports) x 100%. Check whether the temporary port usage exceeds the threshold. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`6 `. + +#. .. _alm-12052__li61526456151427: + + Wait for 5 minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Check whether the system environment is abnormal.** + +7. .. _alm-12052__li39311997145458: + + Run the following command to import the temporary file and view the frequently used ports in the **port_result.txt file**: + + **netstat -tnp\ \|sort > $BIGDATA_HOME/tmp/port_result.txt** + + .. 
code-block:: + + netstat -tnp|sort + + Active Internet connections (w/o servers) + + Proto Recv Send LocalAddress ForeignAddress State PID/ProgramName tcp 0 0 10-120-85-154:45433 10-120-85-154:9866 CLOSE_WAIT 94237/java + tcp 0 0 10-120-85-154:45434 10-120-85-154:9866 CLOSE_WAIT 94237/java + tcp 0 0 10-120-85-154:45435 10-120-85-154:9866 CLOSE_WAIT 94237/java + ... + +8. Run the following command to view the processes that occupy a large number of ports: + + **ps -ef \|grep** *PID* + + .. note:: + + - PID is the processes ID queried in :ref:`7 `. + + - Run the following command to collect information about all processes and check the processes that occupy a large number of ports: + + **ps -ef > $BIGDATA_HOME/tmp/ps_result.txt** + +9. After obtaining the administrator's approval, clear the processes that occupy a large number of ports. Wait for 5 minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`10 `. + +**Collect fault information.** + +10. .. _alm-12052__li57585220151427: + + On the FusionInsight Manager home page of the active cluster, choose **O&M** > **Log > Download**. + +11. Select **OMS** from the **Service** and click **OK**. + +12. Set **Host** to the node for which the alarm is generated and the active OMS node. + +13. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +14. Contact the O&M personnel and send the collected log information and files **port_result.txt** and **ps_result.txt**. Then, delete the two residual temporary files from the environment. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383880.png +.. |image2| image:: /_static/images/en-us_image_0269383881.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12053_host_file_handle_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12053_host_file_handle_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..b258969 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12053_host_file_handle_usage_exceeds_the_threshold.rst @@ -0,0 +1,152 @@ +:original_name: ALM-12053.html + +.. _ALM-12053: + +ALM-12053 Host File Handle Usage Exceeds the Threshold +====================================================== + +Description +----------- + +The system checks the file handle usage every 30 seconds and compares the actual usage with the threshold (the default threshold is 80%). This alarm is generated when the host file handle usage exceeds the threshold for several times (5 times by default) consecutively. + +To change the threshold, choose **O&M > Alarm** > **Thresholds** > *Name of the desired cluster* > **Host** > **Host Status** > **Host File Handle Usage**. + +When the **Trigger Count** is 1, this alarm is cleared when the host file handle usage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the host file handle usage is less than or equal to 90% of the threshold. 
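+
+As a quick illustration, the usage reported by this alarm can be approximated from **/proc/sys/fs/file-nr**, whose first field is the number of allocated file handles and whose third field is the system-wide maximum (the same file is read later in the procedure). This is only a sketch and assumes it is run directly on the affected host.
+
+.. code-block::
+
+   # Allocated handles, unused handles, and the system-wide maximum.
+   read USED UNUSED MAX < /proc/sys/fs/file-nr
+   echo "Host file handle usage: $(( USED * 100 / MAX ))%"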
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12053 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The I/O operations, such as opening a file or connecting to network, cannot be performed and programs are abnormal. + +Possible Causes +--------------- + +- The application process is abnormal. For example, the opened file or socket is not closed. +- The number of file handles cannot meet the current service requirements. +- The system is abnormal. + +Procedure +--------- + +**Check information about files opened in processes.** + +#. On FusionInsight Manager, click |image1| in the row where the alarm is located in the real-time alarm list and obtain the IP address of the host for which the alarm is generated. + +#. Log in to the host for which the alarm is generated as user **root**. + +#. Run the **lsof -n|awk '{print $2}'|sort|uniq -c|sort -nr|more** command to check the process that occupies excessive file handles. + +#. Check whether the processes in which a large number of files are opened are normal. For example, check whether there are files or sockets not closed. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`7 `. + +#. .. _alm-12053__li698311306446: + + Release the abnormal processes that occupy too many file handles. + +#. Five minutes later, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Increase the number of file handles.** + +7. .. _alm-12053__li50842733151924: + + On FusionInsight Manager, click |image2| in the row where the alarm is located in the real-time alarm list and obtain the IP address of the host for which the alarm is generated. + +8. Log in to the host for which the alarm is generated as user **root**. + +9. .. _alm-12053__li103121715194518: + + Contact the system administrator to increase the number of system file handles. + +10. 
Run the **cat /proc/sys/fs/file-nr** command to view the used handles and the maximum number of file handles. The first value is the number of used handles, the third value is the maximum number. Please check whether the usage exceeds the threshold. + + - If yes, go to :ref:`9 `. + + - If no, go to :ref:`11 `. + + .. code-block:: + + # cat /proc/sys/fs/file-nr + 12704 0 640000 + +11. .. _alm-12053__li133010151924: + + Wait for 5 minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`12 `. + +**Check whether the system environment is abnormal.** + +12. .. _alm-12053__li21666806151924: + + Contact the system administrator to check whether the operating system is abnormal. + + - If yes, go to :ref:`13 ` to rectify the fault. + - If no, go to :ref:`14 `. + +13. .. _alm-12053__li23370043151924: + + Wait for 5 minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`14 `. + +**Collect fault information.** + +14. .. _alm-12053__li58218801151924: + + On the FusionInsight Manager home page of the active cluster, choose **O&M** > **Log > Download**. + +15. Select **OMS** from the **Service** and click **OK**. + +16. Set **Host** to the node for which the alarm is generated and the active OMS node. + +17. Click |image3| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +18. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383882.png +.. |image2| image:: /_static/images/en-us_image_0269383883.png +.. |image3| image:: /_static/images/en-us_image_0269383884.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12054_invalid_certificate_file.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12054_invalid_certificate_file.rst new file mode 100644 index 0000000..38521ab --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12054_invalid_certificate_file.rst @@ -0,0 +1,152 @@ +:original_name: ALM-12054.html + +.. _ALM-12054: + +ALM-12054 Invalid Certificate File +================================== + +Description +----------- + +The system checks whether the certificate file is invalid (has expired or is not valid yet) on 23:00 every day. This alarm is generated when the certificate file is invalid. + +This alarm is cleared when a valid certificate is imported. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12054 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------+ +| Name | Meaning | ++===================+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Some functions are unavailable. + +Possible Causes +--------------- + +No certificate (CA certificate, HA root certificate, HA user certificate, Gaussdb root certificate, or Gaussdb user certificate) is imported to the system, the certificate fails to be imported, or the certificate file is invalid. + +Procedure +--------- + +**Check the alarm cause.** + +#. On FusionInsight Manager, locate the target alarm in the real-time alarm list and click |image1|. + + View **Additional Information** to obtain the additional information about the alarm. + + - If **CA Certificate** is displayed in the additional alarm information, log in to the active OMS management node as user **omm** and go to :ref:`2 `. + - If **HA root Certificate** is displayed in the additional information, view **Location** to obtain the name of the host involved in this alarm. Then, log in to the host as user **omm** and go to :ref:`3 `. + - If **HA server Certificate** is displayed in the additional information, view **Location** to obtain the name of the host involved in this alarm. Then, log in to the host as user **omm** and go to :ref:`4 `. + - If **Certificate has expired** is displayed in the additional information, view **Location** to obtain the host name of the node for which the alarm is generated. Then, log in to the host as user **omm** and perform :ref:`2 ` to :ref:`4 ` to check whether the certificates have expired. If these certificates have not expired, check whether other certificates have been imported. If yes, import the certificate files again. + +**Check the validity period of the certificate files in the system.** + +2. .. _alm-12054__li2768003415237: + + Check whether the current system time is in the validity period of the CA certificate. + + Run the **bash ${CONTROLLER_HOME}/security/cert/conf/querycertvalidity.sh** command to check the effective time and due time of the CA root certificate. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`5 `. + +3. .. _alm-12054__li6628516015237: + + Check whether the current system time is in the validity period of the HA root certificate. + + Run the **openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/root-ca.crt** command to check the effective time and due time of the HA root certificate. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`6 `. + +4. .. _alm-12054__li64457371511: + + Check whether the current system time is in the validity period of the HA user certificate. + + Run the **openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/server.crt** command to check the effective time and due time of the HA user certificate. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`6 `. + +The following is an example of the effective time and due time of a CA or HA certificate: + +.. 
code-block:: + + Certificate: + Data: + Version: 3 (0x2) + Serial Number: + 97:d5:0e:84:af:ec:34:d8 + Signature Algorithm: sha256WithRSAEncryption + Issuer: C=CN, ST=xxx, L=yyy, O=zzz, OU=IT, CN=HADOOP.COM + Validity + Not Before: Dec 13 06:38:26 2016 GMT // Effective time + Not After : Dec 11 06:38:26 2026 GMT // Due time + +**Import certificate files.** + +5. .. _alm-12054__li99782015237: + + Import a new CA certificate file. + + Apply for or generate a new CA certificate file and import it to the system. The alarm is automatically cleared after the CA certificate is imported. Check whether this alarm is reported again during periodic check. + + - If yes, go to :ref:`7 `. + - If no, no further action is required. + +6. .. _alm-12054__li3092985115237: + + Import a new HA certificate file. + + Apply for or generate a new HA certificate file and import it to the system. The alarm is automatically cleared after the CA certificate is imported. Check whether this alarm is reported again during periodic check. + + - If yes, go to :ref:`7 `. + - If no, no further action is required. + +**Collect the fault information.** + +7. .. _alm-12054__li993320915237: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +8. In the **Services** area, select **Controller**, **OmmServer**, **OmmCore**, and **Tomcat**, and click **OK**. + +9. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895749.png +.. |image2| image:: /_static/images/en-us_image_0263895382.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12055_the_certificate_file_is_about_to_expire.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12055_the_certificate_file_is_about_to_expire.rst new file mode 100644 index 0000000..7e7f210 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12055_the_certificate_file_is_about_to_expire.rst @@ -0,0 +1,151 @@ +:original_name: ALM-12055.html + +.. _ALM-12055: + +ALM-12055 The Certificate File Is About to Expire +================================================= + +Description +----------- + +The system checks the certificate file on 23:00 every day. This alarm is generated if the certificate file is about to expire within 30 days. + +This alarm is cleared when a certificate that is not about to expire is imported. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12055 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------+ +| Name | Meaning | ++===================+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Some functions are unavailable. + +Possible Causes +--------------- + +The remaining validity period of a system certificate (CA certificate, HA root certificate, HA user certificate, Gaussdb root certificate, or Gaussdb user certificate) is less than 30 days. + +Procedure +--------- + +**Check the alarm cause.** + +#. On FusionInsight Manager, locate the target alarm in the real-time alarm list and click |image1|. + + View **Additional Information** to obtain the additional information about the alarm. + + - If **CA Certificate** is displayed in the additional alarm information, log in to the active OMS management node as user **omm** and go to :ref:`2 `. + - If **HA root Certificate** is displayed in the additional information, view **Location** to obtain the name of the host involved in this alarm. Then, log in to the host as user **omm** and go to :ref:`3 `. + - If **HA server Certificate** is displayed in the additional information, view **Location** to obtain the name of the host involved in this alarm. Then, log in to the host as user **omm** and go to :ref:`4 `. + +**Check the validity period of the certificate files in the system.** + +2. .. _alm-12055__li31866665152950: + + Check whether the remaining validity period of the CA certificate is smaller than the alarm threshold. + + Run the **bash ${CONTROLLER_HOME}/security/cert/conf/querycertvalidity.sh** command to check the effective time and due time of the CA root certificate. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`7 `. + +3. .. _alm-12055__li35214520152950: + + Check whether the remaining validity period of the HA root certificate is smaller than the alarm threshold. + + Run the **openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/root-ca.crt** command to check the effective time and due time of the HA root certificate. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`7 `. + +4. .. _alm-12055__li089064874420: + + Check whether the remaining validity period of the HA user certificate is smaller than the alarm threshold. + + Run the **openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/server.crt** command to check the effective time and due time of the HA user certificate. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`7 `. + +The following is an example of the effective time and due time of a CA or HA certificate: + +.. code-block:: + + Certificate: + Data: + Version: 3 (0x2) + Serial Number: + 97:d5:0e:84:af:ec:34:d8 + Signature Algorithm: sha256WithRSAEncryption + Issuer: C=CN, ST=xxx, L=yyy, O=zzz, OU=IT, CN=HADOOP.COM + Validity + Not Before: Dec 13 06:38:26 2016 GMT // Effective time + Not After : Dec 11 06:38:26 2026 GMT // Due time + +**Import certificate files.** + +5. .. 
_alm-12055__li12048984152950: + + Import a new CA certificate file. + + Apply for or generate a new CA certificate file and import it to the system. Manually clear the alarm and check whether this alarm is generated again during periodic check. + + - If yes, go to :ref:`7 `. + - If no, no further action is required. + +6. .. _alm-12055__li13690164035120: + + Import a new HA certificate file. + + Apply for or generate a new HA certificate file and import it to the system. Manually clear the alarm and check whether this alarm is generated again during periodic check. + + - If yes, go to :ref:`7 `. + - If no, no further action is required. + +**Collect the fault information.** + +7. .. _alm-12055__li48423894152950: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +8. In the **Services** area, select **Controller**, **OmmServer**, **OmmCore**, and **Tomcat**, and click **OK**. + +9. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895749.png +.. |image2| image:: /_static/images/en-us_image_0263895382.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12057_metadata_not_configured_with_the_task_to_periodically_back_up_data_to_a_third-party_server.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12057_metadata_not_configured_with_the_task_to_periodically_back_up_data_to_a_third-party_server.rst new file mode 100644 index 0000000..3211f81 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12057_metadata_not_configured_with_the_task_to_periodically_back_up_data_to_a_third-party_server.rst @@ -0,0 +1,84 @@ +:original_name: ALM-12057.html + +.. _ALM-12057: + +ALM-12057 Metadata Not Configured with the Task to Periodically Back Up Data to a Third-Party Server +==================================================================================================== + +Description +----------- + +After the system is installed, it checks whether the task for periodically backing up metadata to the third-party server, and then performs the check hourly. If the task for periodically backing up metadata to a third-party server is not configured, a critical alarm is generated. + +This alarm is cleared when a user creates such a backup task. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12057 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If metadata is not backed up to a third-party server, metadata cannot be restored if both the active and standby management nodes of the cluster are faulty and local backup data is lost. + +Possible Causes +--------------- + +Metadata is not configured with the task to periodically back up data to a third-party server. + +Procedure +--------- + +#. On the FusionInsight Manager portal choose **O&M > Alarm > Alarms**. +#. In the alarm list, click |image1| in the row where the alarm is located and identify the data module from which the alarm is generated based on **Additional Information**. +#. Choose **O&M** > **Backup and Restoration > Backup Management** > **Create**. +#. Configure a backup task. The backup data to be configured is consistent with the data in Additional Information of the alarm. +#. After the backup task is created successfully, wait for two minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information** + +6. .. _alm-12057__li1185962516113: + + On FusionInsight Manager, choose **O&M** > **Log > Download**. + +7. In the **Service** area, select **Controller** and click **OK**. + +8. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383889.png +.. |image2| image:: /_static/images/en-us_image_0269383890.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12061_process_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12061_process_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..119cbd5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12061_process_usage_exceeds_the_threshold.rst @@ -0,0 +1,140 @@ +:original_name: ALM-12061.html + +.. _ALM-12061: + +ALM-12061 Process Usage Exceeds the Threshold +============================================= + +Description +----------- + +The system checks the usage of the omm process every 30 seconds. Users can run the **ps -o nlwp, pid, args, -u omm \| awk '{sum+=$1} END {print "", sum}'** command to obtain the number of concurrent processes of user **omm**. Run the **ulimit -u**\ command to obtain the maximum number of processes that can be simultaneously opened by user **omm**. Divide the number of concurrent processes by the maximum number to obtain the process usage of user **omm**. The process usage has a default threshold. This alarm is generated when the process usage exceeds the threshold. 
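+
+A slightly tidied sketch of that calculation is shown below. It is an illustration only and assumes it is run as user **omm** on the node for which the alarm is generated.
+
+.. code-block::
+
+   # Total number of light-weight processes (threads) owned by user omm.
+   USED=$(ps -o nlwp= -u omm | awk '{sum += $1} END {print sum}')
+   # Maximum number of processes that user omm can open concurrently.
+   LIMIT=$(ulimit -u)
+   echo "omm process usage: $(( USED * 100 / LIMIT ))%"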
+ +If **Trigger Count** is **3** and the process usage is less than or equal to the threshold, this alarm is cleared. If **Trigger Count** is greater than **1**\ and the process usage is less than or equal to 90% of the threshold, this alarm is cleared. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12061 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------+ +| Name | Meaning | ++===================+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- Switch to user **omm** fails. +- New omm process cannot be created. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The maximum number of processes (including threads) that can be concurrently opened by user **omm** is inappropriate. +- An excessive number of threads are opened at the same time. + +Procedure +--------- + +**Check whether the alarm threshold or alarm hit number is properly configured.** + +#. On the FusionInsight Manager, change the alarm threshold and **Trigger Count** based on the actual CPU usage. + + Specifically, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host**> **Process** > **omm Process Usage** to change Trigger Count. + + .. note:: + + The alarm is generated when the process usage exceeds the threshold for the times specified by **Trigger Count**. + + Set the alarm threshold based on the actual process usage. To check the process usage, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Host**> **Process** > **omm Process Usage**, as shown in :ref:`Figure 1 `. + + .. _alm-12061__fig437414238216: + + .. figure:: /_static/images/en-us_image_0000001440858217.png + :alt: **Figure 1** Setting an alarm threshold + + **Figure 1** Setting an alarm threshold + +#. 2 minutes later, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`3 `. + +**Check whether the maximum number of processes (including threads) opened by user omm is appropriate.** + +3. .. _alm-12061__li936717234216: + + In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the IP address of the host for which the alarm is generated. + +4. Log in to the host where the alarm is generated as user **root**. + +5. Run the **su - omm** command to switch to user **omm**. + +6. 
Run the **ulimit -u** command to obtain the maximum number of threads that can be concurrently opened by user **omm** and check whether the number is greater than or equal to 60000. + + - If it is, go to :ref:`8 `. + - If it is not, go to :ref:`7 `. + +7. .. _alm-12061__li8367152314217: + + Run the **ulimit -u 60000** command to change the maximum number to 60000. Two minutes later, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`12 `. + +**Check whether an excessive number of processes are opened at the same time.** + +8. .. _alm-12061__li293443912213: + + In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the IP address of the host for which the alarm is generated. + +9. Log in to the host where the alarm is generated as user **root**. + +10. Run the **ps -o nlwp, pid, lwp, args, -u omm|sort -n** command to check the numbers of threads used by the system. The result is sorted based on the thread number. Analyze the top 5 thread numbers and check whether the threads are incorrectly used. If they are, contact maintenance personnel to rectify the fault. If they are not, run the **ulimit -u** command to change the maximum number to be greater than 60000. + +11. Five minutes later, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`12 `. + +**Collect fault information.** + +12. .. _alm-12061__li1668345092117: + + On the FusionInsight Manager home page of the active clusters, choose **O&M** > **Log** > **Download**. + +13. Select **OmmServer** and **NodeAgent** from the **Service** and click **OK**. + +14. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +15. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383906.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12062_oms_parameter_configurations_mismatch_with_the_cluster_scale.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12062_oms_parameter_configurations_mismatch_with_the_cluster_scale.rst new file mode 100644 index 0000000..f1aac84 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12062_oms_parameter_configurations_mismatch_with_the_cluster_scale.rst @@ -0,0 +1,129 @@ +:original_name: ALM-12062.html + +.. _ALM-12062: + +ALM-12062 OMS Parameter Configurations Mismatch with the Cluster Scale +====================================================================== + +Description +----------- + +The system checks whether the OMS parameter configurations match with the cluster scale at each top hour. If the OMS parameter configurations do not meet the cluster scale requirements, the system generates this alarm. This alarm is automatically cleared when the OMS parameter configurations are modified. 
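+
+Before following the procedure, the mismatch can usually be confirmed quickly by searching the Controller script log that the procedure refers to. The following command is a convenience sketch only; the log path and message text are the ones quoted in the procedure below.
+
+.. code-block::
+
+   # Run as user omm on the node for which the alarm is generated
+   grep "Current oms configurations cannot support" "$BIGDATA_LOG_HOME/controller/scriptlog/modify_manager_param.log"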
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12062 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+---------------------------------------------------------------------+ +| Parameter | Description | ++=============+=====================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| ServiceName | Specifies the name of the service for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The OMS configuration is not modified when the cluster is installed or the system capacity is expanded. + +Possible Causes +--------------- + +The OMS parameter configurations mismatch with the cluster scale. + +Procedure +--------- + +**Check whether the OMS parameter configurations match with the cluster scale.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the IP address of the host for which the alarm is generated. +#. Log in to the host where the alarm is generated as user **root**. +#. Run the **su - omm** command to switch to user **omm**. +#. Run the **vi $BIGDATA_LOG_HOME/controller/scriptlog/modify_manager_param.log** command to open the log file and search for the log file containing the following information: Current oms configurations cannot support *xx* nodes. In the information, *xx* indicates the number of nodes in the cluster. +#. Optimize the current cluster configuration by following the instructions in :ref:`Optimizing Manager Configurations Based on the Number of Cluster Nodes `. +#. One hour later, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-12062__li8140111212587: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +8. Select **Controller** from the **Service** and click **OK**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +.. _alm-12062__section117861721171717: + +Related Information +------------------- + +**Optimizing Manager Configurations Based on the Number of Cluster Nodes** + +#. Log in to the active Manager node as user **omm**. + +#. Run the following command to switch the directory: + + **cd ${BIGDATA_HOME}/om-server/om/sbin** + +#. Run the following command to view the current Manager configurations. + + **sh oms_config_info.sh -q** + +#. Run the following command to specify the number of nodes in the current cluster. 
+ + Command format: **sh oms_config_info.sh -s** *number of nodes* + + Example: + + **sh oms_config_info.sh -s 1000** + + Enter **y** as prompted. + + .. code-block:: + + The following configurations will be modified: + Module Parameter Current Target + Controller controller.Xmx 4096m => 16384m + Controller controller.Xms 1024m => 8192m + Controller controller.node.heartbeat.error.threshold 30000 => 60000 + Pms pms.mem 8192m => 10240m + Do you really want to do this operation? (y/n): + + The configurations are updated successfully if the following information is displayed: + + .. code-block:: + + ... + Operation has been completed. Now restarting OMS server. [done] + Restarted oms server successfully. + + .. note:: + + - OMS is automatically restarted during the configuration update process. + - Clusters with similar quantities of nodes have same Manager configurations. For example, when the number of nodes is changed from 100 to 101, no configuration item needs to be updated. + +.. |image1| image:: /_static/images/en-us_image_0269383907.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12063_unavailable_disk.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12063_unavailable_disk.rst new file mode 100644 index 0000000..3612a84 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12063_unavailable_disk.rst @@ -0,0 +1,106 @@ +:original_name: ALM-12063.html + +.. _ALM-12063: + +ALM-12063 Unavailable Disk +========================== + +Description +----------- + +The system checks whether the data disk of the current host is available at the top of each hour. The system creates files, writes files, and deletes files in the mount directory of the disk. If the operations fail, the alarm is generated. If the operations succeed, the disk is available, and the alarm is cleared. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12063 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+---------------------------------------------------------------------+ +| Parameter | Description | ++=============+=====================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| ServiceName | Specifies the name of the service for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| DiskName | Specifies the disk for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Data read or write on the data disk fails, and services are abnormal. + +Possible Causes +--------------- + +- The permission of the disk mount directory is abnormal. +- There are disk bad sectors. 
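+
+The availability check that triggers this alarm (creating, writing, and deleting a file in the mount directory of the disk) can also be run by hand to confirm the fault before starting the procedure. The sketch below is illustrative only; **/srv/BigData/data1** is an assumed example and must be replaced with the mount point of the disk reported in **DiskName**.
+
+.. code-block::
+
+   mount_dir=/srv/BigData/data1                  # assumed example; use the mount point of the alarmed disk
+   test_file="${mount_dir}/.disk_check_$$"
+   if touch "${test_file}" && echo check > "${test_file}" && rm -f "${test_file}"; then
+       echo "${mount_dir} is readable and writable"
+   else
+       echo "${mount_dir} is NOT available"
+   fi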
+ +Procedure +--------- + +**Check whether the permission of the disk mount directory is normal.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the IP address of the host and **DiskName** for the disk for which the alarm is generated. +#. Log in to the host where the alarm is generated as user **root**. +#. Run the **df -h \|grep DiskName** command to obtain the mount point and check whether the permission of the mount directory is unwritable or unreadable. + + - If it is, go to :ref:`4 `. + - If it is not, go to :ref:`8 `. + + .. note:: + + If the permission of the mount directory is 000 or the owner is **root**, the mount directory is unreadable and unwritable. + +4. .. _alm-12063__li1053537184512: + + Modify the directory permission. + +5. One hour later, check whether this alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`6 `. + +6. .. _alm-12063__li4535871458: + + Contact hardware engineers to rectify the disk. + +7. One hour later, check whether this alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`8 `. + +**Collect fault information.** + +8. .. _alm-12063__li8140111212587: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +9. Select **NodeAgent** from the **Service** and click **OK**. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383908.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12064_host_random_port_range_conflicts_with_cluster_used_port.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12064_host_random_port_range_conflicts_with_cluster_used_port.rst new file mode 100644 index 0000000..14fa277 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12064_host_random_port_range_conflicts_with_cluster_used_port.rst @@ -0,0 +1,94 @@ +:original_name: ALM-12064.html + +.. _ALM-12064: + +ALM-12064 Host Random Port Range Conflicts with Cluster Used Port +================================================================= + +Alarm Description +----------------- + +The system checks whether the random port range of the host conflicts with the range of ports used by the Cluster system every hour. The alarm is generated if they conflict. The alarm is automatically cleared when the random port range of the host is changed to the normal range. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12064 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+---------------------------------------------------------------------+ +| Parameter | Description | ++=============+=====================================================================+ +| Source | Specifiestheclusterorsystemforwhichthealarmisgenerated. 
| ++-------------+---------------------------------------------------------------------+ +| ServiceName | Specifies the name of the service for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The default port of the Cluster system is occupied. As a result, some processes fail to be started. + +Possible Causes +--------------- + +The random port range configuration is modified. + +Procedure +--------- + +**Check the random port range of the system.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the IP address of the host for which the alarm is generated. + +#. Log in to the host where the alarm is generated as user **root**. + +#. Run the **cat /proc/sys/net/ipv4/ip_local_port_range** command to obtain the random port range of the host and check whether the minimum value is smaller than 32768. + + - If it is, go to :ref:`4 `. + - If it is not, goto :ref:`7 `. + +#. .. _alm-12064__li1796713455375: + + Run the **vim /etc/sysctl.conf** command to change the value of **net.ipv4.ip_local_port_range** to **32768 61000**. If this parameter does not exist, add the following configuration: **net.ipv4.ip_local_port_range = 32768 61000**. + +#. Run the **sysctl -p /etc/sysctl.conf** command for the modification to take effect. + +#. One hour later, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-12064__li1396704514377: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +8. Select **NodeAgent** for **Service** and click **OK**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383909.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12066_trust_relationships_between_nodes_become_invalid.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12066_trust_relationships_between_nodes_become_invalid.rst new file mode 100644 index 0000000..566a6f8 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12066_trust_relationships_between_nodes_become_invalid.rst @@ -0,0 +1,155 @@ +:original_name: ALM-12066.html + +.. _ALM-12066: + +ALM-12066 Trust Relationships Between Nodes Become Invalid +========================================================== + +Description +----------- + +The system checks whether the trust relationship between the active OMS node and other Agent nodes is normal every hour. The alarm is generated if the mutual trust fails. 
This alarm is automatically cleared if this problem is resolved. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12066 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Some operations on the management plane may be abnormal. + +Possible Causes +--------------- + +- The **/etc/ssh/sshd_config** configuration file is damaged. +- The password of user **omm** has expired. + +Procedure +--------- + +**Check the status of the /etc/ssh/sshd_config configuration file.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm and click |image1| to view the host list in the alarm details. + +#. Log in to the active OMS node as user **omm**. + +#. Run the **ssh** command, for example, **ssh** **host2**, on each node in the alarm details to check whether the connection fails. (**host2** is a node other than the OMS node in the alarm details.) + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-12066__li176321676280: + + Open the **/etc/ssh/sshd_config** configuration file on host2 and check whether **AllowUsers** or **DenyUsers** is configured for other nodes. + + - If yes, go to :ref:`5 `. + - If no, contact OS experts. + +#. .. _alm-12066__li846318425575: + + Modify the whitelist or blacklist to ensure that user **omm** is in the whitelist or not in the blacklist. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Check the status of the password of user omm.** + +6. .. _alm-12066__li9148131091317: + + Check the interaction information of the **ssh** command. + + - If the password of user **omm** is required, go to :ref:`7 `. + - If message "Enter passphrase for key '/home/omm/.ssh/id_rsa':" is displayed, go to :ref:`9 `. + +7. .. _alm-12066__li81482101138: + + Check the trust list (**/home/omm/.ssh/authorized_keys**) of user **omm** on the OMS node and host2 node. Check whether the trust list contains the public key file (**/home/omm/.ssh/id_rsa.pub**) of user **omm** on the peer host. + + - If yes, contact OS experts. + - If no, add the public key of user **omm** of the peer host to the trust list of the local host. + +8. Add the public key of user **omm** of the peer host to the trust list of the local host. Run the **ssh** command, for example, **ssh host2**, on each node in the alarm details to check whether the connection fails. (**host2** is a node other than the OMS node in the alarm details.) + + - If yes, go to :ref:`9 `. + - If no, check whether the alarm is cleared. 
If the alarm is cleared, no further action is required; otherwise, go to :ref:`9 `. + +**Collect the fault information.** + +9. .. _alm-12066__li106306742813: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +10. Select **Controller** for **Service** and click **OK**. + +11. Click |image2| in the upper right corner to set the log collection time range. Generally, the time range is 10 minutes before and after the alarm generation time. Click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +Perform the following steps to handle abnormal trust relationships between nodes: + +.. important:: + + - Perform this operation as user **omm**. + - If the network between nodes is disconnected, rectify the network fault first. Check whether the two nodes are connected to the same security group and whether **hosts.deny** and **hosts.allow** are set. + +#. Run the **ssh-add -l** command on both nodes to check whether any identities exist. + + |image3| + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 `. + +#. .. _alm-12066__li16978123275815: + + If no identities are displayed, run the **ps -ef|grep ssh-agent** command to find the **ssh-agent** process, stop the process, and wait for the process to automatically restart. + + |image4| + +#. Run the **ssh-add -l** command to check whether the identities have been added. If yes, manually run the **ssh** command to check whether the trust relationship is normal. + + |image5| + +#. .. _alm-12066__li09782325586: + + If identities exist, check whether the **/home/omm/.ssh/authorized_keys** file contains the information in the **/home/omm/.ssh/id_rsa.pub** file of the peer node. If it does not, manually add the information. + +#. Check whether the permissions on the files in the **/home/omm/.ssh** directory are modified. + +#. Check the **/var/log/Bigdata/nodeagent/scriptlog/ssh-agent-monitor.log** file. + +#. If the **/home** directory of user **omm** is deleted, contact MRS support personnel for assistance. + +.. |image1| image:: /_static/images/en-us_image_0263895789.png +.. |image2| image:: /_static/images/en-us_image_0263895540.png +.. |image3| image:: /_static/images/en-us_image_0000001226576418.png +.. |image4| image:: /_static/images/en-us_image_0000001227056330.png +.. |image5| image:: /_static/images/en-us_image_0000001271536445.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12067_tomcat_resource_is_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12067_tomcat_resource_is_abnormal.rst new file mode 100644 index 0000000..1bf8852 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12067_tomcat_resource_is_abnormal.rst @@ -0,0 +1,89 @@ +:original_name: ALM-12067.html + +.. _ALM-12067: + +ALM-12067 Tomcat Resource Is Abnormal +===================================== + +Description +----------- + +HA checks the Tomcat resources of Manager every 85 seconds. This alarm is generated when HA detects that the Tomcat resources are abnormal for two consecutive times. + +This alarm is cleared when HA detects that the Tomcat resources become normal. 
+ +**Resource Type** of Tomcat is **Single-active**. Active/standby will be triggered upon resource exceptions. When this alarm is generated, the active/standby switchover is complete and new Tomcat resources have been enabled on the new active Manager. In this case, this alarm is cleared. This alarm is used to notify users of the cause of the active/standby Manager switchover. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12067 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- The active/standby Manager switchover occurs. +- The Tomcat process repeatedly restarts. + +Possible Causes +--------------- + +- The Tomcat directory permission is abnormal, and the Tomcat process is abnormal. + +Procedure +--------- + +**Check whether the permission on the Tomcat directory is normal.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and click |image1| to view the IP address of the host for which the alarm is generated. +#. Log in to the alarm host as user **root**. +#. Run the **su - omm** command to switch to user **omm**. +#. Run the **vi $BIGDATA_LOG_HOME/omm/oms/ha/scriptlog/tomcat.log** command to check whether the Tomcat resource log contains keyword **Cannot find XXX** and rectify the file permission based on the keyword. +#. After 5 minutes, check whether the alarm is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-12067__li711211264288: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. In the **Services** area, select **OmmServer** and **Tomcat**, and click **OK**. + +8. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895412.png +.. 
|image2| image:: /_static/images/en-us_image_0263895407.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12068_acs_resource_exception.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12068_acs_resource_exception.rst new file mode 100644 index 0000000..8c244ff --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12068_acs_resource_exception.rst @@ -0,0 +1,98 @@ +:original_name: ALM-12068.html + +.. _ALM-12068: + +ALM-12068 ACS Resource Exception +================================ + +Description +----------- + +HA checks the ACS resources of Manager every 80 seconds. This alarm is generated when HA detects that the ACS resources are abnormal for two consecutive times. + +This alarm is cleared when HA detects that the ACS resources are normal. + +**Resource Type** of ACS is **Single-active**. Active/standby will be triggered upon resource exceptions. When this alarm is generated, the active/standby switchover is complete and new ACS resources have been enabled on the new active Manager. In this case, this alarm is cleared. This alarm is used to notify users of the cause of the active/standby Manager switchover. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12068 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- The active/standby Manager switchover occurs. +- The ACS process repeatedly restarts, which may cause the FusionInsight Manager login failure. + +Possible Causes +--------------- + +The ACS process is abnormal. + +Procedure +--------- + +**Check whether the ACS process is normal.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and click |image1| to view the name of the host for which the alarm is generated. + +#. Log in to the alarm host as user **root**. + +#. Run the **su - omm** command and then **sh ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/module/hacom/script/status_ha.sh** to check whether the status of the ACS resources managed by the HA is normal. In the single-node system, the ACS resource is in the normal state. In the dual-node system, the ACS resource is in the normal state on the active node and in the stopped state on the standby node. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`4 `. + +#. .. 
_alm-12068__li139657016249: + + Run the **vi $BIGDATA_LOG_HOME/omm/oms/ha/scriptlog/acs.log** command to check whether the ACS resource log of HA contains the keyword **ERROR**. If yes, analyze the logs to locate the resource exception cause and fix the exception. + +#. After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-12068__li6152360163635: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. In the **Services** area, select **Controller** and **OmmServer**, and click **OK**. + +8. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895733.png +.. |image2| image:: /_static/images/en-us_image_0263895594.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12069_aos_resource_exception.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12069_aos_resource_exception.rst new file mode 100644 index 0000000..340a8d1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12069_aos_resource_exception.rst @@ -0,0 +1,98 @@ +:original_name: ALM-12069.html + +.. _ALM-12069: + +ALM-12069 AOS Resource Exception +================================ + +Description +----------- + +HA checks the AOS resources of Manager every 81 seconds. This alarm is generated when HA detects that the AOS resources are abnormal for two consecutive times. + +This alarm is cleared when HA detects that the AOS resources become normal. + +**Resource Type** of AOS is **Single-active**. Active/standby will be triggered upon resource exceptions. When this alarm is generated, the active/standby switchover is complete and new AOS resources have been enabled on the new active Manager. In this case, this alarm is cleared. This alarm is used to notify users of the cause of the active/standby Manager switchover. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12069 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. 
| ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- The active/standby Manager switchover occurs. +- The AOS process repeatedly restarts, which may cause the FusionInsight Manager login failure. + +Possible Causes +--------------- + +The AOS process is abnormal. + +Procedure +--------- + +**Check whether the AOS process is normal.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and click |image1| to view the name of the host for which the alarm is generated. + +#. Log in to the alarm host as user **root**. + +#. Run the **sh ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/module/hacom/script/status_ha.sh** command to check whether the status of the AOS resources managed by the HA is normal. In the single-node system, the AOS resource is in the normal state. In the dual-node system, the AOS resource is in the normal state on the active node and in the stopped state on the standby node. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`4 `. + +#. .. _alm-12069__li139657016249: + + Run the **vi $BIGDATA_LOG_HOME/omm/oms/ha/scriptlog/aos.log** command to check whether the AOS resource log of HA contains the keyword **ERROR**. If yes, analyze the logs to locate the resource exception cause and fix the exception. + +#. After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-12069__li6152360163635: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. In the **Services** area, select **Controller** and **OmmServer**, and click **OK**. + +8. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895369.png +.. |image2| image:: /_static/images/en-us_image_0263895883.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12070_controller_resource_is_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12070_controller_resource_is_abnormal.rst new file mode 100644 index 0000000..10928f9 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12070_controller_resource_is_abnormal.rst @@ -0,0 +1,97 @@ +:original_name: ALM-12070.html + +.. _ALM-12070: + +ALM-12070 Controller Resource Is Abnormal +========================================= + +Alarm Description +----------------- + +HA checks the controller resources of Manager every 80 seconds. This alarm is generated when HA detects that the controller resources are abnormal for 2 consecutive times. + +This alarm is cleared when the Controller resource is normal. + +**Resource Type** of Controller is **Single-active**. Active/standby will be triggered upon resource exceptions. 
When this alarm is generated, the active/standby switchover is complete and new Controller resources have been enabled on the new active FusionInsight Manager. In this case, this alarm is cleared. This alarm is used to notify users of the cause of the active/standby switchover. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12070 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+---------------------------------------------------------------------+ +| Parameter | Description | ++=============+=====================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| ServiceName | Specifies the name of the service for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- The active/standby FusionInsight Manager switchover occurs. +- The Controller process repeatedly restarts, which may cause the FusionInsight Manager login failure. + +Possible Causes +--------------- + +The Controller process is abnormal. + +Procedure +--------- + +**Check whether the controller process is normal.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the name of the host for which the alarm is generated. + +#. Log in to the host for which the alarm is generated as user **root**. + +#. Run the **su - omm** command to switch to user **omm**.Run the **sh ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/module/hacom/script/status_ha.sh** command to check whether the status of the Controller resources managed by the HA is normal. In the single-node system, the Controller resource is in the normal state. In the dual-node system, the Controller resource is in the normal state on the active node and in the stopped state on the standby node. + + - If it is, go to :ref:`6 `. + - If it is not, go to :ref:`4 `. + +#. .. _alm-12070__li6903202312318: + + Run the **vi $BIGDATA_LOG_HOME/omm/oms/ha/scriptlog/controller.log** command to view the Controller resource logs, and run the **vi $BIGDATA_LOG_HOME/controller/controller.log** command to view the Controller running logs, check whether the keyword **ERROR** exists. Analyze the logs to locate and rectify the fault. + +#. Five minutes later, check whether this alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-12070__li69038231234: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +7. Select **Controller** and **OmmServe** for **Service** and click **OK**. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour before and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected log information. 
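+
+As a supplement to steps 3 and 4 of the procedure, the resource status check and the error search in the Controller logs can be combined as shown in the following sketch (illustrative only; run as user **omm** on the active management node). The **grep -i controller** filter assumes the resource name appears in the script output; drop it to view the full status.
+
+.. code-block::
+
+   # Show the HA status of the Controller resource
+   sh ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/module/hacom/script/status_ha.sh | grep -i controller
+   # Show the most recent errors in the Controller resource log and running log
+   grep -i "ERROR" $BIGDATA_LOG_HOME/omm/oms/ha/scriptlog/controller.log | tail -n 20
+   grep -i "ERROR" $BIGDATA_LOG_HOME/controller/controller.log | tail -n 20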
+ +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383915.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12071_httpd_resource_is_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12071_httpd_resource_is_abnormal.rst new file mode 100644 index 0000000..74c61bd --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12071_httpd_resource_is_abnormal.rst @@ -0,0 +1,99 @@ +:original_name: ALM-12071.html + +.. _ALM-12071: + +ALM-12071 Httpd Resource Is Abnormal +==================================== + +Description +----------- + +HA checks the httpd resources of Manager every 120 seconds. This alarm is generated when HA detects that the httpd resources are abnormal for 10 consecutive times. + +This alarm is cleared when the httpd resource is normal. + +**Resource Type** of httpd is **Single-active**. Active/standby will be triggered upon resource exceptions. When this alarm is generated, the active/standby switchover is complete and new httpd resources have been enabled on the new active FusionInsight Manager. In this case, this alarm is cleared. This alarm is used to notify users of the cause of the active/standby switchover. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12071 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- The active/standby FusionInsight Manager switchover occurs. +- The httpd process is repeatedly restarts, which may lead to the failure to visit the native service UI. + +Possible Causes +--------------- + +The httpd process is abnormal. + +Procedure +--------- + +**Check whether the httpd process is abnormal.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the name of the host for which the alarm is generated. + +#. Log in to the host for which the alarm is generated as user **root**. + +#. Run the **su - omm** command to switch to user **omm**. + +#. Run the **sh ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/module/hacom/script/status_ha.sh** command to check whether the status of the httpd resources managed by the HA is normal. In the single-node system, the httpd resource is in the normal state. 
In the dual-node system, the httpd resource is in the normal state on the active node and in the stopped state on the standby node. + + - If it is, go to :ref:`7 `. + - If it is not, go to :ref:`5 `. + +#. .. _alm-12071__li584395101819: + + Run the **vi $BIGDATA_LOG_HOME/omm/oms/ha/scriptlog/httpd.log** command to view the httpd resource logs, check whether the keyword **ERROR** exists. Analyze the logs to locate and rectify the fault. + +#. Five minutes later, check whether this alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-12071__li384145118188: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +8. Select **Controller** and **OmmServer** for **Service** and click **OK**. + +9. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 1 hour before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +10. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383916.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12072_floatip_resource_is_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12072_floatip_resource_is_abnormal.rst new file mode 100644 index 0000000..3fed9d1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12072_floatip_resource_is_abnormal.rst @@ -0,0 +1,116 @@ +:original_name: ALM-12072.html + +.. _ALM-12072: + +ALM-12072 FloatIP Resource Is Abnormal +====================================== + +Description +----------- + +HA checks the floatip resources of Manager every 9 seconds. This alarm is generated when HA detects that the floatip resources are abnormal for 3 consecutive times. + +This alarm is cleared when the FloatIP resource is normal. + +**Resource Type** of FloatIP is **Single-active**. Active/standby will be triggered upon resource exceptions. When this alarm is generated, the active/standby switchover is complete and new FloatIP resources have been enabled on the new active FusionInsight Manager. In this case, this alarm is cleared. This alarm is used to notify users of the cause of the active/standby switchover. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12072 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- The active/standby FusionInsight Manager switchover occurs. +- The FloatIP process is repeatedly restarts, which may lead to the failure to visit the native service UI. + +Possible Causes +--------------- + +- The floating IP address is abnormal. + +Procedure +--------- + +**Check the floating IP address status of the active management node.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the address of the host for which the alarm is generated and the resource name. + +#. Log in to the active management node as user **root**. + +#. Run the following command, go to the **${BIGDATA_HOME}/om-server/om/sbin/** directory. + + **su - omm** + + **cd** **${BIGDATA_HOME}/om-server/om/sbin/** + +#. Run the **sh status-oms.sh** command, and execute the **status-oms.sh** script to check whether the floating IP address of the active FusionInsight Manager is normal. View the command output, locate the row where **ResName** is **floatip**, and check whether the following information is displayed. + + For example: + + .. code-block:: + + 10-10-10-160 floatip Normal Normal Single_active + + - If it is, go to :ref:`8 `. + - If it is not, go to :ref:`5 `. + +#. .. _alm-12072__li162681212172: + + Run the **ifconfig** command to check whether the NIC with the floating IP address exists. + + - If it does, go to :ref:`8 `. + - If it does not, go to :ref:`6 `. + +#. .. _alm-12072__li19269111111714: + + Run the **ifconfig** *NIC name Floating IPaddress* netmask *Subnet mask* command to reconfigure the NIC with the floating IP address. (For example, **ifconfig eth0 10.10.10.102 netmask 255.255.255.0**). + +#. Five minutes later, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`8 `. + +**Collect fault information.** + +8. .. _alm-12072__li726861151715: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +9. Select **Controller** and **OmmServer** for **Service** and click **OK**. + +10. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 1 hour before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +11. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383917.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12073_cep_resource_is_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12073_cep_resource_is_abnormal.rst new file mode 100644 index 0000000..578cc8a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12073_cep_resource_is_abnormal.rst @@ -0,0 +1,97 @@ +:original_name: ALM-12073.html + +.. 
_ALM-12073: + +ALM-12073 CEP Resource Is Abnormal +================================== + +Description +----------- + +HA checks the cep resources of Manager every 60 seconds. This alarm is generated when HA detects that the cep resources are abnormal for 2 consecutive times. + +This alarm is cleared when the CEP resource is normal. + +**Resource Type** of CEP is **Single-active**. Active/standby will be triggered upon resource exceptions. When this alarm is generated, the active/standby switchover is complete and new CEP resources have been enabled on the new active FusionInsight Manager. In this case, this alarm is cleared. This alarm is used to notify users of the cause of the active/standby switchover. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12073 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- The active/standby FusionInsight Manager switchover occurs. +- The CEP process repeatedly restarts, causing monitoring data to be abnormal. + +Possible Causes +--------------- + +The CEP process is abnormal. + +Procedure +--------- + +**Check whether the CEP process is abnormal.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the name of the host for which the alarm is generated. + +#. Log in to the host for which the alarm is generated as user **root**. + +#. Run the **su -omm** command and then the **sh ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/module/hacom/script/status_ha.sh** command to check whether the status of the CEP resources managed by the HA is normal. In the single-node system, the CEP resource is in the normal state. In the dual-node system, the CEP resource is in the normal state on the active node and in the stopped state on the standby node. + + - If it is, go to :ref:`6 `. + - If it is not, go to :ref:`4 `. + +#. .. _alm-12073__li8262123151618: + + Run the **vi $BIGDATA_LOG_HOME/omm/oms/cep/cep.log** and **vi $BIGDATA_LOG_HOME/omm/oms/cep/scriptlog/cep_ha.log** commands to view the CEP resource logs, check whether the keyword **ERROR** exists. Analyze the logs to locate and rectify the fault. + +#. Five minutes later, check whether this alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-12073__li9258163110165: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +7. Select **Controller** and **OmmServer** for **Service** and click **OK**. + +8. Click |image1| in the upper right corner. 
In the displayed dialog box, set **Start Date** and **End Date** to 1 hour before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +9. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383918.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12074_fms_resource_is_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12074_fms_resource_is_abnormal.rst new file mode 100644 index 0000000..4f28c4c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12074_fms_resource_is_abnormal.rst @@ -0,0 +1,97 @@ +:original_name: ALM-12074.html + +.. _ALM-12074: + +ALM-12074 FMS Resource Is Abnormal +================================== + +Description +----------- + +HA checks the fms resources of Manager every 60 seconds. This alarm is generated when HA detects that the fms resources are abnormal for 2 consecutive times. + +This alarm is cleared when the FMS resource is normal. + +**Resource Type** of FMS is **Single-active**. Active/standby will be triggered upon resource exceptions. When this alarm is generated, the active/standby switchover is complete and new FMS resources have been enabled on the new active FusionInsight Manager. In this case, this alarm is cleared. This alarm is used to notify users of the cause of the active/standby switchover. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12074 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- The active/standby FusionInsight Manager switchover occurs. +- The FMS process repeatedly restarts. As a result, alarm information may fail to be reported. + +Possible Causes +--------------- + +The FMS process is abnormal. + +Procedure +--------- + +**Check whether the FMS process is abnormal.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the name of the host for which the alarm is generated. + +#. Log in to the host for which the alarm is generated as user **root**. + +#. 
Run the **su -omm** command and then the **sh ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/module/hacom/script/status_ha.sh** command to check whether the status of the FMS resources managed by the HA is normal. In the single-node system, the FMS resource is in the normal state. In the dual-node system, the FMS resource is in the normal state on the active node and in the stopped state on the standby node. + + - If it is, go to :ref:`6 `. + - If it is not, go to :ref:`4 `. + +#. .. _alm-12074__li1183383931416: + + Run the **vi $BIGDATA_LOG_HOME/omm/oms/fms/fms.log** and **vi $BIGDATA_LOG_HOME/omm/oms/fms/scriptlog/fms_ha.log** commands to view the FMS resource logs, check whether the keyword **ERROR** exists. Analyze the logs to locate and rectify the fault. + +#. 5 minutes later, check whether this alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-12074__li5828173931412: + + On FusionInsight Manager, choose **O&M**> **Log** > **Download**. + +7. Select **Controller** and **OmmServer** for **Service** and click **OK**. + +8. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 1 hour before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +9. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383919.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12075_pms_resource_is_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12075_pms_resource_is_abnormal.rst new file mode 100644 index 0000000..9007b51 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12075_pms_resource_is_abnormal.rst @@ -0,0 +1,97 @@ +:original_name: ALM-12075.html + +.. _ALM-12075: + +ALM-12075 PMS Resource Is Abnormal +================================== + +Description +----------- + +HA checks the pms resources of Manager every 55 seconds. This alarm is generated when HA detects that the pms resources are abnormal for three consecutive times. + +This alarm is cleared when the PMS resource is normal. + +**Resource Type** of PMS is **Single-active**. Active/standby will be triggered upon resource exceptions. When this alarm is generated, the active/standby switchover is complete and new PMS resources have been enabled on the new active FusionInsight Manager. In this case, this alarm is cleared. This alarm is used to notify users of the cause of the active/standby switchover. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12075 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. 
| ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- The active/standby FusionInsight Manager switchover occurs. +- The PMS process repeatedly restarts, causing monitoring information to be abnormal. + +Possible Causes +--------------- + +The PMS process is abnormal. + +Procedure +--------- + +**Check whether the PMS process is abnormal.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the name of the host for which the alarm is generated. + +#. Log in to the host for which the alarm is generated as user **root**. + +#. Run the **su -omm** command and then the **sh ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/module/hacom/script/status_ha.sh** command to check whether the status of the PMS resources managed by the HA is normal. In the single-node system, the PMS resource is in the normal state. In the dual-node system, the PMS resource is in the normal state on the active node and in the stopped state on the standby node. + + - If it is, go to :ref:`6 `. + - If it is not, go to :ref:`4 `. + +#. .. _alm-12075__li1288412199129: + + Run the **vi $BIGDATA_LOG_HOME/omm/oms/pms/pms.log** and **vi $BIGDATA_LOG_HOME/omm/oms/pms/scriptlog/pms_ha.log** commands to view the PMS resource logs, check whether the keyword **ERROR** exists. Analyze the logs to locate and rectify the fault. + +#. Five minutes later, check whether this alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-12075__li11878219121215: + + On FusionInsight Manager, choose **O&M**> **Log** > **Download**. + +7. Select **Controller** and **OmmServer** for **Service** and click **OK**. + +8. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 1 hour before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +9. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383920.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12076_gaussdb_resource_is_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12076_gaussdb_resource_is_abnormal.rst new file mode 100644 index 0000000..5e62eee --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12076_gaussdb_resource_is_abnormal.rst @@ -0,0 +1,133 @@ +:original_name: ALM-12076.html + +.. _ALM-12076: + +ALM-12076 GaussDB Resource Is Abnormal +====================================== + +Description +----------- + +HA checks the Manager database every 10 seconds. 
This alarm is generated when HA detects that the database is abnormal for 3 consecutive times. + +This alarm is cleared when the database is normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12076 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If databases are abnormal, all core services and related service processes, such as alarms and monitoring functions, are affected. + +Possible Causes +--------------- + +An exception occurs in the database. + +Procedure +--------- + +**Check the database status of the active and standby management nodes.** + +#. Log in to the active and standby management nodes respectively as user **root**. Run the **su - ommdba** command to switch to user **ommdba**, and then run the **gs_ctl query** command to check whether the following information is displayed in the command output. + + Command output of the active management node: + + .. code-block:: + + Ha state: + LOCAL_ROLE: Primary + STATIC_CONNECTIONS : 1 + DB_STATE : Normal + DETAIL_INFORMATION : user/password invalid + Senders info: + No information + Receiver info: + No information + + Command output of the standby management node: + + .. code-block:: + + Ha state: + LOCAL_ROLE: Standby + STATIC_CONNECTIONS : 1 + DB_STATE : Normal + DETAIL_INFORMATION : user/password invalid + Senders info: + No information + Receiver info: + No information + + - If it is, go to :ref:`3 `. + - If it is not, go to :ref:`2 `. + +#. .. _alm-12076__li1051911355122: + + Contact the network administrator to check whether the network is faulty. + + - If it is, go to :ref:`3 `. + - If it is not, go to :ref:`5 `. + +#. .. _alm-12076__li251973518126: + + Five minutes later, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`4 `. + +#. .. _alm-12076__li85203358122: + + Log in to the active and standby management nodes, run the **su -omm** command to switch to user **omm**, go to the **${BIGDATA_HOME} /om-server/om/sbin/** directory, and run the **status-oms.sh** script to check whether the floating IP addresses and GaussDB resources of the active and standby FusionInsight Managers are in the status shown in the following figure. + + |image1| + + - If they are, find the alarm in the alarm list and manually clear the alarm. + - If they are not, go to :ref:`5 `. + +**Collect fault information.** + +5. .. _alm-12076__li151723519124: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +6. Select **OmmServer** for **Service** and click **OK**. + +7. 
Click |image2| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +8. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383921.jpg +.. |image2| image:: /_static/images/en-us_image_0269383922.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12077_user_omm_expired.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12077_user_omm_expired.rst new file mode 100644 index 0000000..c71367a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12077_user_omm_expired.rst @@ -0,0 +1,96 @@ +:original_name: ALM-12077.html + +.. _ALM-12077: + +ALM-12077 User omm Expired +========================== + +Description +----------- + +The system starts at 00:00 every day to check whether user **omm** has expired every eight hours. This alarm is generated if the user account has expired. + +This alarm is cleared when the expiration time of user **omm** is changed and the user account status becomes normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12077 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +User **omm** has expired. The node trust relationship is unavailable, and FusionInsight Manager cannot manage the services. + +Possible Causes +--------------- + +User **omm** has expired. + +Procedure +--------- + +**Check whether user omm in the system has expired.** + +#. Log in to the faulty node as user **root**. + + Run the **chage -l omm**\ command to view the information about the password of user **omm**. + +#. View the value of **Account expires** to check whether the user configurations have expired. + + .. note:: + + If the parameter value is **never**, the user configurations never expire. + + - If they do, go to :ref:`3 `. + - If they do not, go to :ref:`4 `. + +#. .. _alm-12077__li20789183613915: + + Run the **chage -E 'yyyy-MM-dd' omm** command to set the expiration time of user **omm**. Eight hours later, check whether the alarm is automatically cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`4 `. 
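The following is a minimal, hedged sketch of how the check and the fix in the preceding step might look in a shell session; the date and the sample output are placeholders, not values from your environment.

.. code-block::

   # Check the current expiration date of user omm.
   chage -l omm | grep "Account expires"
   # Example output: Account expires : Jan 01, 2022

   # Example only: extend the account expiration to a later date (format yyyy-MM-dd).
   chage -E '2026-12-31' omm

   # Confirm the new expiration date.
   chage -l omm | grep "Account expires"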
+ +**Collect fault information.** + +4. .. _alm-12077__li877819366912: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +5. Select **NodeAgent** for **Service** and click **OK**. + +6. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +7. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383923.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12078_password_of_user_omm_expired.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12078_password_of_user_omm_expired.rst new file mode 100644 index 0000000..a7a224c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12078_password_of_user_omm_expired.rst @@ -0,0 +1,96 @@ +:original_name: ALM-12078.html + +.. _ALM-12078: + +ALM-12078 Password of User omm Expired +====================================== + +Description +----------- + +The system starts at 00:00 every day to check whether the password of user **omm** has expired every 8 hours. This alarm is generated if the password has expired. + +This alarm is cleared when the expiration time of user **omm** password is changed and the user password status becomes normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12078 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The password of user **omm** has expired. The node trust relationship is unavailable, and FusionInsight Manager cannot manage the services. + +Possible Causes +--------------- + +The password of user **omm** has expired. + +Procedure +--------- + +**Check whether the password of user omm in the system has expired.** + +#. Log in to the faulty node as user **root**. + + Run the **chage -l omm**\ command to view the information about the password of user **omm**. + +#. View the value of **Password expires** to check whether the user configurations have expired. + + .. note:: + + If the parameter value is **never**, the user configurations never expire. + + - If they do, go to :ref:`3 `. + - If they do not, go to :ref:`4 `. + +#. .. 
_alm-12078__li1029433611616: + + Run the **chage -M '**\ *days*\ **' omm** command to set the validity period of the password for user **omm**. Eight hours later, check whether the alarm is automatically cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-12078__li14287143618614: + + On FusionInsight Manager, choose **O&M**> **Log** > **Download**. + +5. Select **NodeAgent** for **Service** and click **OK**. + +6. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +7. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383924.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12079_user_omm_is_about_to_expire.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12079_user_omm_is_about_to_expire.rst new file mode 100644 index 0000000..7445459 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12079_user_omm_is_about_to_expire.rst @@ -0,0 +1,96 @@ +:original_name: ALM-12079.html + +.. _ALM-12079: + +ALM-12079 User omm Is About to Expire +===================================== + +Description +----------- + +The system starts at 00:00 every day to check whether user **omm** is about to expire every 8 hours. This alarm is generated if the user account will expire no less than 15 days later. + +This alarm is cleared when the expiration time of user **omm** is changed and the user account status becomes normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12079 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +User **omm** has expired. The node trust relationship is unavailable, and FusionInsight Manager cannot manage the services. + +Possible Causes +--------------- + +The account of user **omm** is about to expire. + +Procedure +--------- + +**Check whether user omm is about to expire.** + +#. Log in to the faulty node as user **root**. + + Run the **chage -l omm** command to view the information about the password of user **omm**. + +#. 
View the value of **Account expires** to check whether the user configurations are about to expire. + + .. note:: + + If the parameter value is **never**, the user and password are valid permanently; if the value is a date, check whether the user and password are about to expire within 15 days. + + - If they are, go to :ref:`3 `. + - If they are not, go to :ref:`4 `. + +#. .. _alm-12079__li152131612669: + + Run the **chage -E** *'yyyy-MM-dd'* **omm** command to set the validity period of user **omm**. Eight hours later, check whether the alarm is automatically cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-12079__li152108126616: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +5. Select **NodeAgent** for **Service** and click **OK**. + +6. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +7. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383925.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12080_password_of_user_omm_is_about_to_expire.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12080_password_of_user_omm_is_about_to_expire.rst new file mode 100644 index 0000000..834c862 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12080_password_of_user_omm_is_about_to_expire.rst @@ -0,0 +1,96 @@ +:original_name: ALM-12080.html + +.. _ALM-12080: + +ALM-12080 Password of User omm Is About to Expire +================================================= + +Description +----------- + +The system starts at 00:00 every day to check whether the password of user **omm** is about to expire every 8 hours. This alarm is generated if the password will expire no less than 15 days later. + +This alarm is cleared when the expiration time of user **omm** password is reset and the user password status becomes normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12080 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. 
| ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The password of user **omm** has expired. The node trust relationship is unavailable, and FusionInsight Manager cannot manage the services. + +Possible Causes +--------------- + +The password of user **omm** is about to expire. + +Procedure +--------- + +**Check whether the password of user omm in the system is about to expire.** + +#. Log in to the faulty node as user **root**. + + Run the **chage -l omm**\ command to view the information about the password of user **omm**. + +#. View the value of **Password expires** to check whether the user configurations are about to expire. + + .. note:: + + If the parameter value is **never**, the user and password are valid permanently; if the value is a date, check whether the user and password are about to expire within 15 days. + + - If they are, go to :ref:`3 `. + - If they are not, go to :ref:`4 `. + +#. .. _alm-12080__li29331339412: + + Run the **chage -M '**\ *days*\ **' omm** command to set the validity period of the password for user **omm**. Eight hours later, check whether the alarm is automatically cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-12080__li29311039413: + + On FusionInsight Manager, choose **O&M**> **Log** > **Download**. + +5. Select **NodeAgent** for **Service** and click **OK**. + +6. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +7. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383926.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12081user_ommdba_expired.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12081user_ommdba_expired.rst new file mode 100644 index 0000000..5c5506a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12081user_ommdba_expired.rst @@ -0,0 +1,96 @@ +:original_name: ALM-12081.html + +.. _ALM-12081: + +ALM-12081 User ommdba Expired +============================== + +Description +----------- + +The system starts at 00:00 every day to check whether user **ommdba** has expired every 8 hours. This alarm is generated if the user account has expired. + +This alarm is cleared when the expiration time of user **ommdba** is reset and the user account status becomes normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12081 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. 
| ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The OMS database cannot be managed and data cannot be accessed. + +Possible Causes +--------------- + +The account of user **ommdba** for the host has expired. + +Procedure +--------- + +**Check whether user ommdba has expired.** + +#. Log in to the faulty node as user **root**. + + Run the **chage -l ommdba** command to view the information about the password of user **ommdba**. + +#. View the value of **Account expires** to check whether the user configurations have expired. + + .. note:: + + If the parameter value is **never**, the user and password are valid permanently; if the value is a date, check whether the user and password have expired. + + - If they do, go to :ref:`3 `. + - If they do not, go to :ref:`4 `. + +#. .. _alm-12081__li1254561011: + + Run the **chage -E** *'yyyy-MM-dd'* **omm** command to set the validity period of user **ommdba**. Eight hours later, check whether the alarm is automatically cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-12081__li65317618114: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +5. Select **NodeAgent** for **Service** and click **OK**. + +6. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +7. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383927.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12082_user_ommdba_is_about_to_expire.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12082_user_ommdba_is_about_to_expire.rst new file mode 100644 index 0000000..a9b10f6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12082_user_ommdba_is_about_to_expire.rst @@ -0,0 +1,96 @@ +:original_name: ALM-12082.html + +.. _ALM-12082: + +ALM-12082 User ommdba Is About to Expire +======================================== + +Description +----------- + +The system starts at 00:00 every day to check whether user **ommdba** is about to expire every 8 hours. This alarm is generated if the user account will expire no less than 15 days later. + +This alarm is cleared when the expiration time of user **ommdba** is reset and the user account status becomes normal. 
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12082 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The OMS database cannot be managed and data cannot be accessed. + +Possible Causes +--------------- + +The account of user **ommdba** for the host is about to expire. + +Procedure +--------- + +**Check whether user ommdba is about to expire.** + +#. Log in to the faulty node as user **root**. + + Run the **chage -l ommdba** command to view the information about user **ommdba**. + +#. View the value of **Account expires** to check whether the user configurations are about to expire. + + .. note:: + + If the parameter value is **never**, the user and password are valid permanently; if the value is a date, check whether the user and password are about to expire within 15 days. + + - If they are, go to :ref:`3 `. + - If they are not, go to :ref:`4 `. + +#. .. _alm-12082__li206562816581: + + Run the **chage -E** *'yyyy-MM-dd'* **ommdba** command to set the validity period of user **ommdba**. Eight hours later, check whether the alarm is automatically cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-12082__li10631028105811: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +5. Select **NodeAgent** for **Service** and click **OK**. + +6. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +7. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383928.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12083_password_of_user_ommdba_is_about_to_expire.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12083_password_of_user_ommdba_is_about_to_expire.rst new file mode 100644 index 0000000..038e22d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12083_password_of_user_ommdba_is_about_to_expire.rst @@ -0,0 +1,96 @@ +:original_name: ALM-12083.html + +.. 
_ALM-12083: + +ALM-12083 Password of User ommdba Is About to Expire +==================================================== + +Description +----------- + +The system starts at 00:00 every day to check whether the password of user **ommdba** is about to expire every 8 hours. This alarm is generated if the password is about to expire no less than 15 days later. + +This alarm is cleared when the expiration time of user **ommdba** password is reset and the user password status becomes normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12083 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The OMS database cannot be managed and data cannot be accessed. + +Possible Causes +--------------- + +The password of user **ommdba** is about to expire. + +Procedure +--------- + +**Check whether the password of user ommdba in the system is about to expire.** + +#. Log in to the faulty node as user **root**. + + Run the **chage -l ommdba** command to view the information about the password of user **ommdba**. + +#. View the value of **Password expires** to check whether the user configurations are about to expire. + + .. note:: + + If the parameter value is **never**, the user and password are valid permanently; if the value is a date, check whether the user and password are about to expire within 15 days. + + - If they are, go to :ref:`3 `. + - If they are not, go to :ref:`4 `. + +#. .. _alm-12083__li181858713577: + + Run the **chage -M** *'days*\ **' ommdba** command to set the validity period of the password for user **ommdba**. Eight hours later, check whether the alarm is automatically cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-12083__li51831473571: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +5. Select **NodeAgent** for **Service** and click **OK**. + +6. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +7. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269383929.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12084_password_of_user_ommdba_expired.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12084_password_of_user_ommdba_expired.rst new file mode 100644 index 0000000..0c9c170 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12084_password_of_user_ommdba_expired.rst @@ -0,0 +1,96 @@ +:original_name: ALM-12084.html + +.. _ALM-12084: + +ALM-12084 Password of User ommdba Expired +========================================= + +Description +----------- + +The system starts at 00:00 every day to check whether the password of user **ommdba** has expired every 8 hours. This alarm is generated if the password has expired. + +This alarm is cleared when the expiration time of user **ommdba** password is reset and the user password status becomes normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12084 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The password of user **ommdba** has expired. The node trust relationship is unavailable, and FusionInsight Manager cannot manage the services. + +Possible Causes +--------------- + +The password of user **ommdba** for the host has expired. + +Procedure +--------- + +**Check whether the password of user ommdba in the system has expired.** + +#. Log in to the faulty node as user **root**. + + Run the **chage -l ommdba** command to view the information about the password of user **ommdba**. + +#. View the value of **Password expires** to check whether the user configurations have expired. + + .. note:: + + If the parameter value is **never**, the user and password are valid permanently; if the value is a date, check whether the user and password have expired. + + - If they do, go to :ref:`3 `. + - If they do not, go to :ref:`4 `. + +#. .. _alm-12084__li6810122017542: + + Run the **chage -M** *'days*\ **' ommdba** command to set the validity period of the password for user **ommdba**. Eight hours later, check whether the alarm is automatically cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-12084__li2808420175418: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +5. Select **NodeAgent** for **Service** and click **OK**. + +6. 
Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +7. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383930.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12085_service_audit_log_dump_failure.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12085_service_audit_log_dump_failure.rst new file mode 100644 index 0000000..31bb67a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12085_service_audit_log_dump_failure.rst @@ -0,0 +1,154 @@ +:original_name: ALM-12085.html + +.. _ALM-12085: + +ALM-12085 Service Audit Log Dump Failure +======================================== + +Description +----------- + +The system dumps service audit logs at 03:00 every day and stores them on the OMS node. This alarm is generated when the dump fails. This alarm is cleared when the next dump succeeds. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12085 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The service audit logs may be lost. + +Possible Causes +--------------- + +- The service audit logs are oversized. +- The OMS backup storage space is insufficient. +- The storage space of a host where the service is located is insufficient. + +Procedure +--------- + +**Check whether the service audit logs are oversized.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and view the IP address of the host and additional information for which the alarm is generated. + +#. Log in to the host where the alarm is generated as user **root**. + +#. Run the **vi ${BIGDATA_LOG_HOME}/controller/scriptlog/getLogs.log** command to check whether the keyword "LOG SIZE is more than 5000MB" can be searched. + + - If it can, go to :ref:`4 `. + - If it cannot, go to :ref:`5 `. + +#. .. _alm-12085__li1525114552513: + + Check whether the oversized service audit logs are caused by exceptions. + +**The OMS backup storage space is insufficient.** + +5. .. 
_alm-12085__li17248145525118: + + Run the **vi ${BIGDATA_LOG_HOME}/controller/scriptlog/getLogs.log** command to check whether the keyword "Collect log failed, too many logs on" can be searched. + + - If it can, obtain the host IP address following the keyword "Collect log failed, too many logs on", and go to :ref:`6 `. + - If it cannot, go to :ref:`11 `. + +6. .. _alm-12085__li1324811555511: + + Log in to the host with the IP address obtained in :ref:`5 ` as user **root**. + +7. Run the **vi ${BIGDATA_LOG_HOME}/nodeagent/scriptlog/collectLog.log** command to check whether the keyword "log size exceeds" can be searched. + + - If it can, go to :ref:`9 `. + - If it cannot, go to :ref:`8 `. + +8. .. _alm-12085__li1532033151617: + + Check whether the alarm additional information contains the keyword "no enough space". + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`11 `. + +9. .. _alm-12085__li1411119282589: + + Perform the following operations to expand the disk capacity or reduce the maximum number of audit log backups: + + - Expand the capacity of the OMS node. + + - Run the following command to edit the file and decrease the value of **MAX_NUM_BK_AUDITLOG**. + + **vi ${CONTROLLER_HOME}/etc/om/componentsauditlog.properties** + +10. At 03:00 in the next execution period, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`11 `. + +**Check whether the space of the host where the service is located is insufficient.** + +11. .. _alm-12085__li274114665213: + + Run the **vi ${BIGDATA_LOG_HOME}/controller/scriptlog/getLogs.log** command to check whether the keyword "Collect log failed, no enough space on *hostIp*" can be searched. + + - If it can, obtain the IP address of the abnormal host and go to :ref:`12 `. + - If it cannot, go to :ref:`15 `. + +12. .. _alm-12085__li137411362525: + + Log in to the host with the IP address obtained as user **root**, and run the **df "$BIGDATA_HOME/tmp" -lP \| tail -1 \| awk '{print ($4/1024)}'** command to obtain the remaining space of the host log directory. Check whether the value is less than 1000 MB. + + - If it is, go to :ref:`13 `. + - If it is not, go to :ref:`15 `. + +13. .. _alm-12085__li274186155216: + + Expand the capacity of the node. + +14. At 03:00 in the next execution period, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`15 `. + +**Collect fault information.** + +15. .. _alm-12085__li1181415165216: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +16. Select **Controller** for **Service** and click **OK**. + +17. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +18. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +..
|image1| image:: /_static/images/en-us_image_0269383932.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12087_system_is_in_the_upgrade_observation_period.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12087_system_is_in_the_upgrade_observation_period.rst new file mode 100644 index 0000000..486330b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12087_system_is_in_the_upgrade_observation_period.rst @@ -0,0 +1,104 @@ +:original_name: ALM-12087.html + +.. _ALM-12087: + +ALM-12087 System Is in the Upgrade Observation Period +===================================================== + +Description +----------- + +The system checks whether it is in the upgrade observation period at 00:00 every day and checks whether the duration that it has been in the upgrade observation state exceeds the preset upgrade observation period, 10 days by default. This alarm is generated when the system is in the upgrade observation period and the duration that the system has been in the upgrade observation state exceeds the preset period (10 days by default). This alarm is automatically cleared if the system exits the upgrade observation period after the user performs a rollback or submission. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12087 Major Yes +======== ============== ========== + +Parameters +---------- + ++-----------------------------------+--------------------------------------------------------------------------+ +| Name | Meaning | ++===================================+==========================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-----------------------------------+--------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-----------------------------------+--------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-----------------------------------+--------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-----------------------------------+--------------------------------------------------------------------------+ +| Upgrade Observation Period (Days) | Specifies the days that the system is in the upgrade observation period. | ++-----------------------------------+--------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The next upgrade or patch installation will fail. + +Possible Causes +--------------- + +The upgrade task is not submitted a specified period of time (10 days by default) after the system upgrade. + +Procedure +--------- + +**Check whether the system is in the upgrade observation period.** + +#. Log in to the active management node as user **root**. + +#. Run the following commands to switch to user **omm** and log in to the **omm** database: + + **su - omm** + + **gsql -U omm -W** *omm database password* **-p 20015** + +#. Run the **select \* from OM_CLUSTERS** command to view cluster information. + +#. 
Check whether the value of **upgradObservationPeriod isON** is **true**, as shown in :ref:`Figure 1 `. + + - If it is, the system is in the upgrade observation period. Use the UpdateTool to submit the upgrade task. For details, see the upgrade guide of the corresponding version. + + - If it is not, go to :ref:`6 `. + + .. _alm-12087__fig1299312444469: + + .. figure:: /_static/images/en-us_image_0269383933.png + :alt: **Figure 1** Cluster information + + **Figure 1** Cluster information + +5. In the early morning of the next day, check whether this alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-12087__li5925153912518: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +7. Select **Controller** from the **Service** and click **OK**. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383934.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12089_inter-node_network_is_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12089_inter-node_network_is_abnormal.rst new file mode 100644 index 0000000..5b197f4 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12089_inter-node_network_is_abnormal.rst @@ -0,0 +1,111 @@ +:original_name: ALM-12089.html + +.. _ALM-12089: + +ALM-12089 Inter-Node Network Is Abnormal +======================================== + +Description +----------- + +The alarm module checks the network health status of nodes in the cluster every 10 seconds. This alarm is generated when the network between two nodes is unreachable or the network status is unstable. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12089 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Functions of some components, such as HDFS and ZooKeeper, are affected. + +Possible Causes +--------------- + +- The node breaks down. +- The network is faulty. 
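A quick manual reachability test between the two nodes named in the alarm can help narrow down which of these causes applies before you start the procedure below. This is a minimal sketch; the IP address is a placeholder for the destination address recorded in the alarm's **Additional Information**.

.. code-block::

   # Run on the node that reported the alarm; replace the address with the
   # destination IP address from the alarm details.
   ping -c 5 192.168.0.12
   # 100% packet loss suggests the peer node is down or unreachable;
   # partial loss suggests an unstable network.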
+ +Procedure +--------- + +**Check the network health status.** + +#. In the alarm list on FusionInsight Manager, click the drop-down button of the alarm and view **Additional Information**. Record the source IP address and destination IP address of the node for which the alarm is reported. + +#. .. _alm-12089__li189988381537: + + Log in to the node for which the alarm is reported. On the node, ping the target node to check whether the network between the two nodes is normal. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`3 `. + +**Check the node status.** + +3. .. _alm-12089__li184601124820: + + On FusionInsight Manager, click **Host** and check whether the host list contains the faulty node to determine whether the faulty node has been removed from the cluster. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`4 `. + +4. .. _alm-12089__li19460824120: + + Check whether the faulty node is powered off. + + - If yes, start the faulty node and go to :ref:`2 `. + - If no, contact related personnel to find the root cause. If the faulty node needs to be removed from the cluster, remove it and go to :ref:`5 `; otherwise, go to :ref:`6 `. + +5. .. _alm-12089__li746012241226: + + Delete the **$NODE_AGENT_HOME/etc/agent/hosts.ini** file from all nodes in the cluster, clear the **/var/log/Bigdata/unreachable/unreachable_ip_info.log** file, and then manually clear the alarm. + +6. .. _alm-12089__li1646022411214: + + Wait 30 seconds and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-12089__li69951938132: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **OmmAgent** from the **Service** drop-down list and click **OK**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383936.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12101_az_unhealthy.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12101_az_unhealthy.rst new file mode 100644 index 0000000..1d1a81c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12101_az_unhealthy.rst @@ -0,0 +1,99 @@ +:original_name: ALM-12101.html + +.. _ALM-12101: + +ALM-12101 AZ Unhealthy +====================== + +Description +----------- + +After the AZ DR function is enabled, the system checks the AZ health status every 5 minutes. This alarm is generated when the system detects that the AZ is subhealthy or unhealthy. This alarm is cleared when the AZ becomes healthy.
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12101 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +AZName Specifies the AZ for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The health status of an AZ is determined by whether the health status of storage resources (HDFS), computing resources (Yarn), and key roles in the AZ exceeds the configured threshold. + +An AZ is subhealthy when: + +- The computing resources (Yarn) are unhealthy, but the storage resources (HDFS) are healthy. Tasks cannot be submitted to the local AZ, but data can still be read and written in the local AZ. +- The computing resources (Yarn) are healthy, but some storage resources (HDFS) are unhealthy. Tasks can be submitted to the local AZ, and some data can be read and written in the local AZ. This depends on the locality of data detected by Spark/Hive scheduling. + +An AZ is unhealthy when: + +- The computing resources (Yarn) are healthy, but the storage resources (HDFS) are unhealthy. Although tasks can be submitted to the local AZ, data cannot be read or written in the local AZ. As a result, the tasks submitted to the local AZ are invalid. +- The computing resources (Yarn) and storage resources (HDFS) are unhealthy. Tasks cannot be submitted to the local AZ, and data cannot be read or written in the local AZ. +- The health status of key roles except Yarn and HDFS is lower than the configured threshold. + +Possible Causes +--------------- + +- The computing resources (Yarn) are unhealthy. +- The storage resources (HDFS) are unhealthy. +- Some storage resources (HDFS) are unhealthy. +- Key roles except Yarn and HDFS are unhealthy. + +Procedure +--------- + +**Disable the DR drill.** + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Cross-AZ HA**. The Cross-AZ HA page is displayed. + +#. In the AZ DR list, check whether **Perform DR Drill** in the **Operation** column of the AZ whose health status is **Unhealthy** is gray. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`3 `. + +#. .. _alm-12101__li1076171313521: + + Click **Restore** in the **Operation** column of the target AZ. Wait 2 minutes and refresh the page to view the health status of the AZ. Check whether the health status is normal. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Collect the fault information.** + +4. .. _alm-12101__li57606134528: + + Log in to the active management node as user **root**. + +5. View logs of unhealthy services. + + - HDFS log files are stored in **/var/log/Bigdata/hdfs/nn/hdfs-az-state.log**. + - Yarn log files are stored in **/var/log/Bigdata/yarn/rm/yarn-az-state.log**. + - For other services, view the service health check logs in the corresponding service log directory. + +6. Contact O&M personnel and provide detailed log file information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. 
+ +Related Information +------------------- + +None diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12102_az_ha_component_is_not_deployed_based_on_dr_requirements.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12102_az_ha_component_is_not_deployed_based_on_dr_requirements.rst new file mode 100644 index 0000000..c8bc528 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12102_az_ha_component_is_not_deployed_based_on_dr_requirements.rst @@ -0,0 +1,68 @@ +:original_name: ALM-12102.html + +.. _ALM-12102: + +ALM-12102 AZ HA Component Is Not Deployed Based on DR Requirements +================================================================== + +Description +----------- + +The alarm module checks the deployment status of AZ HA components every 5 minutes. This alarm is generated when the components that support DR are not deployed based on DR requirements after AZ is enabled. This alarm is cleared when the components are deployed based on DR requirements. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12102 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The cross-AZ HA capability of a single cluster is affected. + +Possible Causes +--------------- + +The roles of the components that support DR are not deployed based on DR requirements. + +Procedure +--------- + +**Obtain alarm information.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. +#. In the alarm list, click |image1| in the row that contains the alarm and view the roles that are not deployed based on DR requirements in **Additional Information**. + +**Redeploy the role instance.** + +3. Choose **Cluster** > **Services** > *Name of the desired service* > **Instance**. On the instance page, redeploy or adjust the role instance. +4. Check whether the alarm is cleared 10 minutes later. + + - If yes, no further action is required. + - If no, contact O&M personnel. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001085773316.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12110_failed_to_get_ecs_temporary_ak_sk.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12110_failed_to_get_ecs_temporary_ak_sk.rst new file mode 100644 index 0000000..9965bb6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12110_failed_to_get_ecs_temporary_ak_sk.rst @@ -0,0 +1,86 @@ +:original_name: ALM-12110.html + +.. 
_ALM-12110: + +ALM-12110 Failed to get ECS temporary AK/SK +=========================================== + +Description +----------- + +The meta service periodically obtains the temporary AK/SK of the ECS. This alarm is generated when the meta service fails to obtain the temporary AK/SK. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12110 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +In storage-compute decoupling scenarios, the cluster cannot obtain the latest temporary AK/SK, which may lead to OBS access failures. + +Possible Causes +--------------- + +- The meta role of the MRS cluster is abnormal. +- The cluster was bound to an agency to access OBS but has since been unbound. As a result, no agency is currently bound to the cluster. + +Procedure +--------- + +**Check the status of the meta role.** + +#. On FusionInsight Manager of the cluster, choose **O&M** > **Alarm** > **Alarms**. On the page that is displayed, click |image1| in the row containing the alarm, and determine the IP address of the host for which the alarm is generated. + +#. On FusionInsight Manager of the cluster, choose **Cluster** > **Services** > **Meta**. On the page that is displayed, click the **Instance** tab, and check whether the meta role corresponding to the host for which the alarm is generated is normal. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`3 `. + +#. .. _alm-12110__li183754571814: + + Select the abnormal role and choose **More** > **Restart Instance** to restart the abnormal meta role. After the restart is complete, check whether the alarm is cleared several minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Rebind the cluster to an agency.** + +4. .. _alm-12110__li1046594463117: + + Log in to the MRS management console. + +5. In the navigation pane on the left, choose **Clusters** > **Active Clusters**. On the page that is displayed, click the cluster name to go to its overview page. Then, check whether the cluster is bound to an agency in the O&M management area. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`6 `. + +6. .. _alm-12110__li5465164415315: + + Click **Manage Agency**. On the page that is displayed, rebind the cluster to an agency. Then check whether the alarm is cleared a few minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +7. .. _alm-12110__li546514420312: + + Contact O&M personnel. + +..
|image1| image:: /_static/images/en-us_image_0000001216164294.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12180_suspended_disk_i_o.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12180_suspended_disk_i_o.rst new file mode 100644 index 0000000..0bf555b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-12180_suspended_disk_i_o.rst @@ -0,0 +1,133 @@ +:original_name: ALM-12180.html + +.. _ALM-12180: + +ALM-12180 Suspended Disk I/O +============================ + +Description +----------- + +- For HDDs, the alarm is triggered when any of the following conditions is met: + + - The system collects data every 3 seconds, and detects that the **svctm** value exceeds 6s for 10 consecutive periods within 30 seconds. + - The system collects data every 3 seconds, and detects that the **avgqu-sz** value is greater than 0, the IOPS or bandwidth is 0, and the **ioutil** value is greater than **99%** for 10 consecutive periods within 30 seconds. + +- For SSDs, the alarm is triggered when any of the following conditions is met: + + - The system collects data every 3 seconds, and detects that the **svctm** value exceeds 3s for 10 consecutive periods within 30 seconds. + - The system collects data every 3 seconds, and detects that the **avgqu-sz** value is greater than 0, the IOPS or bandwidth is 0, and the **ioutil** value is greater than **99%** for 10 consecutive periods within 30 seconds. + +This alarm is automatically cleared when the preceding conditions have not been met for 90s. + +.. note:: + + - Run the following command in the OS to collect data: + + **iostat -x -t 1 1** + + |image1| + + Parameters are as follows: + + - **avgqu-sz** indicates the disk queue depth. + - The sum of **r/s** and **w/s** is the IOPS. + - The sum of **rkB/s** and **wkB/s** is the bandwidth. + - **%util** is the **ioutil** value. + + - The formula for calculating **svctm** is as follows: + + svctm = (tot_ticks_new - tot_ticks_old) / (rd_ios_new + wr_ios_new - rd_ios_old - wr_ios_old) + + If **rd_ios_new + wr_ios_new - rd_ios_old - wr_ios_old** is **0**, then **svctm** is **0**. + + The parameters can be obtained as follows: + + The system runs the **cat /proc/diskstats** command every 3 seconds to collect data. For example: + + |image2| + + In these two commands: + + In the data collected for the first time, the number in the fourth column is the **rd_ios_old** value, the number in the eighth column is the **wr_ios_old** value, and the number in the thirteenth column is the **tot_ticks_old** value. + + In the data collected for the second time, the number in the fourth column is the **rd_ios_new** value, the number in the eighth column is the **wr_ios_new** value, and the number in the thirteenth column is the **tot_ticks_new** value. 
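The same calculation can be scripted for manual verification. The following is a minimal sketch that samples **/proc/diskstats** twice, 3 seconds apart, for a single disk; **sda** is only an example device name, and the field positions follow the column mapping described above::

   #!/bin/bash
   # Sample rd_ios (column 4), wr_ios (column 8), and tot_ticks (column 13)
   # for one disk, twice, 3 seconds apart, then compute svctm.
   disk=sda
   read rd_old wr_old ticks_old < <(awk -v d="$disk" '$3==d {print $4, $8, $13}' /proc/diskstats)
   sleep 3
   read rd_new wr_new ticks_new < <(awk -v d="$disk" '$3==d {print $4, $8, $13}' /proc/diskstats)
   ios=$((rd_new + wr_new - rd_old - wr_old))
   if [ "$ios" -eq 0 ]; then
       # No completed I/Os in the interval, so svctm is defined as 0.
       echo "svctm = 0"
   else
       awk -v t=$((ticks_new - ticks_old)) -v n="$ios" 'BEGIN {printf "svctm = %.4f\n", t / n}'
   fi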
+ + In this case, the value of **svctm** is as follows: + + (19571460 - 19569526) / (1101553 + 28747977 - 1101553 - 28744856) = 0.6197 + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12180 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| DiskName | Specifies the disk for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +A continuously high I/O usage may adversely affect service operations and result in service loss. + +Possible Causes +--------------- + +The disk is aged. + +Procedure +--------- + +**Replace the disk.** + +#. Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Alarms**. +#. View the detailed information about the alarm. Check the values of **HostName** and **DiskName** in the location information to obtain the information about the faulty disk for which the alarm is reported. +#. Replace the hard disk. +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Collect fault information.** + +5. .. _alm-12180__li1050815217817: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +6. Select **OMS** for **Service** and click **OK**. + +7. Click |image3| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +8. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001375901064.png +.. |image2| image:: /_static/images/en-us_image_0000001426500589.png +.. |image3| image:: /_static/images/en-us_image_0000001405224197.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13000_zookeeper_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13000_zookeeper_service_unavailable.rst new file mode 100644 index 0000000..02c0595 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13000_zookeeper_service_unavailable.rst @@ -0,0 +1,200 @@ +:original_name: ALM-13000.html + +.. 
_ALM-13000: + +ALM-13000 ZooKeeper Service Unavailable +======================================= + +Description +----------- + +The system checks the ZooKeeper service status every 60 seconds. This alarm is generated when the ZooKeeper service is unavailable. + +This alarm is cleared when the ZooKeeper service recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +13000 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +ZooKeeper cannot provide coordination services for upper layer components and the components that depend on ZooKeeper may not run properly. + +Possible Causes +--------------- + +- The DNS is installed on the ZooKeeper node. +- The network is faulty. +- The KrbServer service is abnormal. +- The ZooKeeper instance is abnormal. +- The disk capacity is insufficient. + +Procedure +--------- + +**Check the DNS.** + +#. Check whether the DNS is installed on the node where the ZooKeeper instance is located. On the Linux node where the ZooKeeper instance is located, run the **cat /etc/resolv.conf** command to check whether the file is empty. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`3 `. + +#. .. _alm-13000__li76816175112: + + Run the **service named status** command to check whether the DNS is started. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-13000__li86969116511: + + Run the **service named stop** command to stop the DNS service. If "Shutting down name server BIND waiting for named to shut down (28s)" is displayed, the DNS service is stopped successfully. Comment out the content (if any) in **/etc/resolv.conf**. + +#. On the **O&M > Alarm > Alarms** tab, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check the network status.** + +5. .. _alm-13000__li42741615115119: + + On the Linux node where the ZooKeeper instance is located, run the **ping** command to check whether the host names of other nodes where the ZooKeeper instance is located can be pinged successfully. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`6 `. + +6. .. _alm-13000__li1227471525111: + + Modify the IP addresses in **/etc/hosts** and add the host name and IP address mapping. + +7. Run the **ping** command again to check whether the host names of other nodes where the ZooKeeper instance is located can be pinged successfully. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`23 `. + +8. .. _alm-13000__li129021555116: + + On the **O&M > Alarm > Alarms** tab, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Check the KrbServer service status (Skip this step if the normal mode is used).** + +9. .. _alm-13000__li26784425155523: + + On FusionInsight Manager, choose **Cluster >** *Name of the desired cluster* **> Services**. + +10. Check whether the KrbServer service is normal. 
+ + - If yes, go to :ref:`13 `. + - If no, go to :ref:`11 `. + +11. .. _alm-13000__li45270224155523: + + Perform operations based on "ALM-25500 KrbServer Service Unavailable" and check whether the KrbServer service is recovered. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`23 `. + +12. .. _alm-13000__li4125272155523: + + On the **O&M > Alarm > Alarms** tab, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`13 `. + +**Check the ZooKeeper service instance status.** + +13. .. _alm-13000__li21042948155523: + + On FusionInsight Manager, choose **Cluster >** *Name of the desired cluster* **> Services** > **ZooKeeper** > **quorumpeer**. + +14. Check whether the ZooKeeper instances are normal. + + - If yes, go to :ref:`18 `. + - If no, go to :ref:`15 `. + +15. .. _alm-13000__li36165444155523: + + Select instances whose status is not good, and choose **More** > **Restart Instance**. + +16. Check whether the instance status is good after restart. + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`18 `. + +17. .. _alm-13000__li65308695155523: + + On the **O&M > Alarm > Alarms** tab, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`18 `. + +**Check disk status.** + +18. .. _alm-13000__li253728155523: + + On FusionInsight Manager, choose **Cluster >** *Name of the desired cluster* **> Service** > **ZooKeeper** > **quorumpeer**, and check the node host information of the ZooKeeper instance. + +19. On FusionInsight Manager, click **Host**. + +20. In the **Disk** column, check whether the disk space of each node where ZooKeeper instances are located is insufficient (disk usage exceeds 80%). + + - If yes, go to :ref:`21 `. + - If no, go to :ref:`23 `. + +21. .. _alm-13000__li23393056155523: + + Expand disk capacity. For details, see "ALM-12017 Insufficient Disk Capacity". + +22. On the **O&M > Alarm > Alarms** tab, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`23 `. + +**Collect fault information.** + +23. .. _alm-13000__li42883384155523: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +24. Select the following nodes in the required cluster from the **Service**: (KrbServer logs do not need to be downloaded in normal mode.) + + - ZooKeeper + - KrbServer + +25. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +26. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383940.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13001_available_zookeeper_connections_are_insufficient.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13001_available_zookeeper_connections_are_insufficient.rst new file mode 100644 index 0000000..fcf4000 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13001_available_zookeeper_connections_are_insufficient.rst @@ -0,0 +1,139 @@ +:original_name: ALM-13001.html + +.. 
_ALM-13001: + +ALM-13001 Available ZooKeeper Connections Are Insufficient +========================================================== + +Description +----------- + +The system checks ZooKeeper connections every 60 seconds. This alarm is generated when the system detects that the number of used ZooKeeper instance connections exceeds the threshold (80% of the maximum connections). + +When the **Trigger Count** is 1, this alarm is cleared when the number of used ZooKeeper instance connections is smaller than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the number of used ZooKeeper instance connections is smaller than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +13001 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Available ZooKeeper connections are insufficient. When the connection usage reaches 100%, external connections cannot be handled. + +Possible Causes +--------------- + +The number of connections to the ZooKeeper node exceeds the threshold. Connection leakage occurs on some connection processes, or the maximum number of connections does not comply with the actual scenario. + +Procedure +--------- + +**Check connection status.** + +#. On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms**. On the displayed interface, click the drop-down button of **Available ZooKeeper Connections Are Insufficient** and confirm the node IP address of the host for which the alarm is generated in the Location Information. + +#. Obtain the PID of the ZooKeeper process. Log in to the node involved in this alarm as user **root** and run the **pgrep -f proc_zookeeper** command. + +#. Check whether the PID can be correctly obtained. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`15 `. + +#. .. 
_alm-13001__li1510096916326: + + Obtain all the IP addresses connected to the ZooKeeper instance and the number of connections from each, and check the 10 IP addresses with the most connections. Run the following command based on the obtained PID: **lsof -i|grep** *$pid* **\| awk '{print $9}' \| cut -d : -f 2 \| cut -d \\> -f 2 \| awk '{a[$1]++} END {for(i in a){print i,a[i] \| "sort -r -g -k 2"}}' \| head -10**. (The PID obtained in the preceding step is used.) + +#. Check whether node IP addresses and number of connections are successfully obtained. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`15 `. + +#. .. _alm-13001__li2774376816326: + + Obtain the ID of the port connected to the process. Run the following command based on the obtained PID and IP address: **lsof -i|grep** *$pid* **\| awk '{print $9}'|cut -d \\> -f 2 \|grep** *$IP*\ **\| cut -d : -f 2**. (The PID and IP address obtained in the preceding step are used.) + +#. Check whether the port ID is successfully obtained. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`15 `. + +#. .. _alm-13001__li4326299016326: + + Obtain the ID of the connected process. Log in to each IP address and run the following command based on the obtained port ID: **lsof -i|grep** *$port*. (The port ID obtained in the preceding step is used.) + +#. Check whether the process ID is successfully obtained. + + - If yes, go to :ref:`10 `. + - If no, go to :ref:`15 `. + +#. .. _alm-13001__li5163582316326: + + Check whether connection leakage occurs on the process based on the obtained process ID. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`12 `. + +#. .. _alm-13001__li1962513916326: + + Close the process where connection leakage occurs and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`12 `. + +#. .. _alm-13001__li6677995916326: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **ZooKeeper** > **Configurations** > **All** **Configurations** > **quorumpeer** > **Performance** and increase the value of **maxCnxns** as required. + +#. Save the configuration and restart the ZooKeeper service. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`15 `. + +**Collect fault information.** + +15. .. _alm-13001__li2333852416326: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +16. Select **ZooKeeper** in the required cluster from the **Service** drop-down list. + +17. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +18. Contact the O&M personnel and send the collected log information. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +..
|image1| image:: /_static/images/en-us_image_0269383942.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13002_zookeeper_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13002_zookeeper_direct_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..75f4eb9 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13002_zookeeper_direct_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,112 @@ +:original_name: ALM-13002.html + +.. _ALM-13002: + +ALM-13002 ZooKeeper Direct Memory Usage Exceeds the Threshold +============================================================= + +Description +----------- + +The system checks the direct memory usage of the ZooKeeper service every 30 seconds. The alarm is generated when the direct memory usage of a ZooKeeper instance exceeds the threshold (80% of the maximum memory). + +When the **Trigger Count** is 1, this alarm is cleared when the ZooKeeper Direct memory usage is less than the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the ZooKeeper Direct memory usage is less than 80% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +13002 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available direct memory of the ZooKeeper service is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct memory of the ZooKeeper instance is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +**Check the direct memory usage.** + +#. 
On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms**. On the displayed interface, click the drop-down button of **ZooKeeper Direct Memory Usage Exceeds the Threshold**. Check the IP address of the instance that reports the alarm. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **ZooKeeper** > **Instance** > **quorumpeer(the IP address checked)**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **CPU and Memory**, and select **ZooKeeper Heap And Direct Buffer Resource Percentage**, click **OK**. + +#. Check whether the used direct buffer memory of ZooKeeper reaches 80% of the maximum direct buffer memory specified for ZooKeeper. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`8 `. + +#. .. _alm-13002__li57922773161213: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **ZooKeeper** > **Configurations** > **All** **Configurations** > **quorumpeer** > **System** to check whether "-XX:MaxDirectMemorySize" exists in the **GC_OPTS** parameter. + + - If yes, in the **GC_OPTS** parameter, delete "-XX:MaxDirectMemorySize" and go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-13002__li51542910161213: + + Save the configuration and restart the ZooKeeper service. + +#. .. _alm-13002__li16393123713315: + + Check whether the **ALM-13004 ZooKeeper Heap Memory Usage Exceeds the Threshold** exists. + + - If yes, handle the alarm by referring to **ALM-13004 ZooKeeper Heap Memory Usage Exceeds the Threshold**. + - If no, go to :ref:`7 `. + +#. .. _alm-13002__li56397739161213: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Collect fault information.** + +8. .. _alm-13002__li43327670161213: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +9. Select **ZooKeeper** in the required cluster from the **Service**. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383943.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13003_gc_duration_of_the_zookeeper_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13003_gc_duration_of_the_zookeeper_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..90a90c9 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13003_gc_duration_of_the_zookeeper_process_exceeds_the_threshold.rst @@ -0,0 +1,106 @@ +:original_name: ALM-13003.html + +.. _ALM-13003: + +ALM-13003 GC Duration of the ZooKeeper Process Exceeds the Threshold +==================================================================== + +Description +----------- + +The system checks the garbage collection (GC) duration of the ZooKeeper process every 60 seconds. 
This alarm is generated when the GC duration exceeds the threshold (12 seconds by default). + +This alarm is cleared when the GC duration is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +13003 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +A long GC duration of the ZooKeeper process may interrupt the services. + +Possible Causes +--------------- + +The heap memory of the ZooKeeper process is overused or inappropriately allocated, causing frequent occurrence of the GC process. + +Procedure +--------- + +**Check the GC duration.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. On the displayed page, click the drop-down list of **GC Duration of the ZooKeeper Process Exceeds the Threshold**. View the IP address of the instance for which the alarm is generated. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **ZooKeeper** > **Instance** > **quorumpeer**. Click the drop-down list in the upper right corner of **Chart**, choose **Customize** > **GC**, select **ZooKeeper GC Duration per Minute**, and click **OK** to check the GC duration statistics of the ZooKeeper process collected every minute. + +#. Check whether the GC duration of the ZooKeeper process collected every minute exceeds the threshold (12 seconds by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`8 `. + +#. .. _alm-13003__li1332215316392: + + Check whether memory leakage occurs in the application. + +#. On the **Home** page of FusionInsight Manager, choose **Cluster** > **Services** > **ZooKeeper**. On the page that is displayed, click the **Configuration** tab then the **All Configurations** sub-tab, and select **quorumpeer** > **System**. Increase the value of the **GC_OPTS** parameter as required. + + .. note:: + + Generally, **-Xmx** is twice of ZooKeeper data capacity. If the capacity of ZooKeeper reaches 2 GB, set **GC_OPTS** as follows: + + -Xms4G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=512M -XX:MetaspaceSize=64M -XX:MaxMetaspaceSize=64M -XX:CMSFullGCsBeforeCompaction=1 + +#. Save the configuration and restart the ZooKeeper service. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Collect the fault information.** + +8. .. _alm-13003__li12535847161327: + + On FusionInsight Manager, choose **O&M**. 
In the navigation pane on the left, choose **Log** > **Download**. + +9. Expand the **Service** drop-down list, and select **ZooKeeper** for the target cluster. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895382.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13004_zookeeper_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13004_zookeeper_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..46f0be3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13004_zookeeper_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,101 @@ +:original_name: ALM-13004.html + +.. _ALM-13004: + +ALM-13004 ZooKeeper Heap Memory Usage Exceeds the Threshold +=========================================================== + +Description +----------- + +The system checks the heap memory usage of the ZooKeeper service every 60 seconds. The alarm is generated when the heap memory usage of a ZooKeeper instance exceeds the threshold (95% of the maximum memory). + +The alarm is cleared when the memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +13004 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available ZooKeeper heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The heap memory of the ZooKeeper instance is overused or the heap memory is inappropriately allocated. + +Procedure +--------- + +**Check heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms**. On the displayed interface, click the drop-down button of **ZooKeeper Heap Memory Usage Exceeds the Threshold** and confirm the node IP address of the host for which the alarm is generated in the Location Information. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **ZooKeeper** > **Instance**, click **quorumpeer** in the **Role** column of the corresponding IP address. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **CPU and Memory**, select **ZooKeeper Heap And Direct Buffer Resource Percentage**, and click **OK**. Check the heap memory usage. + +#. Check whether the used heap memory of ZooKeeper reaches 95% of the maximum heap memory specified for ZooKeeper. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-13004__li66283273161727: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **ZooKeeper** > **Configurations** > **All** **Configurations** > **quorumpeer** > **System**. Increase the value of **-Xmx** in **GC_OPTS** as required. The details are as follows: + + a. On the **Instance** tab, click **quorumpeer** in the **Role** column of the corresponding IP address. Choose **Customize** > **CPU and Memory** in the upper right corner, select **ZooKeeper Heap And Direct Buffer Resource**, and click **OK** to check the heap memory used by ZooKeeper. + b. Change the value of **-Xmx** in the **GC_OPTS** parameter based on the actual heap memory usage. Generally, the value is twice the size of the ZooKeeper data volume. For example, if the ZooKeeper data volume reaches 2 GB, the following configurations are recommended: -Xms4G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=512M -XX:MetaspaceSize=64M -XX:MaxMetaspaceSize=64M -XX:CMSFullGCsBeforeCompaction=1 + +#. Save the configuration and restart the ZooKeeper service. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-13004__li34986499161727: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **ZooKeeper** in the required cluster from the **Service** drop-down list. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +..
|image1| image:: /_static/images/en-us_image_0269383945.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13005_failed_to_set_the_quota_of_top_directories_of_zookeeper_components.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13005_failed_to_set_the_quota_of_top_directories_of_zookeeper_components.rst new file mode 100644 index 0000000..1bcde5a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13005_failed_to_set_the_quota_of_top_directories_of_zookeeper_components.rst @@ -0,0 +1,116 @@ +:original_name: ALM-13005.html + +.. _ALM-13005: + +ALM-13005 Failed to Set the Quota of Top Directories of ZooKeeper Components +============================================================================ + +Description +----------- + +The system sets quotas for each ZooKeeper top-level directory in the **customized.quota** configuration item and components every 5 hours. This alarm is generated when the system fails to set the quota for a directory. + +This alarm is cleared when the setting succeeds after a failure. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +13005 Minor Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+--------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| ServiceDirectory | Specifies the directory for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| Trigger Condition | Specifies the cause of the alarm. | ++-------------------+--------------------------------------------------------------+ + +Impact on the System +-------------------- + +Components can write a large amount of data to the top-level directory of ZooKeeper. As a result, the ZooKeeper service is unavailable. + +Possible Causes +--------------- + +The quota for the alarm directory is inappropriate. + +Procedure +--------- + +**Check whether the quota for the alarm directory is appropriate.** + +#. Log in to FusionInsight Manager, and choose **Cluster >** *Name of the desired cluster* **> Services** > **ZooKeeper**. On the displayed page, choose **Configurations** > **All Configurations** > **Quota**. Check whether the directory for which the alarm is reported and its quota exist in the **customized.quota** configuration item. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`2 `. + +#. .. _alm-13005__li8446514299: + + Check whether the alarm directory for which the alarm is reported is in the following alarm list. + + .. table:: **Table 1** Component alarm directory + + ========= =============== + Component Alarm Directory + ========= =============== + Hbase /hbase + Hive /beelinesql + Yarn /rmstore + Storm /stormroot + Streaming /storm + Kafka /kafka + ========= =============== + + - If yes, go to :ref:`3 `. 
+ - If no, go to :ref:`7 `. + +#. .. _alm-13005__li1454514461317: + + View the component of the alarm directory in the table, open the corresponding service page, and choose **Configurations** > **All Configurations**. On the displayed page, search for **zk.quota** in the upper right corner. The search result is the quota of the alarm directory. + +#. Check whether the quota of the alarm directory for which the alarm is reported is appropriate. The quota must be greater than or equal to the actual value, which can be obtained in **Trigger Condition**. + +#. .. _alm-13005__li38538052161727: + + Modify the **services.quota** value as prompted and save the configuration. + +#. After the time specified by **service.quotas.auto.check.cron.expression**, check whether the alarm is cleared. + + - If it is, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-13005__li34986499161727: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **ZooKeeper** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383946.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13006_znode_number_or_capacity_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13006_znode_number_or_capacity_exceeds_the_threshold.rst new file mode 100644 index 0000000..61fadf1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13006_znode_number_or_capacity_exceeds_the_threshold.rst @@ -0,0 +1,98 @@ +:original_name: ALM-13006.html + +.. _ALM-13006: + +ALM-13006 Znode Number or Capacity Exceeds the Threshold +======================================================== + +Description +----------- + +The system periodically detects the status of secondary Znode in the ZooKeeper service data directory every four hours. This alarm is generated when the number or capacity of secondary Znodes exceeds the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +13006 Minor Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+--------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| ServiceDirectory | Specifies the directory for which the alarm is generated. 
| ++-------------------+--------------------------------------------------------------+ +| Trigger Condition | Specifies the cause of the alarm. | ++-------------------+--------------------------------------------------------------+ + +Impact on the System +-------------------- + +A large amount of data is written to the ZooKeeper data directory. As a result, ZooKeeper cannot provide normal services. + +Possible Causes +--------------- + +A large amount of data is written to the ZooKeeper data directory, or the threshold is inappropriate. + +Procedure +--------- + +**Check whether a large amount of data is written to the directory for which the alarm is generated.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. On the displayed interface, click the drop-down button of **Znode Number or Capacity Exceeds the Threshold**. Confirm the Znode for which the alarm is generated in Location Information. + +#. Log in to FusionInsight Manager, open the ZooKeeper service interface, and select **Resource**. In the table **Used Resources (By Second-Level Znode)**, check whether a large amount of data is written to the top-level Znode for which the alarm is reported. + + - If it is, go to :ref:`3 `. + - If it is not, go to :ref:`4 `. + +#. .. _alm-13006__li47651981813: + + Log in to the ZooKeeper client and delete the data in the top-level Znode. + +#. .. _alm-13006__li3567539141610: + + Log in to FusionInsight Manager and open the ZooKeeper service interface. On the **Resource** page, choose |image1| > **By Znode quantity** in **Used Resources (By Second-Level Znode)**. **Threshold** **Configuration of By Znode quantity** is displayed. Click **Modify** under **Operation**. Increase the threshold by referring to the value of **max.Znode.count**, which you can view by choosing **Cluster >** *Name of the desired cluster* **> Services** > **ZooKeeper** > **Configurations > All Configurations** **> Quota**. + +#. In the **Used Resources (By Second-Level Znode)** area, choose |image2| > **By capacity**. The **Threshold Settings** page of **By Capacity** is displayed. Click **Modify** under **Operation**. Increase the threshold by referring to the value of **max.data.size**, which you can view by choosing **Cluster >** *Name of the desired cluster* **> Services** > **ZooKeeper** > **Configurations > All Configurations** > **Quota**. + +#. Check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-13006__li27634817118: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **ZooKeeper** in the required cluster from the **Service**. + +9. Click |image3| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001239732331.gif +.. |image2| image:: /_static/images/en-us_image_0000001239292351.gif +.. 
|image3| image:: /_static/images/en-us_image_0269383949.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13007_available_zookeeper_client_connections_are_insufficient.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13007_available_zookeeper_client_connections_are_insufficient.rst new file mode 100644 index 0000000..29c9548 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13007_available_zookeeper_client_connections_are_insufficient.rst @@ -0,0 +1,101 @@ +:original_name: ALM-13007.html + +.. _ALM-13007: + +ALM-13007 Available ZooKeeper Client Connections Are Insufficient +================================================================= + +Description +----------- + +The system periodically detects the number of active processes between the ZooKeeper client and the ZooKeeper server every 60 seconds. This alarm is generated when the number of connections exceeds the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +13007 Minor Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+--------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| HostName | Specifies the host name for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| ClientIP | Specifies the client IP address. | ++-------------------+--------------------------------------------------------------+ +| ServerIP | Specifies the server IP address. | ++-------------------+--------------------------------------------------------------+ +| Trigger Condition | Specifies the cause of the alarm. | ++-------------------+--------------------------------------------------------------+ + +Impact on the System +-------------------- + +A large number of connections to ZooKeeper caused the ZooKeeper to be fully connected and unable to provide normal services. + +Possible Causes +--------------- + +A large number of client processes are connected to ZooKeeper. The thresholds are not appropriate. + +Procedure +--------- + +**Check whether there are a large number of client processes connected to ZooKeeper.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. On the displayed interface, click the drop-down button of **Available ZooKeeper Client Connections Are Insufficient**. Confirm the node IP address of the host for which the alarm is generated in the Location Information. + +#. 
Open the ZooKeeper service interface, click **Resource** to enter the **Resource** page, and check whether the number of connections of the client with the IP address specified by **Number of Connections** **(By Client IP Address)** is large. + + - If it is, go to :ref:`3 `. + - If it is not, go to :ref:`4 `. + +#. .. _alm-13007__li9739531132620: + + Check whether connection leakage occurs on the client process. + +#. .. _alm-13007__li1373973122619: + + Click\ |image1| in the **Number of Connections** **(by Client IP Address)** to enter the **Thresholds** page, and click **Modify** under **Operation**. Increase the threshold by referring to the value of **maxClientCnxns** by choosing **Cluster >** *Name of the desired cluster* **> Services** > **ZooKeeper** > **Configurations > All Configurations > quorumpeer**. + +#. Check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-13007__li27361331112613: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +7. Select **ZooKeeper** in the required cluster from the **Service**. + +8. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383950.gif +.. |image2| image:: /_static/images/en-us_image_0269383952.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13008_zookeeper_znode_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13008_zookeeper_znode_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..7f2fce7 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13008_zookeeper_znode_usage_exceeds_the_threshold.rst @@ -0,0 +1,97 @@ +:original_name: ALM-13008.html + +.. _ALM-13008: + +ALM-13008 ZooKeeper Znode Usage Exceeds the Threshold +===================================================== + +Description +----------- + +The system checks the level-2 Znode status in the ZooKeeper data directory every hour. This alarm is generated when the system detects that the level-2 Znode usage exceeds the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +13008 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+--------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| ServiceDirectory | Specifies the directory for which the alarm is generated. 
| ++-------------------+--------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| Trigger Condition | Specifies the cause of the alarm. | ++-------------------+--------------------------------------------------------------+ + +Impact on the System +-------------------- + +A large amount of data is written to the ZooKeeper data directory. As a result, ZooKeeper cannot provide services properly. + +Possible Causes +--------------- + +- A large amount of data is written to the ZooKeeper data directory. +- The user-defined threshold is inappropriate. + +Procedure +--------- + +**Check whether a large amount of data is written into the directory for which the alarm is generated.** + +#. Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **ZooKeeper**, and click **Resource**. Click **By Znode quantity** in **Used Resources (By Second-Level Znode)**, and check whether a large amount of data is written to the top Znode. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`4 `. + +#. .. _alm-13008__li787172215383: + + Log in to FusionInsight Manager, choose **O&M > Alarm > Alarms**, select **Location** from the drop-down list box next to **ALM-13008 ZooKeeper Znode Quantity Usage Exceeds Threshold**, and obtain the Znode path in **ServiceDirectory**. + +#. Log in to the ZooKeeper client as a cluster user and delete unnecessary data from the Znode corresponding to the alarm. + +#. .. _alm-13008__li10279134491613: + + Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **ZooKeeper** > **Configurations** > **All Configurations**, and search for **max.znode.count**, which is the maximum number of ZooKeeper directories. The alarm threshold is 80% of this parameter. Increase the value of this parameter, click **Save**, and restart the service for the configuration to take effect. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-13008__li180651333416: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +7. Select **ZooKeeper** in the required cluster from the **Service**. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383953.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13009_zookeeper_znode_capacity_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13009_zookeeper_znode_capacity_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..88157fd --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13009_zookeeper_znode_capacity_usage_exceeds_the_threshold.rst @@ -0,0 +1,128 @@ +:original_name: ALM-13009.html + +.. 
_ALM-13009: + +ALM-13009 ZooKeeper Znode Capacity Usage Exceeds the Threshold +============================================================== + +Description +----------- + +The system checks the level-2 ZNode status in the ZooKeeper data directory every hour. This alarm is generated when the system detects that the capacity usage exceeds the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +13009 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-----------------------------------------------------------+ +| Name | Meaning | ++===================+===========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| ServiceDirectory | Specifies the directory for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-----------------------------------------------------------+ + +Impact on the System +-------------------- + +A large amount of data is written to the ZooKeeper data directory. As a result, ZooKeeper cannot provide services properly. + +Possible Causes +--------------- + +- A large volume of data has been written to the ZooKeeper data directory. +- The threshold is improperly defined. + +Procedure +--------- + +**Check whether a large volume of data is written to the alarm directory.** + +#. On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. Click the drop-down list in the row containing **ALM-13009 ZooKeeper ZNode Capacity Usage Exceeds the Threshold**, and find the ZNode for which the alarm is generated in the **Location** area. + +#. Choose **Cluster** > **Services** > **ZooKeeper**. On the page that is displayed, click the **Resource** tab. In the **Used Resources (By Second-Level ZNode)** area, click **By capacity** and check whether a large amount of data is written to the top-level ZNode directory. + + - If yes, record the directory to which a large amount of data is written and go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-13009__li151971257113310: + + Check whether data in the directory can be deleted. + + .. important:: + + Deleting data from ZooKeeper is a high-risk operation. Exercise caution when performing this operation. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +#. .. _alm-13009__li40737202161840: + + Log in to the ZooKeeper client and delete unnecessary data from the directory to which a large amount of data is written. + + a. Log in to the ZooKeeper client installation directory, for example, **/opt/client**, and configure environment variables. + + **cd /opt/client** + + **source bigdata_env** + + b. Run the following command to authenticate the user (skip this step for a cluster in normal mode): + + **kinit** *Component service user* + + c. 
Run the following command to log in to the client tool: + + **zkCli.sh -server** **<**\ *Service IP address of the node where any ZooKeeper instance resides*\ **>:<**\ *Client port*\ **>** + + d. Run the following command to delete unnecessary data: + + **delete** *Path of the file to be deleted* + +#. .. _alm-13009__li1932073512913: + + Log in to FusionInsight Manager and choose **Cluster** > **Services** > **ZooKeeper**. On the page that is displayed, click the **Configuration** tab then the **All Configurations** sub-tab, and search for **max.data.size**. The value of **max.data.size** is the maximum capacity quota of the ZooKeeper directory. The unit is byte. Search for the **GC_OPTS** configuration item and check the value of **Xmx**. + +#. Compare the values of **max.data.size** and **Xmx*0.65**. The threshold is the smaller value multiplied by 80%. You can change the values of **max.data.size** and **Xmx*0.65** to increase the threshold. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Collect the fault information.** + +8. .. _alm-13009__li57092876161840: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +9. Expand the **Service** drop-down list, and select **ZooKeeper** for the target cluster. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895683.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13010_znode_usage_of_a_directory_with_quota_configured_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13010_znode_usage_of_a_directory_with_quota_configured_exceeds_the_threshold.rst new file mode 100644 index 0000000..870aa9b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-13010_znode_usage_of_a_directory_with_quota_configured_exceeds_the_threshold.rst @@ -0,0 +1,103 @@ +:original_name: ALM-13010.html + +.. _ALM-13010: + +ALM-13010 Znode Usage of a Directory with Quota Configured Exceeds the Threshold +================================================================================ + +Description +----------- + +The system checks the Znode usage of all service directories with quota configured every hour. This alarm is generated when the system detects that the level-2 Znode usage exceeds the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +13010 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+--------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================+ +| Source | Specifies the cluster for which the alarm is generated. 
| ++-------------------+--------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| ServiceDirectory | Specifies the directory for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| Trigger Condition | Specifies the cause of the alarm. | ++-------------------+--------------------------------------------------------------+ + +Impact on the System +-------------------- + +A large amount of data is written to the ZooKeeper data directory. As a result, ZooKeeper cannot provide services properly. + +Possible Causes +--------------- + +- A large amount of data is written to the ZooKeeper data directory. +- The user-defined threshold is inappropriate. + +Procedure +--------- + +**Check whether a large amount of data is written into the directory for which the alarm is generated.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm > Alarms**. Confirm the Znode for which the alarm is generated in **Location** of this alarm. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **ZooKeeper** and click **Resource**. In **Used Resources (By Second-Level Znode)**, check whether a large amount of data is written into the top Znode. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +#. Log in to FusionInsight Manager, choose **O&M > Alarm > Alarms**, select **Location** from the drop-down list box next to **ALM-13010 Znode Usage of a Directory with Quota Configured Exceeds the Threshold**, and obtain the Znode path in **ServiceDirectory**. + +#. .. _alm-13010__li1298122393514: + + Log in to the ZooKeeper client as a cluster user and delete unwanted data in the Znode for which the alarm is generated. + +#. .. _alm-13010__li598192363510: + + Log in to FusionInsight Manager, and choose **Cluster** > *Name of the desired cluster* > **Services** > *Component of the top Znode for which the alarm is generated*. Choose **Configurations** > **All Configurations**, search for **zk.quota.number**, increase its value, and click **Save**. + + .. important:: + + If the component of the top Znode for which the alarm is generated is ClickHouse, change the value of **clickhouse.zookeeper.quota.node.count**. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-13010__li13978523123518: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **ZooKeeper** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269383956.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14000_hdfs_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14000_hdfs_service_unavailable.rst new file mode 100644 index 0000000..27c665a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14000_hdfs_service_unavailable.rst @@ -0,0 +1,119 @@ +:original_name: ALM-14000.html + +.. _ALM-14000: + +ALM-14000 HDFS Service Unavailable +================================== + +Description +----------- + +The system checks the NameService service status every 60 seconds. This alarm is generated when all the NameService services are abnormal and the system considers that the HDFS service is unavailable. + +This alarm is cleared when at least one NameService service is normal and the system considers that the HDFS service recovers. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14000 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +HDFS fails to provide services for HDFS service-based upper-layer components, such as HBase and MapReduce. As a result, users cannot read or write files. + +Possible Causes +--------------- + +- The ZooKeeper service is abnormal. +- All NameService services are abnormal. + +Procedure +--------- + +**Check the ZooKeeper service status.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms**. On the Alarm page, check whether **ALM-13000 ZooKeeper Service Unavailable** is reported. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`4 `. + +#. .. _alm-14000__li31253013162114: + + See **ALM-13000 ZooKeeper Service Unavailable** to rectify the health status of ZooKeeper fault and check whether the **Running** **Status** of the ZooKeeper service restores to **Normal**. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`7 `. + +#. .. _alm-14000__li19570784162114: + + On the **O&M > Alarm > Alarms** page, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Handle the NameService service exception alarm.** + +4. .. _alm-14000__li57039289162114: + + On the FusionInsight Manager portal, choose **O&M > Alarm** **> Alarms**. On the Alarms page, check whether **ALM-14010 NameService Service Unavailable** is reported. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`7 `. + +5. .. _alm-14000__li25596313162114: + + See **ALM-14010 NameService Service Unavailable** to handle the abnormal NameService services and check whether each NameService service exception alarm is cleared. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`7 `. + +6. .. 
_alm-14000__li7149629162114: + + On the **O&M > Alarm > Alarms** page, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-14000__li44697640162114: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select the following nodes in the required cluster from the **Service**: + + - ZooKeeper + - HDFS + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383957.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14001_hdfs_disk_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14001_hdfs_disk_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..843b81d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14001_hdfs_disk_usage_exceeds_the_threshold.rst @@ -0,0 +1,129 @@ +:original_name: ALM-14001.html + +.. _ALM-14001: + +ALM-14001 HDFS Disk Usage Exceeds the Threshold +=============================================== + +Description +----------- + +The system checks the HDFS disk usage every 30 seconds and compares the actual HDFS disk usage with the threshold. The HDFS disk usage indicator has a default threshold, this alarm is generated when the value of the disk usage of a Hadoop distributed file system (HDFS) indicator exceeds the threshold. + +To change the threshold, choose **O&M** > **Alarm >** **Thresholds** > *Name of the desired cluster* **>** **HDFS**. + +When the **Trigger Count** is 1, this alarm is cleared when the value of the disk usage of HDFS cluster indicator is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the value of the disk usage of HDFS cluster indicator is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14001 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| NameServiceName | Specifies the NameService for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Writing Hadoop distributed file system (HDFS) data is affected. + +Possible Causes +--------------- + +The disk space configured for the HDFS cluster is insufficient. + +Procedure +--------- + +**Check the disk capacity and delete unnecessary files.** + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS**. + +#. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Disk**, and select **Percentage of HDFS Capacity** to check whether the HDFS disk usage exceeds the threshold (80% by default). + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`11 `. + +#. .. _alm-14001__li25340212162555: + + In the **Basic Information** area, click the **NameNode(Active)** of the failure NameService and the HDFS WebUI page is displayed. + + .. note:: + + By default, the **admin** user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component. + +#. On the HDFS web user interface (WebUI), click **Datanodes** tab. In the **Block pool used** column, view the disk usage of all DataNodes to check whether the disk usage of any DataNode exceeds the threshold. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`11 `. + +#. Log in to the MRS client node as user **root**. + +#. .. _alm-14001__li13663615162555: + + Run **cd /opt/client** to switch to the client installation directory, and run **source bigdata_env**. If the cluster uses the security mode, perform security authentication. Run **kinit hdfs** and enter the password as prompted. Please obtain the password from the administrator. + +#. Run the **hdfs dfs -rm -r** *file or directory* command to delete unnecessary files. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Expand the system.** + +9. .. _alm-14001__li35348823162555: + + Expand the disk capacity. + +10. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Collect fault information.** + +11. .. _alm-14001__li18825614162555: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +12. Select the following nodes in the required cluster from the **Service**: + + - ZooKeeper + - HDFS + +13. 
Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +14. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383958.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14002_datanode_disk_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14002_datanode_disk_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..fe7de35 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14002_datanode_disk_usage_exceeds_the_threshold.rst @@ -0,0 +1,132 @@ +:original_name: ALM-14002.html + +.. _ALM-14002: + +ALM-14002 DataNode Disk Usage Exceeds the Threshold +=================================================== + +Description +----------- + +The system checks the DataNode disk usage every 30 seconds and compares the actual disk usage with the threshold. A default threshold range is provided for the DataNode disk usage. This alarm is generated when the DataNode disk usage exceeds the threshold. + +To change the threshold, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **HDFS**. + +If **Trigger Count** is **1**, this alarm is cleared when the DataNode disk usage is less than or equal to the threshold. If **Trigger Count** is greater than **1**, this alarm is cleared when the DataNode disk usage is less than or equal to 80% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14002 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Insufficient disk space will impact data write to HDFS. + +Possible Causes +--------------- + +- The disk space configured for the HDFS cluster is insufficient. +- Data skew occurs among DataNodes. + +Procedure +--------- + +**Check whether the cluster disk capacity is full.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**, and check whether the **ALM-14001 HDFS Disk Usage Exceeds the Threshold** alarm exists. 
+ + - If yes, go to :ref:`2 `. + - If no, go to :ref:`4 `. + +#. .. _alm-14002__li48847933162749: + + Handle the alarm by following the instructions in **ALM-14001 HDFS Disk Usage Exceeds the Threshold** and check whether the alarm is cleared. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`11 `. + +#. .. _alm-14002__li5500455162749: + + Choose **O&M** > **Alarm** > **Alarms** and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check the balance status of DataNodes.** + +4. .. _alm-14002__li49504103162749: + + On FusionInsight Manager, choose **Hosts**. Check whether the number of DataNodes on each rack is almost the same. If the difference is large, adjust the racks to which DataNodes belong to ensure that the number of DataNodes on each rack is almost the same. Restart the HDFS service for the settings to take effect. + +5. Choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS**. + +6. In the **Basic Information** area, click **NameNode(Active)**. The HDFS web UI is displayed. + + .. note:: + + By default, the **admin** user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component. + +7. In the **Summary** area of the HDFS web UI, check whether the value of **Max** is 10% greater than that of **Median** in **DataNodes usages**. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`11 `. + +8. .. _alm-14002__li25048823162749: + + Balance skewed data in the cluster. Log in to the MRS client as user **root**. If the cluster is in normal mode, run the **su - omm** command to switch to user **omm**. Run the **cd** command to go to the client installation directory and run the **source bigdata_env** command. If the cluster uses the security mode, perform security authentication. Run **kinit hdfs** and enter the password as prompted. Obtain the password from the MRS cluster administrator. + +9. Run the following command to balance data distribution: + + **hdfs balancer -threshold 10** + +10. Wait several minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Collect the fault information.** + +11. .. _alm-14002__li17443443162749: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +12. Expand the drop-down list next to the **Service** field. In the **Services** dialog box that is displayed, select **HDFS** for the target cluster. + +13. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +14. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0263895382.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14003_number_of_lost_hdfs_blocks_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14003_number_of_lost_hdfs_blocks_exceeds_the_threshold.rst new file mode 100644 index 0000000..cdb2301 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14003_number_of_lost_hdfs_blocks_exceeds_the_threshold.rst @@ -0,0 +1,162 @@ +:original_name: ALM-14003.html + +.. _ALM-14003: + +ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold +========================================================== + +Description +----------- + +The system checks the lost blocks every 30 seconds and compares the actual lost blocks with the threshold. The lost blocks indicator has a default threshold. This alarm is generated when the number of lost HDFS blocks exceeds the threshold. + +To change the threshold, choose **O&M** > **Alarm >** **Thresholds** > *Name of the desired cluster* **>** **HDFS**. + +If **Trigger Count** is **1**, this alarm is cleared when the value of lost HDFS blocks is less than or equal to the threshold. If **Trigger Count** is greater than **1**, this alarm is cleared when the value of lost HDFS blocks is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14003 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------+ +| Name | Meaning | ++===================+=============================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| NameServiceName | Specifies the NameService for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------+ + +Impact on the System +-------------------- + +Data stored in HDFS is lost. HDFS may enter the safe mode and cannot provide write services. Lost block data cannot be restored. + +Possible Causes +--------------- + +- The DataNode instance is abnormal. +- Data is deleted. + +Procedure +--------- + +**Check the DataNode instance.** + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Instance**. + +#. Check whether the **Running** **Status** of all DataNode instance is **Normal**. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`3 `. + +#. .. 
_alm-14003__li6471267163156: + + Restart the DataNode instance and check whether the DataNode instance restarts successfully. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +#. .. _alm-14003__li177391556152310: + + Choose **O&M** > **Alarm** > **Alarms** and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Delete the damaged file.** + +5. .. _alm-14003__li58241411163156: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **NameNode(Active)**. On the WebUI page of the HDFS, view the information about lost blocks. + + .. note:: + + - If a block is lost, a line in red is displayed on the WebUI. + - By default, the **admin** user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component. + +6. The user checks whether the file containing the lost data block is useful. + + .. note:: + + Files generated in directories **/mr-history**, **/tmp/hadoop-yarn**, and **/tmp/logs** during MapReduce task execution are unnecessary. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`8 `. + +7. .. _alm-14003__li7098948163156: + + The user checks whether the file containing the lost data block is backed up. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`11 `. + +8. .. _alm-14003__li2696171714538: + + Log in to the HDFS client as user **root**. The user password is defined by the user before the installation. Contact the MRS cluster administrator to obtain the password. Run the following commands: + + - Security mode: + + **cd** *Client installation directory* + + **source bigdata_env** + + **kinit hdfs** + + - Normal mode: + + **su - omm** + + **cd** *Client installation directory* + + **source bigdata_env** + +9. On the node client, run **hdfs fsck / -delete** to delete the lost file. If the file where the lost block is located is a useful file, you need to write the file again to restore the data. + + .. note:: + + Deleting a file or folder is a high-risk operation. Ensure that the file or folder is no longer required before performing this operation. + +10. Choose **O&M** > **Alarm** > **Alarms** and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Collect the fault information.** + +11. .. _alm-14003__li19356361163156: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +12. Expand the drop-down list next to the **Service** field. In the **Services** dialog box that is displayed, select **HDFS** for the target cluster. + +13. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +14. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269383960.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14006_number_of_hdfs_files_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14006_number_of_hdfs_files_exceeds_the_threshold.rst new file mode 100644 index 0000000..2bfea31 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14006_number_of_hdfs_files_exceeds_the_threshold.rst @@ -0,0 +1,139 @@ +:original_name: ALM-14006.html + +.. _ALM-14006: + +ALM-14006 Number of HDFS Files Exceeds the Threshold +==================================================== + +Description +----------- + +The system periodically checks the number of HDFS files every 30 seconds and compares the number of HDFS files with the threshold. This alarm is generated when the system detects that the number of HDFS files exceeds the threshold. + +If **Trigger Count** is **1**, this alarm is cleared when the number of HDFS files is less than or equal to the threshold. If **Trigger Count** is greater than **1**, this alarm is cleared when the number of HDFS files is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14006 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------+ +| Name | Meaning | ++===================+=============================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| NameServiceName | Specifies the NameService for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------+ + +Impact on the System +-------------------- + +Disk storage space is insufficient, which may result in data import failure. The performance of the HDFS system is affected. + +Possible Causes +--------------- + +The number of HDFS files exceeds the threshold. + +Procedure +--------- + +**Check the number of files in the system.** + +#. On FusionInsight Manager, check the number of HDFS files. Specifically, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **File and Block**, and select **HDFS File** and **Total Blocks**. +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Configurations** > **All Configurations**, and search for the **GC_OPTS** parameter under **NameNode**. +#. 
Configure the threshold of the number of configuration file objects. Specifically, change the value of **Xmx** (GB) in the **GC_OPTS** parameter. The threshold (specified by y) is calculated as follows: y = 0.2007 x Xmx - 0.6312, where x indicates the memory capacity Xmx (GB) and y indicates the number of files (unit: kW). Adjust the memory size as required. +#. Confirm that the value of **GC_PROFILE** is **custom** so that the **GC_OPTS** configuration takes effect. Click **Save** and choose **More** > **Restart Instance** to restart the service. +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Check whether needless files exist in the system.** + +6. .. _alm-14006__li57477018164725: + + Log in to the HDFS client as user **root**. Run **cd** to switch to the client installation directory, and run **source bigdata_env** to configure the environment variables. + + If the cluster uses the security mode, perform security authentication. + + Run the **kinit hdfs** command and enter the password as prompted. Obtain the password from the MRS cluster administrator. + +7. Run **hdfs dfs -ls** *file or directory* to check whether the files in the directory can be deleted. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +8. .. _alm-14006__li46417503164725: + + Run the **hdfs dfs -rm -r** *file or directory path* command. After deleting unnecessary files, wait until the files are retained in the recycle bin for a period longer than the value of **fs.trash.interval** on the NameNode. Then check whether the alarm is cleared. + + .. note:: + + Deleting a file or folder is a high-risk operation. Ensure that the file or folder is no longer required before performing this operation. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect the fault information.** + +9. .. _alm-14006__li15104347164725: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +10. Expand the drop-down list next to the **Service** field. In the **Services** dialog box that is displayed, select **HDFS** for the target cluster. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +**Configuration rules of the NameNode JVM parameter** + +Default value of the NameNode JVM parameter **GC_OPTS**: + +-Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=128M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Djdk.tls.ephemeralDHKeySize=3072 -Djdk.tls.rejectClientInitiatedRenegotiation=true -Djava.io.tmpdir=${Bigdata_tmp_dir} + +The number of NameNode files is proportional to the used memory size of the NameNode. When file objects change, you need to change **-Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M** in the default value. The following table lists the reference values. + +.. 
table:: **Table 1** NameNode JVM configuration + + +------------------------+------------------------------------------------------+ + | Number of File Objects | Reference Value | + +========================+======================================================+ + | 10,000,000 | -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M | + +------------------------+------------------------------------------------------+ + | 20,000,000 | -Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G | + +------------------------+------------------------------------------------------+ + | 50,000,000 | -Xms32G -Xmx32G -XX:NewSize=3G -XX:MaxNewSize=3G | + +------------------------+------------------------------------------------------+ + | 100,000,000 | -Xms64G -Xmx64G -XX:NewSize=6G -XX:MaxNewSize=6G | + +------------------------+------------------------------------------------------+ + | 200,000,000 | -Xms96G -Xmx96G -XX:NewSize=9G -XX:MaxNewSize=9G | + +------------------------+------------------------------------------------------+ + | 300,000,000 | -Xms164G -Xmx164G -XX:NewSize=12G -XX:MaxNewSize=12G | + +------------------------+------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0269383961.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14007_namenode_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14007_namenode_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..22b5150 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14007_namenode_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,143 @@ +:original_name: ALM-14007.html + +.. _ALM-14007: + +ALM-14007 NameNode Heap Memory Usage Exceeds the Threshold +========================================================== + +Description +----------- + +The system checks the HDFS NameNode Heap Memory usage every 30 seconds and compares the actual Heap memory usage with the threshold. The HDFS NameNode Heap Memory usage has a default threshold. This alarm is generated when the HDFS NameNode Heap Memory usage exceeds the threshold. + +You can change the threshold in **O&M** > **Alarm >** **Thresholds** > *Name of the desired cluster* **>** **HDFS**. + +When the **Trigger Count** is 1, this alarm is cleared when the HDFS NameNode Heap memory usage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the HDFS NameNode Heap memory usage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14007 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The HDFS NameNode Heap Memory usage is too high, which affects the data read/write performance of the HDFS. + +Possible Causes +--------------- + +The HDFS NameNode Heap Memory is insufficient. + +Procedure +--------- + +**Delete unnecessary files.** + +#. Log in to the HDFS client as user **root**. Run **cd** to switch to the client installation directory, and run **source bigdata_env**. + + If the cluster uses the security mode, perform security authentication. + + Run the **kinit hdfs** command and enter the password as prompted. Obtain the password from the administrator. + +#. Run the **hdfs dfs -rm -r** *file or directory* command to delete unnecessary files. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check the NameNode JVM memory usage and configuration.** + +4. .. _alm-14007__li15187254165341: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS**. + +5. In the **Basic Information** area, click **NameNode(Active)** to go to the HDFS WebUI. + + .. note:: + + By default, the **admin** user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component. + +6. .. _alm-14007__li13697230165341: + + On the HDFS WebUI, click the **Overview** tab. In **Summary**, check the numbers of files, directories, and blocks in the HDFS. + +7. .. _alm-14007__li46442940165341: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Configurations** > **All** **Configurations**. In **Search**, enter **GC_OPTS** to check the **GC_OPTS** memory parameter of **HDFS->NameNode**. + +**Adjust the configuration in the system.** + +8. Check whether the memory is configured properly based on the number of files in :ref:`6 ` and the NameNode Heap Memory parameters in :ref:`7 `. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`11 `. + + .. 
note:: + + The recommended mapping between the number of HDFS file objects (filesystem objects = files + blocks) and the JVM parameters configured for NameNode is as follows: + + - If the number of file objects reaches 10,000,000, you are advised to set the JVM parameters as follows: -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M + - If the number of file objects reaches 20,000,000, you are advised to set the JVM parameters as follows: -Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G + - If the number of file objects reaches 50,000,000, you are advised to set the JVM parameters as follows: -Xms32G -Xmx32G -XX:NewSize=3G -XX:MaxNewSize=3G + - If the number of file objects reaches 100,000,000, you are advised to set the JVM parameters as follows: -Xms64G -Xmx64G -XX:NewSize=6G -XX:MaxNewSize=6G + - If the number of file objects reaches 200,000,000, you are advised to set the JVM parameters as follows: -Xms96G -Xmx96G -XX:NewSize=9G -XX:MaxNewSize=9G + - If the number of file objects reaches 300,000,000, you are advised to set the JVM parameters as follows: -Xms164G -Xmx164G -XX:NewSize=12G -XX:MaxNewSize=12G + +9. .. _alm-14007__li14671769165341: + + Modify the heap memory parameters of the NameNode based on the mapping between the number of file objects and the memory. Click **Save** and choose **Dashboard** > **More** > **Restart Service**. + +10. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Collect fault information.** + +11. .. _alm-14007__li58431113165341: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +12. Select the following nodes in the required cluster from the **Service**: + + - ZooKeeper + - HDFS + +13. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +14. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383962.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14008_datanode_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14008_datanode_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..bd3d676 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14008_datanode_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,136 @@ +:original_name: ALM-14008.html + +.. _ALM-14008: + +ALM-14008 DataNode Heap Memory Usage Exceeds the Threshold +========================================================== + +Description +----------- + +The system checks the HDFS DataNode Heap Memory usage every 30 seconds and compares the actual Heap Memory usage with the threshold. The HDFS DataNode Heap Memory usage has a default threshold. This alarm is generated when the HDFS DataNode Heap Memory usage exceeds the threshold. + +You can change the threshold in **O&M** > **Alarm >** **Thresholds** > *Name of the desired cluster* **>** **HDFS**. 
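+You can also spot-check the current DataNode heap usage outside FusionInsight Manager: every HDFS daemon exposes its JVM memory figures through the JMX servlet. The following is only a minimal sketch; the host, port, and protocol are assumptions that depend on the Hadoop version and on whether the cluster runs in security mode (in which case HTTPS and authentication are required)::
+
+   # Query the DataNode JVM memory bean and print the heap usage percentage.
+   # Replace datanode-host:9864 with the HTTP(S) address of your DataNode.
+   curl -s 'http://datanode-host:9864/jmx?qry=java.lang:type=Memory' \
+     | python3 -c 'import json,sys; m = json.load(sys.stdin)["beans"][0]["HeapMemoryUsage"]; print("Heap used: %.1f%%" % (100.0 * m["used"] / m["max"]))'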
+ +When the **Trigger Count** is 1, this alarm is cleared when the HDFS DataNode Heap Memory usage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the HDFS DataNode Heap Memory usage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14008 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The HDFS DataNode Heap Memory usage is too high, which affects the data read/write performance of the HDFS. + +Possible Causes +--------------- + +The HDFS DataNode Heap Memory is insufficient. + +Procedure +--------- + +**Delete unnecessary files.** + +#. Log in to the HDFS client as user **root**. Run **cd** to switch to the client installation directory, and run **source bigdata_env**. + + If the cluster uses the security mode, perform security authentication. + + Run the **kinit hdfs** command and enter the password as prompted. Obtain the password from the administrator. + +#. Run the **hdfs dfs -rm -r** *file or directory* command to delete unnecessary files. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check the DataNode JVM memory usage and configuration.** + +4. .. _alm-14008__li552961441706: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS**. + +5. In the **Basic Information** area, click **NameNode(Active)** to go to the HDFS WebUI. + + .. note:: + + By default, the **admin** user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component. + +6. .. 
_alm-14008__li2292511706: + + On the HDFS WebUI, click the **DataNodes** tab, and check the number of blocks of all DataNodes related to the alarm. + +7. .. _alm-14008__li421758201706: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Configurations** > **All Configurations**. In **Search**, enter **GC_OPTS** to check the GC_OPTS memory parameter of **HDFS->DataNode**. + +**Adjust the configuration in the system.** + +8. Check whether the memory is configured properly based on the number of block in :ref:`6 ` and the DataNode Heap Memory parameters in :ref:`7 `. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`11 `. + + .. note:: + + The mapping between the average number of blocks of a DataNode instance and the DataNode memory is as follows: + + - If the average number of blocks of a DataNode instance reaches 2,000,000, the reference values of the JVM parameters of the DataNode are as follows: -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M + - If the average number of blocks of a DataNode instance reaches 5,000,000, the reference values of the JVM parameters of the DataNode are as follows: -Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G + +9. .. _alm-14008__li84133131706: + + Modify the heap memory parameters of the DataNode based on the mapping between the number of blocks and the memory. Click **Save** and choose **Dashboard** > **More** > **Restart Service**. + +10. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Collect fault information.** + +11. .. _alm-14008__li435105481706: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +12. Select **HDFS** in the required cluster from the **Service**. + +13. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +14. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383963.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14009_number_of_dead_datanodes_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14009_number_of_dead_datanodes_exceeds_the_threshold.rst new file mode 100644 index 0000000..5d1038a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14009_number_of_dead_datanodes_exceeds_the_threshold.rst @@ -0,0 +1,191 @@ +:original_name: ALM-14009.html + +.. _ALM-14009: + +ALM-14009 Number of Dead DataNodes Exceeds the Threshold +======================================================== + +Description +----------- + +The system periodically detects the number of dead DataNodes in the HDFS cluster every 30 seconds, and compares the number with the threshold. The number of DataNodes in the Dead state has a default threshold. This alarm is generated when the number exceeds the threshold. + +You can change the threshold in **O&M** > **Alarm >** **Thresholds** > *Name of the desired cluster* **>** **HDFS**. 
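+Besides watching the alarm, you can list the DataNodes that the active NameNode currently considers dead from an HDFS client. This is only a minimal sketch; it assumes the client environment variables have already been loaded with **source bigdata_env** and that, in security mode, **kinit** has been run for a user with HDFS administrator rights::
+
+   # Print a report that contains only the DataNodes in the Dead state.
+   hdfs dfsadmin -report -dead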
+ +When the **Trigger Count** is 1, this alarm is cleared when the number of Dead DataNodes is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the number of Dead DataNodes is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14009 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| NameServiceName | Specifies the NameService for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +DataNodes that are in the Dead state cannot provide HDFS services. + +Possible Causes +--------------- + +- DataNodes are faulty or overloaded. +- The network between the NameNode and the DataNode is disconnected or busy. +- NameNodes are overloaded. +- The NameNodes are not restarted after the DataNode is deleted. + +Procedure +--------- + +**Check whether DataNodes are faulty.** + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS**. The **HDFS Status** page is displayed. + +#. In the **Basic Information** area, click **NameNode(Active)** to go to the HDFS WebUI. + + .. note:: + + By default, the **admin** user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component. + +#. On the HDFS WebUI, click the **Datanodes** tab. In the **In operation** area, click **Filter** to check whether **down** is in the drop-down list. + + - If yes, select **down**, record the information about the filtered DataNodes, and go to :ref:`4 `. 
+ - If no, go to :ref:`8 `. + +#. .. _alm-14009__li4499900717545: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* > **Services** > **HDFS** > **Instance** to check whether recorded DataNodes exist in the instance list. + + - If all recorded DataNodes exist, go to :ref:`5 `. + - If none of the recorded DataNodes exists, go to :ref:`6 `. + - If some of the recorded DataNodes exist, go to :ref:`7 `. + +#. .. _alm-14009__li22951519113013: + + Locate the DataNode instance, click **More** > **Restart Instance** to restart it and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +#. .. _alm-14009__li4226377546: + + Select all NameNode instances, choose **More** > **Instance Rolling Restart** to restart them and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`16 `. + +#. .. _alm-14009__li992618717545: + + Select all NameNode instances, choose **More** > **Instance Rolling Restart** to restart them. Locate the DataNode instance, click **More** > **Restart Instance** to restart it and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Check the status of the network between the NameNode and the DataNode.** + +8. .. _alm-14009__li2034924617545: + + Log in to the faulty DataNode on the management page as user **root**, and run the **ping** *IP address of the NameNode* command to check whether the network between the DataNode and the NameNode is abnormal. + + On the FusionInsight Manager page, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance**. In the instance list, view the service plane IP address of the faulty DataNode. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`10 `. + +9. .. _alm-14009__li3193609617545: + + Rectify the network fault, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`10 `. + +**Check whether the DataNode is overloaded.** + +10. .. _alm-14009__li4029888217545: + + On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** and check whether the alarm **ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold** exists. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`13 `. + +11. .. _alm-14009__li3775267317545: + + See **ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold** to handle the alarm and check whether the alarm is cleared. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`13 `. + +12. .. _alm-14009__li4983258617545: + + Check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`13 `. + +**Check whether the NameNode is overloaded.** + +13. .. _alm-14009__li2641038017545: + + On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** and check whether the alarm **ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold** exists. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`16 `. + +14. .. _alm-14009__li1070095917545: + + See **ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold** to handle the alarm and check whether the alarm is cleared. + + - If yes, go to :ref:`15 `. + - If no, go to :ref:`16 `. + +15. .. _alm-14009__li5612534017545: + + Check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`16 `. + +**Collect fault information.** + +16. .. 
_alm-14009__li4607504917545: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +17. Select **HDFS** in the required cluster from the **Service**. + +18. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +19. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383964.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14010_nameservice_service_is_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14010_nameservice_service_is_abnormal.rst new file mode 100644 index 0000000..0f040b3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14010_nameservice_service_is_abnormal.rst @@ -0,0 +1,208 @@ +:original_name: ALM-14010.html + +.. _ALM-14010: + +ALM-14010 NameService Service Is Abnormal +========================================= + +Description +----------- + +The system checks the NameService service status every 180 seconds. This alarm is generated when the NameService service is unavailable. + +This alarm is cleared when the NameService service recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14010 Major Yes +======== ============== ========== + +Parameters +---------- + ++-----------------+-------------------------------------------------------------+ +| Name | Meaning | ++=================+=============================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| NameServiceName | Specifies the NameService for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ + +Impact on the System +-------------------- + +HDFS fails to provide services for upper-layer components based on the NameService service, such as HBase and MapReduce. As a result, users cannot read or write files. + +Possible Causes +--------------- + +- The KrbServer service is abnormal. +- The JournalNode is faulty. +- The DataNode is faulty. +- The disk capacity is insufficient. +- The NameNode enters safe mode. + +Procedure +--------- + +**Check the KrbServer service status.** + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services**. + +#. Check whether the KrbServer service exists. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`6 `. + +#. .. _alm-14010__li4671216717852: + + Click **KrbServer**. + +#. 
Click **Instances**. On the KrbServer management page, select the faulty instance, and choose **More** > **Restart Instance**. Check whether the instance successfully restarts. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`24 `. + +#. .. _alm-14010__li1076710217852: + + Choose **O&M** > **Alarm** > **Alarms** and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Check the JournalNode instance status.** + +6. .. _alm-14010__li2979505817852: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services**. + +7. Choose **HDFS** > **Instances**. + +8. Check whether the **Running Status** of the JournalNode is **Normal**. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`9 `. + +9. .. _alm-14010__li34233917852: + + Select the faulty JournalNode, and choose **More** > **Restart Instance**. Check whether the JournalNode successfully restarts. + + - If yes, go to :ref:`10 `. + - If no, go to :ref:`24 `. + +10. .. _alm-14010__li136606617852: + + Choose **O&M** > **Alarm** > **Alarms** and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Check the DataNode instance status.** + +11. .. _alm-14010__li1229459717852: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS**. + +12. Click **Instances** and check whether **Running Status** of all DataNodes is **Normal**. + + - If yes, go to :ref:`15 `. + - If no, go to :ref:`13 `. + +13. .. _alm-14010__li6039615117852: + + Click **Instances**. On the DataNode management page, select the faulty instance, and choose **More** > **Restart Instance**. Check whether the DataNode successfully restarts. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`15 `. + +14. .. _alm-14010__li2920958817852: + + Choose **O&M** > **Alarm** > **Alarms** and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`15 `. + +**Check disk status.** + +15. .. _alm-14010__li6155970417852: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Host**. + +16. In the **Disk** column, check whether the disk space is insufficient. + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`19 `. + +17. .. _alm-14010__li3082265617852: + + Expand the disk capacity. + +18. Choose **O&M** > **Alarm** > **Alarms** and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`19 `. + +**Check whether NameNode is in the safe mode.** + +19. .. _alm-14010__li6295063617852: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS**. Click **NameNode(Active)** of the abnormal NameService. The NameNode web UI is displayed. + + .. note:: + + By default, the admin user does not have the management rights of other components. If the page cannot be opened or the content is not completely displayed due to insufficient permission when you access the native page of a component, you can manually create a user with the management rights of the corresponding component to log in to the component. + +20. On the NameNode web UI, check whether "Safe mode is ON." is displayed. + + Information behind **Safe mode is ON** is alarm information and is displayed based actual conditions. + + - If yes, go to :ref:`21 `. + - If no, go to :ref:`24 `. + +21. .. _alm-14010__li5459096817852: + + Log in to the client as user **root**. 
Run the **cd** command to go to the client installation directory and run the **source bigdata_env** command. If the cluster uses the security mode, perform security authentication. Run the **kinit hdfs** command and enter the password as prompted. The password can be obtained from the MRS cluster administrator. If the cluster uses the non-security mode, log in as user **omm** and run the command. Ensure that user **omm** has the client execution permission. + +22. Run **hdfs dfsadmin -safemode leave**. + +23. Choose **O&M** > **Alarm** > **Alarms** and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`24 `. + +**Collect the fault information.** + +24. .. _alm-14010__li5097747017852: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +25. In the **Service** area, select the following nodes of the desired cluster. + + - ZooKeeper + - HDFS + +26. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +27. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895680.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14011_datanode_data_directory_is_not_configured_properly.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14011_datanode_data_directory_is_not_configured_properly.rst new file mode 100644 index 0000000..b8af616 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14011_datanode_data_directory_is_not_configured_properly.rst @@ -0,0 +1,191 @@ +:original_name: ALM-14011.html + +.. _ALM-14011: + +ALM-14011 DataNode Data Directory Is Not Configured Properly +============================================================ + +Description +----------- + +The DataNode parameter **dfs.datanode.data.dir** specifies DataNode data directories. This alarm is generated immediately when a configured data directory cannot be created, a data directory uses the same disk as other critical directories in the system, or multiple data directories use the same disk. + +This alarm is cleared when the DataNode data directory is configured properly and the DataNode for which the alarm is generated is restarted. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14011 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. 
+=========== ======================================================= + +Impact on the System +-------------------- + +If the DataNode data directory is mounted to the root directory or a critical directory, the disk space of the root directory or critical directory will be used up after long time running and the system will be faulty. + +If the DataNode data directory is not configured properly, HDFS performance will deteriorate. + +Possible Causes +--------------- + +- The DataNode data directory fails to be created. +- The DataNode data directory uses the same disk with critical directories, such as **/** or **/boot**. +- Multiple directories in the DataNode data directory use the same disk. + +Procedure +--------- + +**Check the alarm cause and information about the DataNode for which the alarm is generated.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms**. In the alarm list, click the alarm. +#. In **HostName** of **Location**, obtain the host name of the DataNode for which the alarm is generated. + +**Delete directories that do not comply with the disk plan from the DataNode data directory.** + +3. Choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance**. In the instance list, click the DataNode instance on the node for which the alarm is generated. + +4. Click **Instance Configurations** and view the value of the DataNode parameter **dfs.datanode.data.dir**. + +5. Check whether all DataNode data directories are consistent with the disk plan. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`9 `. + +6. .. _alm-14011__li2148997785657: + + Modify the DataNode parameter **dfs.datanode.data.dir** and delete the incorrect directories. + +7. Choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance** and restart the DataNode instance. + +8. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +9. .. _alm-14011__li6692198485657: + + Log in to the DataNode for which the alarm is generated as user **root**. + + - If the alarm cause is "The DataNode data directory fails to be created", go to :ref:`10 `. + - If the alarm cause is "The DataNode data directory uses the same disk with critical directories, such **/** or **/boot**", go to :ref:`17 `. + - If the alarm cause is "Multiple directories in the DataNode data directory uses the same disk", go to :ref:`21 `. + +**Check whether the DataNode data directory fails to be created.** + +10. .. _alm-14011__li6509165485657: + + Run the **su - omm** command to switch to user **omm**. + +11. Run the **ls** command to check whether the directories exist in the DataNode data directory. + + - If yes, go to :ref:`26 `. + - If no, go to :ref:`12 `. + +12. .. _alm-14011__li2608161785657: + + Run the **mkdir** *data directory* command to create the directory and check whether the directory can be successfully created. + + - If yes, go to :ref:`24 `. + - If no, go to :ref:`13 `. + +13. .. _alm-14011__li5784631085657: + + On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** to check whether alarm **ALM-12017 Insufficient Disk Capacity** exists. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`15 `. + +14. .. _alm-14011__li6502054785657: + + Adjust the disk capacity and check whether alarm **ALM-12017 Insufficient Disk Capacity** is cleared. For details, see **ALM-12017 Insufficient Disk Capacity**. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`15 `. + +15. .. 
_alm-14011__li3639665285657: + + Check whether user **omm** has the **rwx** or **x** permission on all the upper-layer directories of the data directory. (For example, for **/tmp/abc/**, user **omm** has the **x** permission for directory **tmp** and the **rwx** permission for directory **abc**.) + + - If yes, go to :ref:`24 `. + - If no, go to :ref:`16 `. + +16. .. _alm-14011__li6460099185657: + + Run the **chmod u+rwx** *path* or **chmod u+x** *path* command as user **root** to assign the **rwx** or **x** permission on these directories to user **omm**. Then go to :ref:`12 `. + +**Check whether the DataNode data directory uses the same disk as other critical directories in the system.** + +17. .. _alm-14011__li6529778285657: + + Run the **df** command to obtain the disk mounting information of each directory in the DataNode data directory. + +18. Check whether the directories mounted to the disk are critical directories, such as **/** or **/boot**. + + - If yes, go to :ref:`19 `. + - If no, go to :ref:`24 `. + +19. .. _alm-14011__li20309815202314: + + Change the value of the DataNode parameter **dfs.datanode.data.dir** and delete the directories that use the same disk as critical directories. + +20. Go to :ref:`24 `. + +**Check whether multiple directories in the DataNode data directory use the same disk.** + +21. .. _alm-14011__li3878673085657: + + Run the **df** command to obtain the disk mounting information of each directory in the DataNode data directory. Record the mounted directory in the command output. + +22. Modify the DataNode parameter **dfs.datanode.data.dir** to retain only one directory among the directories mounted to the same disk. + +23. Go to :ref:`24 `. + +**Restart the DataNode and check whether the alarm is cleared.** + +24. .. _alm-14011__li5208654485657: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance** and restart the DataNode instance. + +25. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`26 `. + +**Collect fault information.** + +26. .. _alm-14011__li3359588885657: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +27. Select **HDFS** in the required cluster from the **Service**. + +28. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +29. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383966.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14012_journalnode_is_out_of_synchronization.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14012_journalnode_is_out_of_synchronization.rst new file mode 100644 index 0000000..82a2cc3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14012_journalnode_is_out_of_synchronization.rst @@ -0,0 +1,150 @@ +:original_name: ALM-14012.html + +.. 
_ALM-14012: + +ALM-14012 JournalNode Is Out of Synchronization +=============================================== + +Description +----------- + +On the active NameNode, the system checks the data consistency of all JournalNodes in the cluster every 5 minutes. This alarm is generated when the data on a JournalNode is inconsistent with the data on the other JournalNodes. + +This alarm is cleared in 5 minutes after the data on JournalNodes is consistent. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14012 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-----------------+-------------------------------------------------------------+ +| Name | Meaning | ++=================+=============================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| NameServiceName | Specifies the NameService for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ + +Impact on the System +-------------------- + +When a JournalNode is working incorrectly, the data on the node becomes inconsistent with that on the other JournalNodes. If data on more than half of JournalNodes is inconsistent, the NameNode cannot work correctly, making the HDFS service unavailable. + +Possible Causes +--------------- + +- The JournalNode instance does not exist (deleted or migrated). +- The JournalNode instance has not been started or has been stopped. +- The JournalNode instance is working incorrectly. +- The network of the JournalNode is unreachable. + +Procedure +--------- + +**Check whether the JournalNode instance has been started up.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms**. In the alarm list, click the alarm. + +#. Check **Location** and obtain the IP address of the JournalNode for which the alarm is generated. + +#. Choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance**. In the instance list, check whether the JournalNode instance exists on the node for which the alarm is generated. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`4 `. + +#. .. _alm-14012__li184781433124215: + + Choose **O&M** > **Alarm** > **Alarms**. In the alarm list, click **Clear** in the **Operation** column of the alarm. In the dialog box that is displayed, click **OK**. No further action is needed. + +#. .. _alm-14012__li15222114643411: + + Click the JournalNode instance and check whether its **Configuration Status** is **Synchronized**. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`6 `. + +#. .. _alm-14012__li40718266973: + + Select the JournalNode instance and choose **Start Instance** to start the instance. + +#. After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`15 `. 
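+    If you want to confirm manually that the restarted JournalNode is catching up before the next check period, compare the newest edit segments across the JournalNode hosts. This is only a minimal sketch; the edits directory shown here is an assumption and is whatever **dfs.journalnode.edits.dir** points to in your cluster::
+
+       # Run on each JournalNode host; the newest edits_* segments should
+       # converge to the same transaction ranges on all JournalNodes.
+       cd <value of dfs.journalnode.edits.dir>/<NameService name>/current
+       ls -lt edits_* | head -3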
+ +**Check whether the JournalNode instance is working correctly.** + +8. .. _alm-14012__li28800798973: + + Check whether **Running Status** of the JournalNode instance is **Normal**. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`9 `. + +9. .. _alm-14012__li57816175973: + + Select the JournalNode instance and choose **More** > **Restart Instance** to start the instance. + +10. After 5 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`15 `. + +**Check whether the network of the JournalNode is reachable.** + +11. .. _alm-14012__li3722637973: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance** to check the service IP address of the active NameNode. + +12. Log in to the active NameNode as user **root**. + +13. Run the **ping** command to check whether a timeout occurs or the network is unreachable between the active NameNode and the JournalNode. + + **ping** *service IP address of the JournalNode* + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`15 `. + +14. .. _alm-14012__li25467007973: + + Contact the network administrator to rectify the network fault and check whether the alarm is cleared 5 minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`15 `. + +**Collect fault information.** + +15. .. _alm-14012__li43402084973: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +16. Select **HDFS** in the required cluster from the **Service**. + +17. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +18. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383967.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14013_failed_to_update_the_namenode_fsimage_file.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14013_failed_to_update_the_namenode_fsimage_file.rst new file mode 100644 index 0000000..0dac258 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14013_failed_to_update_the_namenode_fsimage_file.rst @@ -0,0 +1,228 @@ +:original_name: ALM-14013.html + +.. _ALM-14013: + +ALM-14013 Failed to Update the NameNode FsImage File +==================================================== + +Description +----------- + +HDFS metadata is stored in the FsImage file of the NameNode data directory, which is specified by the **dfs.namenode.name.dir** configuration item. The standby NameNode periodically combines existing FsImage files and Editlog files stored in the JournalNode to generate a new FsImage file, and then pushes the new FsImage file to the data directory of the active NameNode. This period is specified by the **dfs.namenode.checkpoint.period** configuration item of HDFS. The default value is 3600s, namely, one hour. If the FsImage file in the data directory of the active NameNode is not updated, the HDFS metadata combination function is abnormal and requires rectification. 
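+A quick way to see whether the metadata combination (checkpoint) mechanism is still working is to check the modification time of the newest FsImage file in the active NameNode's data directory. This is only a minimal sketch; the directory is whatever **dfs.namenode.name.dir** points to in your cluster, and the command is the same one used in the procedure below::
+
+   # Print the modification time of the newest FsImage file; it should be no
+   # older than roughly one checkpoint period (dfs.namenode.checkpoint.period).
+   cd <value of dfs.namenode.name.dir>/current
+   stat -c '%y %n' $(ls -t | grep "fsimage_[0-9]*$" | head -1)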
+ +On the active NameNode, the system checks the FsImage file information every five minutes. This alarm is generated when no FsImage file is generated within three combination periods. + +This alarm is cleared when a new FsImage file is generated and pushed to the active NameNode, which indicates that the HDFS metadata combination function can be properly used. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14013 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-----------------+-------------------------------------------------------------+ +| Name | Meaning | ++=================+=============================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ +| NameServiceName | Specifies the NameService for which the alarm is generated. | ++-----------------+-------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the FsImage file in the data directory of the active NameNode is not updated, the HDFS metadata combination function is abnormal and requires rectification. If it is not rectified, the Editlog files increase continuously after HDFS runs for a period. In this case, HDFS restart is time-consuming because a large number of Editlog files need to be loaded. In addition, this alarm also indicates that the standby NameNode is abnormal and the NameNode high availability (HA) mechanism becomes invalid. When the active NameNode is faulty, the HDFS service becomes unavailable. + +Possible Causes +--------------- + +- The standby NameNode is stopped. +- The standby NameNode instance is working incorrectly. +- The standby NameNode fails to generate a new FsImage file. +- Space of the data directory on the standby NameNode is insufficient. +- The standby NameNode fails to push the FsImage file to the active NameNode. +- Space of the data directory on the active NameNode is insufficient. + +Procedure +--------- + +**Check whether the standby NameNode is stopped.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms**. In the alarm list, click the alarm. + +#. View **Location** and obtain the host name of the active NameNode for which the alarm is generated and name of the NameService where the active NameNode resides. + +#. Choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance**, find the standby NameNode instance of the NameService in the instance list, and check whether its **Configuration Status** is **Synchronized**. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`4 `. + +#. .. _alm-14013__li6145479091333: + + Select the standby NameNode instance, choose **Start Instance**, and wait until the startup is complete. + +#. Wait for a NameNode metadata combination period and check whether the alarm is cleared. 
+ + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Check whether the NameNode instance is working correctly.** + +6. .. _alm-14013__li1445415191333: + + Check whether **Running Status** of the standby NameNode instance is **Normal**. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`7 `. + +7. .. _alm-14013__li98451291333: + + Select the standby NameNode instance, choose **More** > **Restart Instance**, and wait until the startup is complete. + +8. Wait for a NameNode metadata combination period and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`30 `. + +**Check whether** **the standby NameNode fails to generate a new FsImage file.** + +9. .. _alm-14013__li3200639691333: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Configurations** > **All** **Configurations**, and search and obtain the value of **dfs.namenode.checkpoint.period**. This value is the period of NameNode metadata combination. + +10. Choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance** and obtain the service IP addresses of the active and standby NameNodes of the NameService for which the alarm is generated. + +11. Click the **NameNode(**\ *xx*\ **,Standy)** and **Instance Configurations** to obtain the value of **dfs.namenode.name.dir**. This value is the FsImage storage directory of the standby NameNode. + +12. Log in to the standby NameNode as user **root** or **omm**. + +13. Go to the FsImage storage directory and check the generation time of the newest FsImage file. + + **cd** *Storage directory of the standby NameNode*\ **/current** + + **stat -c %y $(ls -t \| grep "fsimage_[0-9]*$" \| head -1)** + +14. Run the **date** command to obtain the current system time. + +15. Calculate the time difference between the generation time of the newest FsImage file and the current system time and check whether the time difference is greater than three times of the metadata combination period. + + - If yes, go to :ref:`16 `. + - If no, go to :ref:`20 `. + +16. .. _alm-14013__li4970764591333: + + The metadata combination function of the standby NameNode is faulty. Run the following command to check whether the fault is caused by insufficient storage space. + + Go to the FsImage storage directory and check the size of the newest FsImage file (in MB). + + **cd** *Storage directory of the standby NameNode*\ **/current** + + **du -m $(ls -t \| grep "fsimage_[0-9]*$" \| head -1) \| awk '{print $1}'** + +17. Run the following command to check the available disk space of the standby NameNode (in MB). + + df -m ./ \| awk 'END{print $4}' + +18. Compare the FsImage file size and the available disk space and determine whether another FsImage file can be stored on the disk. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`19 `. + +19. .. _alm-14013__li2030785791333: + + Clear the redundant files on the disk where the directory resides to reserve sufficient space for metadata. After the clearance, wait for a NameNode metadata combination period and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`20 `. + +**Check whether the standby NameNode fails to push the FsImage file to the active NameNode.** + +20. .. _alm-14013__li3432370991333: + + Log in to the standby NameNode as user **root**. + +21. Run the **su - omm** command to switch to user **omm**. + +22. 
Run the following command to check whether the standby NameNode can push the file to the active NameNode. + + **tmpFile=/tmp/tmp_test_$(date +%s)** + + **echo "test" > $tmpFile** + + **scp $tmpFile** *Service IP address of the active NameNode*\ **:/tmp** + + - If yes, go to :ref:`24 `. + - If no, go to :ref:`23 `. + +23. .. _alm-14013__li1279432791333: + + When the standby NameNode fails to push data to the active NameNode as user **omm**, contact the system administrator to handle the fault. Wait for a NameNode metadata combination period and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`24 `. + +**Check whether space on the data directory of the active NameNode is insufficient.** + +24. .. _alm-14013__li2740963991333: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance**, click the active NameNode of the NameService for which the alarm is generated, and then click **Instance Configurations** to obtain the value of **dfs.namenode.name.dir**. This value is the FsImage storage directory of the active NameNode. + +25. Log in to the active NameNode as user **root** or **omm**. + +26. Go to the FsImage storage directory and check the size of the newest FsImage file (in MB). + + **cd** *Storage directory of the active NameNode*\ **/current** + + **du -m $(ls -t \| grep "fsimage_[0-9]*$" \| head -1) \| awk '{print $1}'** + +27. Run the following command to check the available disk space of the active NameNode (in MB). + + df -m ./ \| awk 'END{print $4}' + +28. Compare the FsImage file size and the available disk space and determine whether another FsImage file can be stored on the disk. + + - If yes, go to :ref:`30 `. + - If no, go to :ref:`29 `. + +29. .. _alm-14013__li1860623691333: + + Clear the redundant files on the disk where the directory resides to reserve sufficient space for metadata. After the clearance, wait for a NameNode metadata combination period and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`30 `. + +**Collect fault information.** + +30. .. _alm-14013__li795604191333: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +31. Select **NameNode** in the required cluster from the **Service**. + +32. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +33. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383968.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14014_namenode_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14014_namenode_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..f000d82 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14014_namenode_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,109 @@ +:original_name: ALM-14014.html + +.. 
_ALM-14014: + +ALM-14014 NameNode GC Time Exceeds the Threshold +================================================ + +Description +----------- + +The system checks the garbage collection (GC) duration of the NameNode process every 60 seconds. This alarm is generated when the GC duration exceeds the threshold (12 seconds by default). + +This alarm is cleared when the GC duration is less than the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14014 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +A long GC duration of the NameNode process may interrupt the services. + +Possible Causes +--------------- + +The heap memory of the NameNode instance is overused or the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC duration.** + +#. On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms**. On the displayed interface, click the drop-down button of **ALM-14014 NameNode GC Time Exceeds the Threshold.** Then check the role name in **Location** and confirm the IP adress of the instance. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance** > **NameNode (IP address for which the alarm is generated)**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Garbage Collection**, and select **NameNode Garbage Collection (GC)** to check the GC duration statistics of the NameNode process collected every minute. + +#. Check whether the GC duration of the NameNode process collected every minute exceeds the threshold (12 seconds by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. 
_alm-14014__li5224893093121: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Configurations** > **All** **Configurations** > **NameNode** > **System** to increase the value of **GC_OPTS** parameter as required. + + .. note:: + + The recommended mapping between the number of HDFS file objects (filesystem objects = files + blocks) and the JVM parameters configured for NameNode is as follows: + + - If the number of file objects reaches 10,000,000, you are advised to set the JVM parameters as follows: -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M + - If the number of file objects reaches 20,000,000, you are advised to set the JVM parameters as follows: -Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G + - If the number of file objects reaches 50,000,000, you are advised to set the JVM parameters as follows: -Xms32G -Xmx32G -XX:NewSize=3G -XX:MaxNewSize=3G + - If the number of file objects reaches 100,000,000, you are advised to set the JVM parameters as follows: -Xms64G -Xmx64G -XX:NewSize=6G -XX:MaxNewSize=6G + - If the number of file objects reaches 200,000,000, you are advised to set the JVM parameters as follows: -Xms96G -Xmx96G -XX:NewSize=9G -XX:MaxNewSize=9G + - If the number of file objects reaches 300,000,000, you are advised to set the JVM parameters as follows: -Xms164G -Xmx164G -XX:NewSize=12G -XX:MaxNewSize=12G + +#. Save the configuration and restart the NameNode instance. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-14014__li1978585593121: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **NameNode** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383969.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14015_datanode_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14015_datanode_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..2eaaf03 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14015_datanode_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,105 @@ +:original_name: ALM-14015.html + +.. _ALM-14015: + +ALM-14015 DataNode GC Time Exceeds the Threshold +================================================ + +Description +----------- + +The system checks the garbage collection (GC) duration of the DataNode process every 60 seconds. This alarm is generated when the GC duration exceeds the threshold (12 seconds by default). + +This alarm is cleared when the GC duration is less than the threshold. 
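+
+A quick way to see the raw numbers behind this check is to read the DataNode JVM metrics directly. The sketch below is illustrative only: it assumes the DataNode web UI is reachable over plain HTTP (9864 is the open-source Hadoop 3 default port; substitute the port used in your cluster) and that no extra authentication is enforced on the JMX servlet.
+
+.. code-block:: bash
+
+   # Cumulative GC count and GC time (ms) since the DataNode process started.
+   # Sample the values twice, about 60 seconds apart, and compare GcTimeMillis
+   # to estimate the per-minute GC duration that this alarm evaluates.
+   curl -s "http://<DataNode service IP address>:9864/jmx?qry=Hadoop:service=DataNode,name=JvmMetrics" \
+     | grep -E '"GcCount"|"GcTimeMillis"'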
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14015 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +A long GC duration of the DataNode process may interrupt the services. + +Possible Causes +--------------- + +The heap memory of the DataNode instance is overused or the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC duration.** + +#. On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms**. On the displayed interface, click the drop-down button of **ALM-14015 DataNode GC Time Exceeds the Threshold**. Then check the role name in **Location** and confirm the IP adress of the instance. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance** > **DataNode (IP address for which the alarm is generated)**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Garbage Collection**, and select **DataNode Garbage Collection (GC)** to check the GC duration statistics of the DataNode process collected every minute. + +#. Check whether the GC duration of the DataNode process collected every minute exceeds the threshold (12 seconds by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-14015__li1285468393538: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Configurations** > **All** **Configurations** > **DataNode** > **System** to increase the value of **GC_OPTS** parameter as required. + + .. 
note:: + + The mapping between the average number of blocks of a DataNode instance and the DataNode memory is as follows: + + - If the average number of blocks of a DataNode instance reaches 2,000,000, the reference values of the JVM parameters of the DataNode are as follows: -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M + - If the average number of blocks of a DataNode instance reaches 5,000,000, the reference values of the JVM parameters of the DataNode are as follows: -Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G + +#. Save the configuration and restart the DataNode instance. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-14015__li5362621093538: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **DataNode** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269383970.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14016_datanode_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14016_datanode_direct_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..99acc99 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14016_datanode_direct_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,133 @@ +:original_name: ALM-14016.html + +.. _ALM-14016: + +ALM-14016 DataNode Direct Memory Usage Exceeds the Threshold +============================================================ + +Description +----------- + +The system checks the direct memory usage of HDFS every 30 seconds. This alarm is generated when the direct memory usage of DataNode instances exceeds the threshold (90% of the maximum memory). + +This alarm is automatically cleared when the direct memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14016 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. 
| ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available direct memory of DataNode instances is insufficient, a memory overflow may occur and the service breaks down. + +Possible Causes +--------------- + +The direct memory of DataNode instances is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +**Check the direct memory usage.** + +#. On the **Home** page of FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. On the page that is displayed, click the drop-down list in the row containing **ALM-14016 DataNode Direct Memory Usage Exceeds the Threshold**, and view the role name and IP address of the instance for which the alarm is generated in the **Location** area. + +#. On the **Home** page of FusionInsight Manager, choose **Cluster** > **Services** > **HDFS**. On the page that is displayed, click the **Instance** tab. In the instance list, select **DataNode** (IP address of the instance for which this alarm is generated). Click the drop-down list in the upper right corner of the chart, choose **Customize** > **Resource**, and select **DataNode Memory** to check the direct memory usage. + +#. Check whether the used direct memory of a DataNode instance reaches 90% (default threshold) of the maximum direct memory allocated to it. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`8 `. + +#. .. _alm-14016__li3399087993722: + + On the **Home** page of FusionInsight Manager, choose **Cluster** > **Services** > **HDFS**. On the page that is displayed, click the **Configuration** tab then the **All Configurations** sub-tab, and select **DataNode** > **System**. Check whether **-XX:MaxDirectMemorySize** exists in the **GC_OPTS** parameter. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-14016__li062164310159: + + Adjust the value of **-XX:MaxDirectMemorySize**. + + a. In **GC_OPTS**, check the value of **-Xmx** and check whether the node memory is sufficient. + + .. note:: + + You can determine whether the node memory is sufficient based on the actual environment. For example, you can use the following method: + + Use the IP address to log in to the instance for which the alarm is generated as user **root** and run the **free -g** command to check the value of **Mem** in the **free** column. The value indicates the available memory of the node. In the following example, the available memory of the node is 4 GB. + + .. code-block:: + + total used free shared buff/cache available + Mem: 112 48 4 10 58 46 + ...... + + If the value of **Mem** is at least that of **-Xmx**, the node memory is sufficient. If the value of **Mem** is less than that of **-Xmx**, the node memory is insufficient. + + - If yes, change the value of **-XX:MaxDirectMemorySize** to that of **-Xmx**. + - If no, increase **-XX:MaxDirectMemorySize** to a value no larger than that of **Mem**. + + b. Save the configuration and restart the DataNode instances. + +#. .. _alm-14016__li111010376180: + + Check whether **ALM-14008 DataNode Heap Memory Usage Exceeds the Threshold** exists. + + - If yes, rectify the fault by referring to **ALM-14008 DataNode Heap Memory Usage Exceeds the Threshold**. + - If no, go to :ref:`7 `. + +#. .. _alm-14016__li5868287393722: + + Check whether the alarm is cleared. 
+ + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Collect the fault information.** + +8. .. _alm-14016__li5838381193722: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +9. Expand the **Service** drop-down list, and select **DataNode** for the target cluster. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895680.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14017_namenode_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14017_namenode_direct_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..b84b0c2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14017_namenode_direct_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,112 @@ +:original_name: ALM-14017.html + +.. _ALM-14017: + +ALM-14017 NameNode Direct Memory Usage Exceeds the Threshold +============================================================ + +Description +----------- + +The system checks the direct memory usage of the HDFS service every 30 seconds. This alarm is generated when the direct memory usage of a NameNode instance exceeds the threshold (90% of the maximum memory). + +The alarm is cleared when the direct memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14017 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. 
If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available direct memory of the HDFS service is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct memory of the NameNode instance is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +**Check the direct memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms.** On the displayed interface, click the drop-down button of **ALM-14017 NameNode Direct Memory Usage Exceeds the Threshold**. Then check the role name in **Location** and confirm the IP adress of the instance. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Instance** > **NameNode (IP address for which the alarm is generated)**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Resource**, and select **NameNode Memory** to check the direct memory usage. + +#. Check whether the used direct memory of NameNode reaches 90% of the maximum direct memory specified for NameNode by default. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`8 `. + +#. .. _alm-14017__li5299688794211: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Configurations** > **All** **Configurations** > **NameNode** > **System** to check whether "-XX:MaxDirectMemorySize" exists in the **GC_OPTS** parameter. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-14017__li817315147319: + + In the **GC_OPTS** parameter, delete "-XX:MaxDirectMemorySize". Save the configuration and restart the NameNode instance. + +#. .. _alm-14017__li16393123713315: + + Check whether the **ALM-14007 NameNode Heap Memory Usage Exceeds the Threshold** exists. + + - If yes, handle the alarm by referring to **ALM-14007 NameNode Heap Memory Usage Exceeds the Threshold**. + - If no, go to :ref:`7 `. + +#. .. _alm-14017__li812407194211: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Collect fault information.** + +8. .. _alm-14017__li1686819594211: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +9. Select **NameNode** in the required cluster from the **Service**. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269383972.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14018_namenode_non-heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14018_namenode_non-heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..b0b8f71 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14018_namenode_non-heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,145 @@ +:original_name: ALM-14018.html + +.. _ALM-14018: + +ALM-14018 NameNode Non-heap Memory Usage Exceeds the Threshold +============================================================== + +Description +----------- + +The system checks the non-heap memory usage of the HDFS NameNode every 30 seconds and compares the actual usage with the threshold. The non-heap memory usage of the HDFS NameNode has a default threshold. This alarm is generated when the non-heap memory usage of the HDFS NameNode exceeds the threshold. + +Users can choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* > **HDFS** to change the threshold. + +This alarm is cleared when the no-heap memory usage of the HDFS NameNode is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14018 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the memory usage of the HDFS NameNode is too high, data read/write performance of HDFS will be affected. + +Possible Causes +--------------- + +Non-heap memory of the HDFS NameNode is insufficient. + +Procedure +--------- + +**Delete unnecessary files.** + +#. Log in to the HDFS client as user **root**. 
Run the **cd** command to go to the client installation directory, and run the **source bigdata_env** command. + + If the cluster adopts the security mode, perform security authentication. + + Run the **kinit hdfs** command and enter the password as prompted. Obtain the password from the administrator. + +#. Run the **hdfs dfs -rm -r** *file or directory path* command to delete unnecessary files. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check the NameNode JVM non-heap memory usage and configuration.** + +4. .. _alm-14018__li6485487494622: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS**. The HDFS status page is displayed. + +5. In the **Basic Information** area, click **NameNode(Active)**. The HDFS WebUI is displayed. + + .. note:: + + By default, the **admin** user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component. + +6. .. _alm-14018__li3062349394622: + + On the HDFS WebUI, click the **Overview** tab. In **Summary**, check the numbers of files, directories, and blocks in HDFS. + +7. .. _alm-14018__li4379666894622: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Configurations** > **All** **Configurations**. In **Search**, enter **GC_OPTS** to check the **GC_OPTS** non-heap memory parameter of **HDFS->NameNode**. + +**Adjust system configurations.** + +8. Check whether the non-heap memory is properly configured based on the number of file objects in :ref:`6 ` and the non-heap parameters configured for NameNode in :ref:`7 `. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`12 `. + + .. note:: + + The recommended mapping between the number of HDFS file objects (filesystem objects = files + blocks) and the JVM parameters configured for NameNode is as follows: + + - If the number of file objects reaches 10,000,000, you are advised to set the JVM parameters as follows: -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M + - If the number of file objects reaches 20,000,000, you are advised to set the JVM parameters as follows: -Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G + - If the number of file objects reaches 50,000,000, you are advised to set the JVM parameters as follows: -Xms32G -Xmx32G -XX:NewSize=3G -XX:MaxNewSize=3G + - If the number of file objects reaches 100,000,000, you are advised to set the JVM parameters as follows: -Xms64G -Xmx64G -XX:NewSize=6G -XX:MaxNewSize=6G + - If the number of file objects reaches 200,000,000, you are advised to set the JVM parameters as follows: -Xms96G -Xmx96G -XX:NewSize=9G -XX:MaxNewSize=9G + - If the number of file objects reaches 300,000,000, you are advised to set the JVM parameters as follows: -Xms164G -Xmx164G -XX:NewSize=12G -XX:MaxNewSize=12G + +9. .. _alm-14018__li1465143294622: + + Modify the **GC_OPTS** parameter of the NameNode based on the mapping between the number of file objects and non-heap memory. + +10. Save the configuration and click **Dashboard** > **More** > **Restart Service**. + +11. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`12 `. + +**Collect fault information.** + +12. .. 
_alm-14018__li6570062694622: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +13. Select the following services in the required cluster from the **Service**. + + - ZooKeeper + - HDFS + +14. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +15. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417342.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14019_datanode_non-heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14019_datanode_non-heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..7e22851 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14019_datanode_non-heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,141 @@ +:original_name: ALM-14019.html + +.. _ALM-14019: + +ALM-14019 DataNode Non-heap Memory Usage Exceeds the Threshold +============================================================== + +Description +----------- + +The system checks the non-heap memory usage of the HDFS DataNode every 30 seconds and compares the actual usage with the threshold. The non-heap memory usage of the HDFS DataNode has a default threshold. This alarm is generated when the non-heap memory usage of the HDFS DataNode exceeds the threshold. + +Users can choose **O&M > Alarm > Thresholds>** *Name of the desired cluster* **>** **HDFS** to change the threshold. + +This alarm is cleared when the no-heap memory usage of the HDFS DataNode is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14019 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the memory usage of the HDFS DataNode is too high, data read/write performance of HDFS will be affected. + +Possible Causes +--------------- + +Non-heap memory of the HDFS DataNode is insufficient. + +Procedure +--------- + +**Delete unnecessary files.** + +#. Log in to the HDFS client as user **root**. Run the **cd** command to go to the client installation directory, and run the **source bigdata_env** command. + + If the cluster adopts the security mode, perform security authentication. + + Run the **kinit hdfs** command and enter the password as prompted. Obtain the password from the administrator. + +#. Run the **hdfs dfs -rm -r** *file or directory path* command to delete unnecessary files. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check the DataNode JVM non-heap memory usage and configuration.** + +4. .. _alm-14019__li4596028395356: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS**. + +5. In the **Basic Information** area, click **NameNode(Active)**. The HDFS WebUI is displayed. + + .. note:: + + By default, the **admin** user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component. + +6. .. _alm-14019__li3578073195356: + + On the HDFS WebUI, click the **Datanodes** tab to view the number of blocks of all DataNodes that report alarms. + +7. .. _alm-14019__li4116310695356: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Configurations** > **All** **Configurations**. In **Search**, enter **GC_OPTS** to check the **GC_OPTS** non-heap memory parameter of **HDFS->DataNode**. + +**Adjust system configurations.** + +8. Check whether the memory is properly configured based on the number of blocks in :ref:`6 ` and the memory parameters configured for DataNode in :ref:`7 `. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`12 `. + + .. note:: + + The mapping between the average number of blocks of a DataNode instance and the DataNode memory is as follows: + + - If the average number of blocks of a DataNode instance reaches 2,000,000, the reference values of the JVM parameters of the DataNode are as follows: -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M + - If the average number of blocks of a DataNode instance reaches 5,000,000, the reference values of the JVM parameters of the DataNode are as follows: -Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G + +9. .. _alm-14019__li3591807095356: + + Modify the **GC_OPTS** parameter of the DataNode based on the mapping between the number of blocks and memory. + +10. Save the configuration and click **Dashboard** > **More** > **Restart Service**. + +11. Check whether the alarm is cleared. 
+ + - If yes, no further action is required. + - If no, go to :ref:`12 `. + +**Collect fault information.** + +12. .. _alm-14019__li6341663395356: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +13. Select the following services in the required cluster from the **Service** drop-down list. + + - ZooKeeper + - HDFS + +14. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +15. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417343.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14020_number_of_entries_in_the_hdfs_directory_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14020_number_of_entries_in_the_hdfs_directory_exceeds_the_threshold.rst new file mode 100644 index 0000000..4aa0768 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14020_number_of_entries_in_the_hdfs_directory_exceeds_the_threshold.rst @@ -0,0 +1,134 @@ +:original_name: ALM-14020.html + +.. _ALM-14020: + +ALM-14020 Number of Entries in the HDFS Directory Exceeds the Threshold +======================================================================= + +Description +----------- + +The system obtains the number of subfiles and subdirectories in a specified directory every hour and checks whether it reaches the alarm threshold, which is a percentage (**90%** by default) of the maximum number of subfiles and subdirectories allowed in an HDFS directory. If the number exceeds this percentage, an alarm is triggered. + +When the number of subfiles and subdirectories in the directory for which the alarm is generated falls below the percentage threshold, the alarm is automatically cleared. When the monitoring switch is disabled, alarms corresponding to all directories are cleared. If a directory is removed from the monitoring list, alarms corresponding to the directory are cleared. + +.. note:: + + - The **dfs.namenode.fs-limits.max-directory-items** parameter specifies the maximum number of subfiles and subdirectories in the HDFS directory. Its default value is **1048576**. If the number of subfiles and subdirectories in a directory exceeds the parameter value, subfiles and subdirectories cannot be created in the directory. + - The **dfs.namenode.directory-items.monitor** parameter specifies the list of directories to be monitored. Its default value is **/tmp,/SparkJobHistory,/mr-history**. + - The **dfs.namenode.directory-items.monitor.enabled** parameter is used to enable or disable the monitoring switch. Its default value is **true**, which means the monitoring switch is enabled by default. 
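+
+To judge how close a monitored directory is to the limit, you can compare its number of direct entries with the configured maximum. The following sketch is illustrative only: it assumes an installed HDFS client whose environment variables have been sourced (**source bigdata_env**), a valid ticket obtained with **kinit** in a security-mode cluster, and uses **/tmp** purely as an example directory.
+
+.. code-block:: bash
+
+   # Maximum number of entries allowed in one HDFS directory, as seen by the
+   # client configuration.
+   limit=$(hdfs getconf -confKey dfs.namenode.fs-limits.max-directory-items)
+   # "hdfs dfs -ls" prints one line per direct child after the "Found N items" header.
+   items=$(hdfs dfs -ls /tmp | tail -n +2 | wc -l)
+   echo "/tmp holds ${items} of ${limit} allowed entries"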
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14020 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| NameServiceName | Specifies the NameService service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Directory | Specifies the directory for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the number of entries in the monitored directory exceeds 90% of the threshold, an alarm is triggered, but entries can be added to the directory. Once the maximum threshold is exceeded, entries will fail to be added to the directory. + +Possible Causes +--------------- + +The number of entries in the monitored directory exceeds 90% of the threshold. + +Procedure +--------- + +**Check whether unnecessary files exist in the system.** + +#. Log in to the HDFS client as user **root**. Run the **cd** command to go to the client installation directory, and run the **source bigdata_env** command to set the environment variables. + + If the cluster is in security mode, security authentication is required. + + Run the **kinit hdfs** command and enter the password as prompted. Obtain the password from the administrator. + +#. Run the following command to check whether files and directories in the directory with the alarm can be deleted: + + **hdfs dfs -ls** *Directory with the alarm* + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-14020__li352493139594: + + Run the following command to delete unnecessary files. + + **hdfs dfs -rm -r -f** *File or directory path* + + .. note:: + + Deleting a file or folder is a high-risk operation. Ensure that the file or folder is no longer required before performing this operation. + +#. Wait 1 hour and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. 
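+
+Taken together, the preceding steps amount to a short client-side session. The following sketch is hypothetical: **/opt/client** stands for the actual client installation directory, **/tmp** for the directory named in the alarm, and the deleted path is only a placeholder. Deletion is irreversible, so confirm that the data is no longer needed first.
+
+.. code-block:: bash
+
+   cd /opt/client                    # client installation directory (example)
+   source bigdata_env                # load the client environment variables
+   kinit hdfs                        # only required in a security-mode cluster
+   hdfs dfs -ls /tmp                 # review the entries in the directory named in the alarm
+   hdfs dfs -rm -r -f /tmp/stale_job_output   # hypothetical path confirmed to be unnecessary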
+ +**Check whether the threshold is correctly configured.** + +5. .. _alm-14020__li564838279594: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Configurations** > **All** **Configurations**. Search for the **dfs.namenode.fs-limits.max-directory-items** parameter and check whether the parameter value is appropriate. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`6 `. + +6. .. _alm-14020__li152448229594: + + Increase the parameter value. + +7. Save the configuration and click **Dashboard** > **More** > **Restart Service**. + +8. Wait 1 hour and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-14020__li615368649594: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +10. Select **HDFS** in the required cluster from the **Service**. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417344.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14021_namenode_average_rpc_processing_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14021_namenode_average_rpc_processing_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..95e06b1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14021_namenode_average_rpc_processing_time_exceeds_the_threshold.rst @@ -0,0 +1,165 @@ +:original_name: ALM-14021.html + +.. _ALM-14021: + +ALM-14021 NameNode Average RPC Processing Time Exceeds the Threshold +==================================================================== + +Description +----------- + +The system checks the average RPC processing time of NameNode every 30 seconds, and compares the actual average RPC processing time with the threshold (default value: 100 ms). This alarm is generated when the system detects that the average RPC processing time exceeds the threshold for several consecutive times (10 times by default). + +You can choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* > **HDFS** to change the threshold. + +When the **Trigger Count** is 1, this alarm is cleared when the average RPC processing time of NameNode is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the average RPC processing time of NameNode is less than or equal to 90% of the threshold. 
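+
+The averages evaluated by this check can also be read from the NameNode RPC metrics. The following sketch is illustrative only: it assumes the active NameNode web UI is reachable over plain HTTP on port 9870 and that the NameNode RPC port is 8020 (open-source Hadoop 3 defaults; substitute the ports used in your cluster).
+
+.. code-block:: bash
+
+   # Average RPC processing and queuing time (ms) over the current metrics window;
+   # RpcProcessingTimeAvgTime roughly corresponds to the value this alarm evaluates.
+   curl -s "http://<NameNode service IP address>:9870/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020" \
+     | grep -E '"RpcProcessingTimeAvgTime"|"RpcQueueTimeAvgTime"'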
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14021 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| NameServiceName | Specifies the NameService service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +NameNode cannot process the RPC requests from HDFS clients, upper-layer services that depend on HDFS, and DataNode in a timely manner. Specifically, the services that access HDFS run slowly or the HDFS service is unavailable. + +Possible Causes +--------------- + +- The CPU performance of NameNode nodes is insufficient and therefore NameNode nodes cannot process messages in a timely manner. +- The configured NameNode memory is too small and frame freezing occurs on the JVM due to frequent full garbage collection. + +- NameNode parameters are not configured properly, so NameNode cannot make full use of system performance. + +Procedure +--------- + +**Obtain alarm information.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms**. In the alarm list, click the alarm. +#. Check the alarm. Obtain the host name of the NameNode node involved in this alarm from the **HostName** information of **Location**. Then obtain the name of the NameService node involved in this alarm from the **NameServiceName** information of **Location**. + +**Check whether the threshold is too small.** + +3. Check the status of the services that depend on HDFS. Check whether the services run slowly or task execution times out. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`4 `. + +4. .. _alm-14021__li29484482172449: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS**. 
Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **RPC**, and select **Average Time of Active NameNode RPC Processing** and click **OK**. + +5. On the **Average Time of Active NameNode RPC Processing** monitoring page, obtain the value of the NameService node involved in this alarm. + +6. On the FusionInsight Manager portal, choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* **>** **HDFS**. Locate **Average Time of Active NameNode RPC Processing** and click the **Modify** in the **Operation** column of the default rule. The **Modify Rule** page is displayed. Change **Threshold** to 150% of the peak value within one day before and after the alarm is generated. Click **OK** to save the new threshold. + +7. Wait for 5 minutes and then check whether the alarm is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Check whether the CPU performance of the NameNode node is sufficient.** + +8. .. _alm-14021__li48203297172449: + + On the FusionInsight Manager portal, click **O&M > Alarm >Alarms** and check whether **ALM-12016 CPU Usage Exceeds the Threshold** is generated for the NameNode node. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`11 `. + +9. .. _alm-14021__li23155373172449: + + Handle **ALM-12016 CPU Usage Exceeds the Threshold** by taking recommended actions. + +10. Wait for 10 minutes and check whether alarm 14021 is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Check whether the memory of the NameNode node is too small.** + +11. .. _alm-14021__li29576569172449: + + On the FusionInsight Manager portal, click **O&M > Alarm >Alarms** and check whether **ALM-14007 HDFS NameNode Heap Memory Usage Exceeds the Threshold** is generated for the NameNode node. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`14 `. + +12. .. _alm-14021__li26363673172449: + + Handle **ALM-14007 HDFS NameNode Heap Memory Usage Exceeds the Threshold** by taking recommended actions. + +13. Wait for 10 minutes and check whether alarm 14021 is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`14 `. + +**Check whether NameNode parameters are configured properly.** + +14. .. _alm-14021__li41096175172449: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Configurations** > **All** **Configurations**. Search for parameter **dfs.namenode.handler.count** and view its value. If the value is less than or equal to 128, change it to **128**. If the value is greater than 128 but less than 192, change it to **192**. + +15. Search for parameter **ipc.server.read.threadpool.size** and view its value. If the value is less than 5, change it to **5**. + +16. Click **Save** and click **OK**. + +17. On the **Instance** page of HDFS, select the standby NameNode of NameService involved in this alarm and choose **More** > **Restart Instance**. Enter the password and click **OK**. Wait until the standby NameNode is started up. + +18. On the **Instance** page of HDFS, select the active NameNode of NameService involved in this alarm and choose **More** > **Restart Instance**. Enter the password and click **OK**. Wait until the active NameNode is started up. + +19. Wait for 1 hour and then check whether the alarm is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`20 `. + +**Collect fault information.** + +20. .. 
_alm-14021__li59520454172449: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +21. Select the following node in the required cluster from the **Service**. + + - HDFS + +22. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +23. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417365.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14022_namenode_average_rpc_queuing_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14022_namenode_average_rpc_queuing_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..3a20569 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14022_namenode_average_rpc_queuing_time_exceeds_the_threshold.rst @@ -0,0 +1,190 @@ +:original_name: ALM-14022.html + +.. _ALM-14022: + +ALM-14022 NameNode Average RPC Queuing Time Exceeds the Threshold +================================================================= + +Description +----------- + +The system checks the average RPC queuing time of NameNode every 30 seconds, and compares the actual average RPC queuing time with the threshold (default value: 200 ms). This alarm is generated when the system detects that the average RPC queuing time exceeds the threshold for several consecutive times (10 times by default). + +You can choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* > **HDFS** to change the threshold. + +When the **Trigger Count** is 1, this alarm is cleared when the average RPC queuing time of NameNode is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the average RPC queuing time of NameNode is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14022 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| NameServiceName | Specifies the NameService service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +NameNode cannot process the RPC requests from HDFS clients, upper-layer services that depend on HDFS, and DataNode in a timely manner. Specifically, the services that access HDFS run slowly or the HDFS service is unavailable. + +Possible Causes +--------------- + +- The CPU performance of NameNode nodes is insufficient and therefore NameNode nodes cannot process messages in a timely manner. +- The configured NameNode memory is too small and frame freezing occurs on the JVM due to frequent full garbage collection. +- NameNode parameters are not configured properly, so NameNode cannot make full use of system performance. +- The volume of services that access HDFS is too large and therefore NameNode is overloaded. + +Procedure +--------- + +**Obtain alarm information.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms**. In the alarm list, click the alarm. +#. Check the alarm. Obtain the alarm generation time from **Generated**. Obtain the host name of the NameNode node involved in this alarm from the **HostName** information of **Location**. Then obtain the name of the NameService node involved in this alarm from the **NameServiceName** information of **Location**. + +**Check whether the threshold is too small.** + +3. Check the status of the services that depend on HDFS. Check whether the services run slowly or task execution times out. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`4 `. + +4. .. _alm-14022__li4999873217318: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **RPC**, and select **Average Time of Active NameNode RPC Queuing** and click **OK**. + +5. On the **Average Time of Active NameNode RPC Queuing** monitoring page, obtain the value of the NameService node involved in this alarm. + +6. On the FusionInsight Manager portal, choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* **>** **HDFS**. Locate **Average Time of Active NameNode RPC Queuing** and click the **Modify** in the **Operation** column of the default rule. The **Modify Rule** page is displayed. Change **Threshold** to 150% of the monitored value. Click **OK** to save the new threshold. + +7. Wait for 1 minute and then check whether the alarm is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + + **Check whether the CPU performance of the NameNode node is sufficient.** + +8. .. 
_alm-14022__li6328681517318: + + On the FusionInsight Manager portal, click **O&M > Alarm > Alarms** and check whether **ALM-12016 CPU Usage Exceeds the Threshold** is generated. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`11 `. + +9. .. _alm-14022__li922016517318: + + Handle **ALM-12016 CPU Usage Exceeds the Threshold** by taking the recommended actions. + +10. Wait for 10 minutes and check whether the ALM-14022 alarm is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Check whether the memory of the NameNode node is too small.** + +11. .. _alm-14022__li3577444117318: + + On the FusionInsight Manager portal, click **O&M > Alarm > Alarms** and check whether **ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold** is generated. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`14 `. + +12. .. _alm-14022__li5900064917318: + + Handle **ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold** by taking the recommended actions. + +13. Wait for 10 minutes and check whether the ALM-14022 alarm is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`14 `. + +**Check whether NameNode parameters are configured properly.** + +14. .. _alm-14022__li2539715217318: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS** > **Configurations** > **All Configurations**. Search for the parameter **dfs.namenode.handler.count** and view its value. If the value is less than or equal to 128, change it to **128**. If the value is greater than 128 but less than 192, change it to **192**. + +15. Search for the parameter **ipc.server.read.threadpool.size** and view its value. If the value is less than 5, change it to **5**. + +16. Click **Save**, and click **OK**. + +17. On the **Instance** page of HDFS, select the standby NameNode of the NameService involved in this alarm and choose **More** > **Restart Instance**. Enter the password and click **OK**. Wait until the standby NameNode is started up. + +18. On the **Instance** page of HDFS, select the active NameNode of the NameService involved in this alarm and choose **More** > **Restart Instance**. Enter the password and click **OK**. Wait until the active NameNode is started up. + +19. Wait for 1 hour and then check whether the alarm is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`20 `. + +**Check whether the HDFS workload changes and reduce the workload properly.** + +20. .. _alm-14022__li2529838417318: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **HDFS**. Click the drop-down menu in the upper right corner of **Chart**, click **Customize**, select **Average Time of Active NameNode RPC Queuing**, and click **OK**. + +21. Click |image1|. The **Details** page is displayed. + +22. Set the monitoring data display period, from 5 days before the alarm generation time to the alarm generation time. Click **OK**. + +23. On the **Average Time of Active NameNode RPC Queuing** monitoring page, check whether there is a point in time at which the queuing time increases abruptly. + + - If yes, go to :ref:`24 `. + - If no, go to :ref:`27 `. + +24. .. _alm-14022__li6583884617318: + + Check the services running at that point in time. Check whether a new task frequently accesses HDFS and whether its access frequency can be reduced. + +25. If a Balancer task started at that point in time, stop the task or specify a node for the task to reduce the HDFS workload. + +26.
Wait for 1 hour and then check whether the alarm is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`27 `. + +**Collect fault information.** + +27. .. _alm-14022__li4075154117318: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +28. Select **HDFS** in the required cluster from the **Service**. + +29. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +30. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417366.png +.. |image2| image:: /_static/images/en-us_image_0269417367.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14023_percentage_of_total_reserved_disk_space_for_replicas_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14023_percentage_of_total_reserved_disk_space_for_replicas_exceeds_the_threshold.rst new file mode 100644 index 0000000..5bcb99f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14023_percentage_of_total_reserved_disk_space_for_replicas_exceeds_the_threshold.rst @@ -0,0 +1,128 @@ +:original_name: ALM-14023.html + +.. _ALM-14023: + +ALM-14023 Percentage of Total Reserved Disk Space for Replicas Exceeds the Threshold +==================================================================================== + +Description +----------- + +The system checks the percentage of total reserved disk space for replicas (Total reserved disk space for replicas/(Total reserved disk space for replicas + Total remaining disk space)) every 30 seconds and compares the actual percentage with the threshold (**90%** by default). This alarm is generated when the percentage of total reserved disk space for replicas exceeds the threshold for multiple consecutive times (**Trigger Count**). + +The alarm is cleared in the following two scenarios: The value of **Trigger Count** is **1** and the percentage of total reserved disk space for replicas is less than or equal to the threshold; the value of **Trigger Count** is greater than **1** and the percentage of total reserved disk space for replicas is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14023 Minor Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| NameServiceName | Specifies the NameService service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The performance of writing data to HDFS is affected. If all remaining DataNode space is reserved for replicas, writing HDFS data fails. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The disk space configured for the HDFS cluster is insufficient. +- The volume of services that access HDFS is too large and therefore DataNode is overloaded. + +Procedure +--------- + +**Check whether the alarm threshold is appropriate.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* > **HDFS** > **Disk** > **Percentage of Reserved Space for Replicas of Unused Space** to check whether the alarm threshold is appropriate. (The default threshold is **90%**. Users can change it as required.) + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 `. + +#. .. _alm-14023__li44798865102848: + + Choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* > **HDFS** > **Disk** > **Percentage of Reserved Space for Replicas of Unused Space** and Click **Modify,** change the threshold based on the actual usage. + +#. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether an alarm indicating insufficient disk space is generated.** + +4. .. _alm-14023__li13034211102848: + + On the FusionInsight Manager portal, check whether **ALM-14001 HDFS Disk Usage Exceeds the Threshold** or **ALM-14002 DataNode Disk Usage Exceeds the Threshold** exists on the **O&M > Alarm > Alarms** page. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`7 `. + +5. .. _alm-14023__li31013859102848: + + Handle the alarm by referring to instructions in **ALM-14001 HDFS Disk Usage Exceeds the Threshold** or **ALM-14002 DataNode Disk Usage Exceeds the Threshold** and check whether the alarm is cleared. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`7 `. + +6. .. _alm-14023__li20775880102848: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Expand the DataNode capacity.** + +7. .. _alm-14023__li16883378102848: + + Expand the DataNode capacity. + +8. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-14023__li35167437102848: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +10. Select **HDFS** in the required cluster from the **Service**. + +11. 
Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 20 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417368.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14024_tenant_space_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14024_tenant_space_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..98a1024 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14024_tenant_space_usage_exceeds_the_threshold.rst @@ -0,0 +1,117 @@ +:original_name: ALM-14024.html + +.. _ALM-14024: + +ALM-14024 Tenant Space Usage Exceeds the Threshold +================================================== + +Description +----------- + +The system checks the space usage (used space of each directory/space allocated to each directory) of each directory associated with a tenant every hour and compares the space usage of each directory with the threshold set for the directory. This alarm is generated when the space usage exceeds the threshold. + +This alarm is cleared when the space usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14024 Minor Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+-----------------------------------------------------------+ +| Name | Meaning | ++===================+===========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| TenantName | Specifies the tenant for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| DirectoryName | Specifies the directory for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| Trigger condition | Specifies the threshold for triggering the alarm. | ++-------------------+-----------------------------------------------------------+ + +Impact on the System +-------------------- + +This alarm is generated if the space usage of the tenant directory exceeds the custom threshold. File writing to the directory is not affected. If the used space exceeds the maximum storage space allocated to the directory, the HDFS fails to write data to the directory. 
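+
+To see how close the directory is to its quota before adjusting the threshold or space allocation in the following procedure, you can query the space quota and usage from an HDFS client. This is a minimal sketch; **/tenant/ta1** is a hypothetical directory name and should be replaced with the directory reported in the **DirectoryName** field of the alarm::
+
+   # Print quota information for the tenant directory (hypothetical path).
+   hdfs dfs -count -q -h /tenant/ta1
+
+In the command output, the third and fourth columns are the space quota and the remaining space quota of the directory, which correspond to the storage space allocated to the tenant directory and the space that is still available.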
+ +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The space allocated to the tenant is improper. + +Procedure +--------- + +**Check whether the alarm threshold is appropriate.** + +#. View the alarm location information to obtain the tenant name and tenant directory for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose the **Tenant Resources** page, select the tenant for which the alarm is generated, and click **Resources**. Check whether the storage space threshold configured for the tenant directory for which the alarm is generated is proper. (The default value 90% is a proper value. You can set it based on the site requirements.) + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`3 `. + +#. .. _alm-14024__li195771843121910: + + On the **Resources** page, click **Modify** to modify or delete the storage space threshold. + +#. About one minute later, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check whether the space allocated to the tenant is appropriate.** + +5. .. _alm-14024__li125757435197: + + On the FusionInsight Manager portal, choose the **Tenant** **Resources** page, select the tenant for which the alarm is generated, and click **Resources**. Check whether the storage space quota of the tenant directory for which the alarm is generated is proper based on the actual service status of the tenant directory. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`6 `. + +6. .. _alm-14024__li2022812133204: + + On the **Resources** page, click **Modify** to modify the storage space quota. + +7. About one minute later, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Collect fault information.** + +8. .. _alm-14024__li10103154142014: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +9. Select **HDFS** in the required cluster and **NodeAgent** under **Manager** from the **Service**. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 20 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417369.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14025_tenant_file_object_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14025_tenant_file_object_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..71ad788 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14025_tenant_file_object_usage_exceeds_the_threshold.rst @@ -0,0 +1,117 @@ +:original_name: ALM-14025.html + +.. 
_ALM-14025: + +ALM-14025 Tenant File Object Usage Exceeds the Threshold +======================================================== + +Description +----------- + +The system checks the file object usage (used file objects of each directory/number of file objects allocated to each directory) of each directory associated with a tenant every hour and compares the file object usage of each directory with the threshold set for the directory. This alarm is generated when the file object usage exceeds the threshold. + +This alarm is cleared when the file object usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +14025 Minor Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+-----------------------------------------------------------+ +| Name | Meaning | ++===================+===========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| TenantName | Specifies the tenant for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| DirectoryName | Specifies the directory for which the alarm is generated. | ++-------------------+-----------------------------------------------------------+ +| Trigger condition | Specifies the threshold for triggering the alarm. | ++-------------------+-----------------------------------------------------------+ + +Impact on the System +-------------------- + +This alarm is generated if the usage of file objects in a tenant directory exceeds the custom threshold. File writing to the directory is not affected. If the number of used file objects exceeds the maximum number of file objects allocated to the directory, the HDFS fails to write data to the directory. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The maximum number of file objects allocated to the tenant directory is inappropriate. + +Procedure +--------- + +**Check whether the alarm threshold is appropriate.** + +#. View the alarm location information to obtain the tenant name and tenant directory for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose the **Tenant Resources** page, select the tenant for which the alarm is generated, and click **Resources**. Check whether the file object threshold configured for the tenant directory for which the alarm is generated is proper. (The default value 90% is a proper value. You can set it based on the site requirements.) + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`3 `. + +#. .. _alm-14025__li195771843121910: + + On the **Resources** page, click **Modify** to modify or delete the file object threshold of the tenant directory for which the alarm is generated. + +#. 
About one minute later, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check whether the maximum number of file objects allocated to the tenant is appropriate.** + +5. .. _alm-14025__li125757435197: + + On the FusionInsight Manager portal, choose the **Tenant** **Resources** page, select the tenant for which the alarm is generated, and click **Resources**. Check whether the maximum number of file objects configured for the tenant directory for which the alarm is generated is proper based on the actual service status of the tenant directory. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`6 `. + +6. .. _alm-14025__li2022812133204: + + On the **Resources** page, click **Modify** to modify or delete the maximum number of file objects configured for the tenant directory. + +7. About one minute later, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Collect fault information.** + +8. .. _alm-14025__li10103154142014: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +9. Select **HDFS** in the required cluster and **NodeAgent** under **Manager** from the **Service**. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 20 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417370.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14026_blocks_on_datanode_exceed_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14026_blocks_on_datanode_exceed_the_threshold.rst new file mode 100644 index 0000000..198a01b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14026_blocks_on_datanode_exceed_the_threshold.rst @@ -0,0 +1,141 @@ +:original_name: ALM-14026.html + +.. _ALM-14026: + +ALM-14026 Blocks on DataNode Exceed the Threshold +================================================= + +Description +----------- + +The system checks the number of blocks on each DataNode every 30 seconds. This alarm is generated when the number of blocks on the DataNode exceeds the threshold. + +If **Trigger Count** is **1** and the number of blocks on the DataNode is less than or equal to the threshold, this alarm is cleared. If **Trigger Count** is greater than **1** and the number of blocks on the DataNode is less than or equal to 90% of the threshold, this alarm is cleared. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14026 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. 
| ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If this alarm is reported, there are too many blocks on the DataNode. In this case, data writing into the HDFS may fail due to insufficient disk space. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. + +- Data skew occurs among DataNodes. +- The disk space configured for the HDFS cluster is insufficient. + +Procedure +--------- + +**Change the threshold.** + +#. On FusionInsight Manager, choose **Cluster**, click the name of the desired cluster, and choose **HDFS**. Then choose **Configurations** > **All Configurations**. On the displayed page, find the **GC_OPTS** parameter under **HDFS->DataNode**. +#. Set the threshold of the DataNode blocks. Specifically, change the value of **Xmx** of the **GC_OPTS** parameter. **Xmx** specifies the memory, and each GB memory supports a maximum of 500,000 DataNode blocks. Set the memory as required. Confirm that **GC_PROFILE** is set to **custom** and save the configuration. +#. Choose **Cluster**, click the name of the desired cluster, and choose **HDFS** > **Instance**. Select the DataNode instance whose status is **Expired**, click **More**, and select **Restart Instance** to make the **GC_OPTS** configuration take effect. +#. Check whether the alarm is cleared 5 minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check whether associated alarms are reported.** + +5. .. _alm-14026__li10750133111389: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**, and check whether the **ALM-14002 DataNode Disk Usage Exceeds the Threshold** alarm exists. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`8 `. + +6. .. _alm-14026__li5750123115384: + + Handle the alarm by following the instructions in **ALM-14002 DataNode Disk Usage Exceeds the Threshold** and check whether the alarm is cleared. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`8 `. + +7. .. _alm-14026__li10751231113815: + + Check whether the alarm is cleared 5 minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Expand the DataNode capacity.** + +8. .. _alm-14026__li4795431151710: + + Expand the DataNode capacity. + +9. On FusionInsight Manager, wait for 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`10 `. + +**Collect the fault information.** + +10. .. _alm-14026__li10844183481711: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +11. Expand the drop-down list next to the **Service** field. In the **Services** dialog box that is displayed, select **HDFS** for the target cluster. + +12. 
Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 20 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +13. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +**Configuration rules of the DataNode JVM parameter.** + +Default value of the DataNode JVM parameter **GC_OPTS**: + +-Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=128M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Djdk.tls.ephemeralDHKeySize=2048 + +The average number of blocks stored in each DataNode instance in the cluster is: Number of HDFS blocks x 3/Number of DataNodes. If the average number of blocks changes, you need to change **-Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M** in the default value. The following table lists the reference values. + +.. table:: **Table 1** DataNode JVM configuration + + +-------------------------------------------------+----------------------------------------------------+ + | Average Number of Blocks in a DataNode Instance | Reference Value | + +=================================================+====================================================+ + | 2,000,000 | -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M | + +-------------------------------------------------+----------------------------------------------------+ + | 5,000,000 | -Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G | + +-------------------------------------------------+----------------------------------------------------+ + +**Xmx** specifies memory which corresponds to the threshold of the number of DataNode blocks, and each GB memory supports a maximum of 500,000 DataNode blocks. Set the memory as required. + +.. |image1| image:: /_static/images/en-us_image_0263895589.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14027_datanode_disk_fault.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14027_datanode_disk_fault.rst new file mode 100644 index 0000000..d3c0ed9 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14027_datanode_disk_fault.rst @@ -0,0 +1,118 @@ +:original_name: ALM-14027.html + +.. _ALM-14027: + +ALM-14027 DataNode Disk Fault +============================= + +Description +----------- + +The system checks the disk status on DataNodes every 60 seconds. This alarm is generated when a disk is faulty. + +After all faulty disks on the DataNode are recovered, you need to manually clear the alarm and restart the DataNode. 
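+
+Before clearing the alarm and restarting the DataNode, it can help to confirm from the operating system that the recovered disk is mounted and writable. This is a minimal sketch run on the affected node; **/srv/BigData/hadoop/data1** is a hypothetical DataNode data directory and should be replaced with the path reported in the **Failed Volumes** field of the alarm::
+
+   # Hypothetical data directory; use the path from the Failed Volumes field.
+   DATA_DIR=/srv/BigData/hadoop/data1
+   # Confirm that the partition is mounted and has free space.
+   df -h "$DATA_DIR"
+   # Confirm that the directory is writable.
+   touch "$DATA_DIR/.disk_check" && rm -f "$DATA_DIR/.disk_check" && echo "write test passed"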
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14027 Major No +======== ============== ========== + +Parameters +---------- + +============== ======================================================= +Name Meaning +============== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +Failed Volumes Specifies the list of faulty disks. +============== ======================================================= + +Impact on the System +-------------------- + +If this alarm is reported, there are abnormal disk partitions on the DataNode. This may cause the loss of written files. + +Possible Causes +--------------- + +- The hard disk is faulty. +- The disk permissions are configured improperly. + +Procedure +--------- + +**Check whether a disk alarm is generated.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** and check whether **ALM-12014 Partition Lost** or **ALM-12033 Slow Disk Fault** exists. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`4 `. + +#. .. _alm-14027__li106705312711: + + Rectify the fault by referring to the handling procedure of **ALM-12014 Partition Lost** or **ALM-12033 Slow Disk Fault**. Then, check whether the alarm is cleared. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +#. .. _alm-14027__li1067073192717: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Modify disk permissions.** + +4. .. _alm-14027__li76681531273: + + Choose **O&M** > **Alarm** > **Alarms** and view **Location** and **Additional Information** of the alarm to obtain the location of the faulty disk. + +5. Log in to the node for which the alarm is generated as user **root**. Go to the directory where the faulty disk is located, and run the **ll** command to check whether the permission of the faulty disk is **711** and whether the user is **omm**. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`6 `. + +6. .. _alm-14027__li188961329122819: + + Modify the permission of the faulty disk. For example, if the faulty disk is **data1**, run the following commands: + + **chown omm:wheel data1** + + **chmod 711 data1** + +7. In the alarm list on Manager, click **Clear** in the **Operation** column of the alarm to manually clear the alarm. Choose **Cluster** > **Services** > **HDFS** > **Instance**, select the DataNode, choose **More** > **Restart Instance**, wait for 5 minutes, and check whether a new alarm is reported. + + - If no, no further action is required. + - If yes, go to :ref:`8 `. + +**Collect the fault information.** + +8. .. _alm-14027__li206502049133310: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +9. Expand the **Service** drop-down list, and select **HDFS** and **OMS** for the target cluster. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 20 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact O&M personnel and provide the collected logs. 
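+
+The permission check and fix in steps 5 and 6 can also be performed non-interactively. This is a minimal sketch; **/srv/BigData/hadoop/data1** is a hypothetical path of the faulty disk directory and should be replaced with the location obtained in step 4::
+
+   # Hypothetical faulty disk directory from the alarm location information.
+   DISK_DIR=/srv/BigData/hadoop/data1
+   # Print the owner and permission mode; the expected output is "omm 711".
+   stat -c '%U %a' "$DISK_DIR"
+   # If the owner or mode does not match, correct them as described in step 6.
+   chown omm:wheel "$DISK_DIR"
+   chmod 711 "$DISK_DIR"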
+ +Alarm Clearing +-------------- + +After the fault is rectified, the system does not automatically clear this alarm and you need to manually clear the alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895589.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14028_number_of_blocks_to_be_supplemented_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14028_number_of_blocks_to_be_supplemented_exceeds_the_threshold.rst new file mode 100644 index 0000000..5a03146 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14028_number_of_blocks_to_be_supplemented_exceeds_the_threshold.rst @@ -0,0 +1,150 @@ +:original_name: ALM-14028.html + +.. _ALM-14028: + +ALM-14028 Number of Blocks to Be Supplemented Exceeds the Threshold +=================================================================== + +Description +----------- + +The system checks the number of blocks to be supplemented every 30 seconds and compares the number with the threshold. The number of blocks to be supplemented has a default threshold. This alarm is generated when the number of blocks to be supplemented exceeds the threshold. + +You can change the threshold specified by **Blocks Under Replicated (NameNode)** by choosing **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **HDFS** > **File and Block**. + +If **Trigger Count** is set to **1** and the number of blocks to be supplemented is less than or equal to the threshold, this alarm is cleared. If **Trigger Count** is greater than **1** and the number of blocks to be supplemented is less than or equal to 90% of the threshold, this alarm is cleared. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14028 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------+ +| Name | Meaning | ++===================+=============================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| NameServiceName | Specifies the NameService for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------+ + +Impact on the System +-------------------- + +Data stored in HDFS is lost. HDFS may enter the security mode and cannot provide write services. Lost block data cannot be restored. + +Possible Causes +--------------- + +- The DataNode instance is abnormal. +- Data is deleted. 
+- The number of replicas written into the file is greater than the number of DataNodes. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, check whether alarm **ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold** is generated. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`3 `. + +#. .. _alm-14028__li23401293163156: + + Rectify the fault according to the handling procedure of **ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold**. Five minutes later, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +3. .. _alm-14028__li2696171714538: + + Log in to the HDFS client as user **root**. The user password is defined by the user before the installation. Contact the MRS cluster administrator to obtain the password. Run the following commands: + + - Security mode: + + **cd** *Client installation directory* + + **source bigdata_env** + + **kinit hdfs** + + - Normal mode: + + **su - omm** + + **cd** *Client installation directory* + + **source bigdata_env** + +4. Run the **hdfs fsck / >> fsck.log** command to obtain the status of the current cluster. + +5. Run the following command to count the number (*M*) of blocks to be replicated: + + **cat fsck.log \| grep "Under-replicated"** + +6. Run the following command to count the number (*N*) of blocks to be replicated in the **/tmp/hadoop-yarn/staging/** directory: + + **cat fsck.log \| grep "Under replicated" \| grep "/tmp/hadoop-yarn/staging/" \| wc -l** + + .. note:: + + **/tmp/hadoop-yarn/staging/** is the default directory. If the directory is modified, obtain it from the configuration item **yarn.app.mapreduce.am.staging-dir** in the **mapred-site.xml** file. + +7. Check whether the percentage of *N* is greater than 50% (N/M > 50%). + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +8. .. _alm-14028__li181311850105810: + + Run the following command to reconfigure the number of file replicas in the directory (set the number of file replicas to the number of DataNodes or the default number of file replicas): + + **hdfs dfs -setrep -w** **Number of file replicas**\ **/tmp/hadoop-yarn/staging/** + + .. note:: + + To obtain the default number of file replicas: + + Log in to FusionInsight Manager, choose **Cluster > Services > HDFS > Configurations > All Configurations**, and search for the **dfs.replication** parameter. The value of this parameter is the default number of file replicas. + + Check whether the alarm is cleared 5 minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect the fault information.** + +9. .. _alm-14028__li1649292775015: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +10. Expand the drop-down list next to the **Service** field. In the **Services** dialog box that is displayed, select **HDFS** for the target cluster. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269417373.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14029_number_of_blocks_in_a_replica_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14029_number_of_blocks_in_a_replica_exceeds_the_threshold.rst new file mode 100644 index 0000000..e5308be --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14029_number_of_blocks_in_a_replica_exceeds_the_threshold.rst @@ -0,0 +1,126 @@ +:original_name: ALM-14029.html + +.. _ALM-14029: + +ALM-14029 Number of Blocks in a Replica Exceeds the Threshold +============================================================= + +Description +----------- + +The system checks the number of blocks in a single replica every four hours and compares the number with the threshold. There is a threshold for the number of blocks in a single replica. This alarm is generated when the actual number of blocks in a single replica exceeds the threshold. + +This alarm is cleared when the number of blocks to be supplemented is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14029 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------+ +| Name | Meaning | ++===================+=============================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| NameServiceName | Specifies the NameService for which the alarm is generated. | ++-------------------+-------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------+ + +Impact on the System +-------------------- + +Replica data is prone to be lost when a node is faulty. Too many files of a single replica affect the security of the HDFS file system. + +Possible Causes +--------------- + +- The DataNode is faulty. +- The disk is faulty. +- Files are written to a single replica. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, check whether alarm **ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold** is generated. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`3 `. + +#. .. _alm-14029__li23401293163156: + + Rectify the fault according to the handling procedure of **ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold**. In the next detection period, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm-14029__li17602112155716: + + Check whether files of a single replica have been written into the service. 
+ + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-14029__li2696171714538: + + Log in to the HDFS client as user **root**. The user password is defined by the user before the installation. Contact the MRS cluster administrator to obtain the password. Run the following commands: + + - Security mode: + + **cd** *Client installation directory* + + **source bigdata_env** + + **kinit hdfs** + + - Normal mode: + + **su - omm** + + **cd** *Client installation directory* + + **source bigdata_env** + +#. Run the following command on the client node to increase the number of replicas for a single replica file: + + **hdfs dfs -setrep -w** *file replica number* *file name or file path* + +#. In the next detection period, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect the fault information.** + +7. .. _alm-14029__li12256203224411: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +8. Expand the drop-down list next to the **Service** field. In the **Services** dialog box that is displayed, select **HDFS** for the target cluster. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417374.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14030_hdfs_allows_write_of_single-replica_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14030_hdfs_allows_write_of_single-replica_data.rst new file mode 100644 index 0000000..416c77b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-14030_hdfs_allows_write_of_single-replica_data.rst @@ -0,0 +1,78 @@ +:original_name: ALM-14030.html + +.. _ALM-14030: + +ALM-14030 HDFS Allows Write of Single-Replica Data +================================================== + +Description +----------- + +This alarm is generated when **dfs.single.replication.enable** is set to **true**, indicating that HDFS is configured to allow write of single-replica data. + +This alarm is cleared when this function is disabled on HDFS. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14030 Warning Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Data of a single replica may be lost. Therefore, the system does not allow write of single-replica data by default. 
If this configuration is enabled on HDFS and the number of HDFS replicas configured on the client is 1, single-replica data can be written to HDFS. + +Possible Causes +--------------- + +The HDFS configuration item **dfs.single.replication.enable** is set to **true**. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **Cluster** > **Services** > **HDFS**. On the page that is displayed, click the **Configurations** tab then the **All Configurations** sub-tab. +#. Search for **dfs.single.replication.enable** in the search box, change the value of the configuration item to **false**, and click **Save**. +#. On the **Dashboard** page of the HDFS service, click **More** and select **Service Rolling Restart** in the upper right corner. +#. After the HDFS service is started, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Collect fault information.** + +5. .. _alm-14030__li5733165943316: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +6. Expand the drop-down list next to the **Service** field. In the **Services** dialog box that is displayed, select **HDFS** for the target cluster. + +7. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +8. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001383088002.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16000_percentage_of_sessions_connected_to_the_hiveserver_to_maximum_number_allowed_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16000_percentage_of_sessions_connected_to_the_hiveserver_to_maximum_number_allowed_exceeds_the_threshold.rst new file mode 100644 index 0000000..e3f12df --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16000_percentage_of_sessions_connected_to_the_hiveserver_to_maximum_number_allowed_exceeds_the_threshold.rst @@ -0,0 +1,87 @@ +:original_name: ALM-16000.html + +.. _ALM-16000: + +ALM-16000 Percentage of Sessions Connected to the HiveServer to Maximum Number Allowed Exceeds the Threshold +============================================================================================================ + +Description +----------- + +The system detects the percentage of sessions connected to the HiveServer to the maximum number of allowed sessions every 30 seconds. This indicator can be viewed on the **Cluster** > *Name of the desired cluster* > **Services** > **Hive > Instance** > *HiveServer instance*\ **.** This alarm is generated when the percentage exceeds the default value **90%**. + +To change the threshold, choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* > **Hive > Percentage of Sessions Connected to the HiveServer to Maximum Number of Sessions Allowed by the HiveServer**. + +When the **Trigger Count** is 1, this alarm is cleared when the percentage is less than or equal to the threshold. 
When the **Trigger Count** is greater than 1, this alarm is cleared when the percentage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +16000 Minor Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If a connection alarm is generated, too many sessions are connected to Hive and new connections are unavailable. + +Possible Causes +--------------- + +Too many clients are connected to HiveServer. + +Procedure +--------- + +**Increase the maximum number of connections to Hive.** + +#. On the FusionInsight Manager portal, Choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations >All Configurations**. +#. Search for **hive.server.session.control.maxconnections** and increase the value of this parameter. If the value of this parameter is **A**, the threshold is **B**, and the number of sessions connected to the HiveServer is **C**, adjust the value of this parameter according to **A x B > C**. To view the number of sessions connected to the HiveServer, check the value of **Statistics for Sessions of the HiveServer** on the Hive monitoring page. +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-16000__li51857364113837: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +5. Select **Hive** in the required cluster from the **Service**. + +6. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +7. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. 
+ +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417375.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16001_hive_warehouse_space_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16001_hive_warehouse_space_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..4741dcc --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16001_hive_warehouse_space_usage_exceeds_the_threshold.rst @@ -0,0 +1,126 @@ +:original_name: ALM-16001.html + +.. _ALM-16001: + +ALM-16001 Hive Warehouse Space Usage Exceeds the Threshold +========================================================== + +Description +----------- + +This alarm is generated when the Hive warehouse space usage exceeds the specified threshold (85% by default). The system checks the Hive data warehouse space usage every 30s. The indicator **Percentage of HDFS Space Used by Hive to the Available Space** can be viewed on the Hive service monitoring page. + +To change the threshold, choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* > **Hive > Percentage of HDFS Space Used by Hive to the Available Space**. + +When the **Trigger Count** is 1, this alarm is cleared when the Hive warehouse space usage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the Hive warehouse space usage is less than or equal to 90% of the threshold. + +.. note:: + + The administrator can reduce the warehouse space usage by expanding the warehouse capacity or releasing the used space. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +16001 Minor Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The system fails to write data, which causes data loss. + +Possible Causes +--------------- + +- The upper limit of the HDFS capacity available for Hive is too small. +- The HDFS space is insufficient. +- Some data nodes break down. + +Procedure +--------- + +**Expand the system configuration.** + +#. Analyze the cluster HDFS capacity usage and increase the upper limit of the HDFS capacity available for Hive. + + Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations > All Configurations**, find **hive.metastore.warehouse.size.percent**, and increase its value so that larger HDFS capacity will be available for Hive. Assume that the value of the configuration item is A, the total HDFS storage space is B, the threshold is C, and the HDFS space used by Hive is D. The adjustment policy is A x B x C > D. The total HDFS storage space can be viewed on the HDFS NameNode page. The HDFS space used by Hive can be viewed on the Hive monitoring page. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +**Expand the system.** + +3. .. _alm-16001__li1104893615539: + + Expand the system. + +4. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check whether the data node is normal.** + +5. .. _alm-16001__li56638164155624: + + On the FusionInsight Manager portal, click **O&M > Alarm > Alarms**. + +6. Check whether "ALM-12006 Node Fault", "ALM-12007 Process Fault", or "ALM-14002 DataNode Disk Usage Exceeds the Threshold" exist. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`9 `. + +7. .. _alm-16001__li46923068155624: + + Clear the alarm by following the steps provided in "ALM-12006 Node Fault", "ALM-12007 Process Fault", and "ALM-14002 DataNode Disk Usage Exceeds the Threshold". + +8. Check whether the alarm is cleared. + +- If yes, no further action is required. +- If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-16001__li3112518015571: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +10. Select **Hive** in the required cluster from the **Service**. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269417376.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16002_hive_sql_execution_success_rate_is_lower_than_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16002_hive_sql_execution_success_rate_is_lower_than_the_threshold.rst new file mode 100644 index 0000000..7c392d7 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16002_hive_sql_execution_success_rate_is_lower_than_the_threshold.rst @@ -0,0 +1,156 @@ +:original_name: ALM-16002.html + +.. _ALM-16002: + +ALM-16002 Hive SQL Execution Success Rate Is Lower Than the Threshold +===================================================================== + +Description +----------- + +The system checks the percentage of the HQL statements that are executed successfully in every 30 seconds. The formula is: Percentage of HQL statements that are executed successfully = Number of HQL statements that are executed successfully by Hive in a specified period/Total number of HQL statements that are executed by Hive. This indicator can be viewed on the **Cluster >** *Name of the desired cluster* **> Services** > **Hive > Instance** > *HiveServer instance* **.** The default threshold of the percentage of HQL statements that are executed successfully is **90%**. An alarm is reported when the percentage is lower than the **90%**. Users can view the name of the host where an alarm is generated in the location information about the alarm. The IP address of the host is the IP address of the HiveServer node. + +Users can modify the threshold by choosing **O&M > Alarm > Thresholds >** *Name of the desired cluster* > **Hive** > **Percentage of HQL Statements That Are Executed Successfully by Hive**. + +This alarm is cleared when the execution success rate is higher than 110% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +16002 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The system configuration and performance cannot meet service processing requirements. + +Possible Causes +--------------- + +- A syntax error occurs in HQL statements. +- The HBase service is abnormal when a Hive on HBase task is performed. +- The Spark service is abnormal when a Hive on Spark task is performed. +- The dependent basic services, such as HDFS, Yarn, and ZooKeeper, are abnormal. + +Procedure +--------- + +**Check whether the HQL statements comply with syntax.** + +#. On the FusionInsight Manager page, choose **O&M** > **Alarm** to view the alarm details and obtain the node where the alarm is generated. + +#. Use the Hive client to log in to the HiveServer node where an alarm is reported. Query the HQL syntax provided by Apache, and check whether the HQL commands are correct. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`3 `. + + .. note:: + + To view the user who runs an incorrect statement, you can download the hiveserver audit log file of the HiveServer node where this alarm is generated. **Start Data** and **End Data** are 10 minutes before and after the alarm generation time respectively. Open the log file and search for the **Result=FAIL** keyword to filter the log information about the incorrect statement, and then view the user who runs the incorrect statement according to **UserName** in the log information. + +#. .. _alm-16002__li3343432914456: + + Enter the correct HQL statements, and check whether the command can be properly executed. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`4 `. + +**Check whether the HBase service is abnormal.** + +4. .. _alm-16002__li2677546914456: + + Check whether an Hive on HBase task is performed with the user who runs the HQL command. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`8 `. + +5. .. _alm-16002__li1989232914456: + + On the FusionInsight Manager page, click **Cluster** > *Name of the desired cluster* > **Services**, check whether the HBase service is normal in the service list. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`6 `. + +6. .. _alm-16002__li4481323314456: + + Choose **O&M** > **Alarm**, check the related alarms displayed on the alarm page and clear them according to related alarm help. + +7. Enter the correct HQL statements, and check whether the command can be properly executed. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`8 `. + +**Check whether the HDFS, Yarn, and ZooKeeper are normal.** + +8. .. _alm-16002__li4623094014456: + + On the FusionInsight Manager portal, click **Cluster** > *Name of the desired cluster* > **Services**. + +9. In the service list, check whether the services, such as HDFS, Yarn, and ZooKeeper are normal. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`10 `. + +10. .. _alm-16002__li6532844614456: + + Check the related alarms displayed on the alarm page and clear them according to related alarm help. + +11. Enter the correct HQL statements, and check whether the command can be properly executed. + + - If yes, go to :ref:`12 `. 
+ - If no, go to :ref:`13 `. + +12. .. _alm-16002__li5821800114456: + + After 1 minute, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`13 `. + +**Collect fault information.** + +13. .. _alm-16002__li2812112614456: + + On the FusionInsight Manager home page, choose **O&M** > **Log > Download**. + +14. Select the following nodes in the required cluster from the **Service**: + + - MapReduce + - Hive + +15. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +16. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417377.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16003_background_thread_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16003_background_thread_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..5869b81 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16003_background_thread_usage_exceeds_the_threshold.rst @@ -0,0 +1,116 @@ +:original_name: ALM-16003.html + +.. _ALM-16003: + +ALM-16003 Background Thread Usage Exceeds the Threshold +======================================================= + +Description +----------- + +The system checks the background thread usage in every 30 seconds. This alarm is generated when the usage of the background thread pool of Hive exceeds the threshold, 90% by default. + +.. note:: + + MRS 3.X supports the multi-instance function. If the multi-instance function is enabled in the cluster and multiple Hive services are installed, determine the Hive service for which the alarm is generated based on the value of **ServiceName** in **Location** of the alarm. For example, if Hive1 service is unavailable, **ServiceName** is set to **Hive1** in **Location**, and the operation object in the handling procedure is changed from Hive to Hive1. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +16003 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +There are too many background threads, so the newly submitted task cannot run in time. + +Possible Causes +--------------- + +The usage of the background thread pool of Hive is excessively high when: + +- There are many tasks executed in the background thread pool of HiveServer. +- The capacity of the background thread pool of HiveServer is too small. + +Procedure +--------- + +**Check the number of tasks executed in the background thread pool of HiveServer.** + +#. On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive**. On the displayed page, click **HiveServer Instance** and check values of **Background Thread Count** and **Background Thread Usage**. + +#. Check whether the number of background threads in the latest half an hour is excessively high. (By default, the queue number is 100, and the thread number is considered as high if it is 90 or larger.) + + - If it is, go to :ref:`3 `. + - If it is not, go to :ref:`5 `. + +#. .. _alm-16003__li7203188143816: + + Adjust the number of tasks submitted to the background thread pool. (For example, cancel some time-consuming tasks with low performance.) + +#. Check whether the values of Background Thread Count and Background Thread Usage decrease. + + - If it is, go to :ref:`7 `. + - If it is not, go to :ref:`5 `. + +**Check the capacity of the HiveServer background thread pool.** + +5. .. _alm-16003__li1418798143810: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive**. On the displayed page, click **HiveServer Instance** and check values of Background Thread Count and Background Thread Usage. + +6. Increase the value of **hive.server2.async.exec.threads** in the **${BIGDATA_HOME}/FusionInsight_HD\_8.1.0.1/1_23_HiveServer/etc/hive-site.xml** file. For example, increase the value by 20%. + +7. .. _alm-16003__li73422961119: + + Save the modification. + +8. Check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-16003__li3112518015571: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +10. Select **Hive** in the required cluster from the **Service**. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269417379.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16004_hive_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16004_hive_service_unavailable.rst new file mode 100644 index 0000000..46832d8 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16004_hive_service_unavailable.rst @@ -0,0 +1,218 @@ +:original_name: ALM-16004.html + +.. _ALM-16004: + +ALM-16004 Hive Service Unavailable +================================== + +Description +----------- + +This alarm is generated when the HiveServer service is unavailable. The system checks the HiveServer service status every 60 seconds. + +This alarm is cleared when the HiveServer service is normal. + +.. note:: + + MRS 3.X supports the multi-instance function. If the multi-instance function is enabled in the cluster and multiple Hive service instances are installed, you need to determine the Hive service instance where the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the Hive1 service is unavailable, **ServiceName=Hive1** is displayed in **Location**, and the operation object in the procedure needs to be changed from Hive to Hive1. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +16004 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The system cannot provide data loading, query, and extraction services. + +Possible Causes +--------------- + +- Hive service unavailability may be related to the faults of the Hive process as well as basic services, such as ZooKeeper, Hadoop distributed file system (HDFS), Yarn, and DBService. + + - The ZooKeeper service is abnormal. + - The HDFS service is abnormal. + - The Yarn service is abnormal. + - The DBService service is abnormal. + - The Hive service process is abnormal. If the alarm is caused by Hive process fault, the alarm report has a delay of about 5 minutes. + +- The network communication between the Hive and basic services is interrupted. + +Procedure +--------- + +**Check the HiveServer/MetaStore process status.** + +#. On the FusionInsight Manager portal, click **Cluster >** *Name of the desired cluster* **> Services** > **Hive** > **Instance**. In the Hive instance list, check whether the HiveServer or MetaStore instances are in the Unknown state. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`4 `. + +#. .. _alm-16004__li45196532141457: + + In the Hive instance list, choose **More** > **Restart Instance** to restart the HiveServer/MetaStore process. + +#. In the alarm list, check whether **Hive Service Unavailable** is cleared. + + - If yes, no further action is required. 
+ - If no, go to :ref:`4 `. + +**Check the ZooKeeper service status.** + +4. .. _alm-16004__li31923589141457: + + On the FusionInsight Manager, check whether the alarm list contains **Process Fault**. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`8 `. + +5. .. _alm-16004__li58014365141457: + + In the **Process Fault**, check whether **ServiceName** is **ZooKeeper**. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`8 `. + +6. .. _alm-16004__li1543118141457: + + Rectify the fault by following the steps provided in "ALM-12007 Process Fault". + +7. In the alarm list, check whether **Hive Service Unavailable** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Check the HDFS service status.** + +8. .. _alm-16004__li41412512141457: + + On the FusionInsight Manager, check whether the alarm list contains **HDFS Service Unavailable**. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`11 `. + +9. .. _alm-16004__li66079189141457: + + Rectify the fault by following the steps provided in "ALM-14000 HDFS Service Unavailable". + +10. In the alarm list, check whether **Hive Service Unavailable** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Check the Yarn service status.** + +11. .. _alm-16004__li26828739141457: + + In FusionInsight Manager alarm list, check whether **Yarn Service Unavailable** is generated. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`14 `. + +12. .. _alm-16004__li25644284141457: + + Rectify the fault. For details, see "ALM-18000 Yarn Service Unavailable". + +13. In the alarm list, check whether **Hive Service Unavailable** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`14 `. + +**Check the DBService service status.** + +14. .. _alm-16004__li53539591141457: + + In FusionInsight Manager alarm list, check whether **DBService Service Unavailable** is generated. + + - If yes, go to :ref:`15 `. + - If no, go to :ref:`17 `. + +15. .. _alm-16004__li41739587141457: + + Rectify the fault. For details, see "ALM-27001 DBService Service Unavailable". + +16. In the alarm list, check whether **Hive Service Unavailable** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`17 `. + +**Check the network connection between the Hive and ZooKeeper, HDFS, Yarn, and DBService.** + +17. .. _alm-16004__li44837990141457: + + On the FusionInsight Manager, choose **Cluster >** *Name of the desired cluster* **> Services** > **Hive**. + +18. Click **Instance**. + + The HiveServer instance list is displayed. + +19. Click **Host Name** in the row of **HiveServer**. + + The active HiveServer host status page is displayed. + +20. .. _alm-16004__li19527969141457: + + Record the IP address under **Basic Information**. + +21. Use the IP address obtained in :ref:`20 ` to log in to the host where the active HiveServer runs as user **omm**. + +22. Run the **ping** command to check whether communication between the host that runs the active HiveServer and the hosts that run the ZooKeeper, HDFS, Yarn, and DBService services is normal. (Obtain the IP addresses of the hosts that run the ZooKeeper, HDFS, Yarn, and DBService services in the same way as that for obtaining the IP address of the active HiveServer.) + + - If yes, go to :ref:`25 `. + - If no, go to :ref:`23 `. + +23. .. _alm-16004__li42271322141457: + + Contact the administrator to restore the network. + +24. In the alarm list, check whether **Hive Service Unavailable** is cleared. 
+ + - If yes, no further action is required. + - If no, go to :ref:`25 `. + +**Collect fault information.** + +25. .. _alm-16004__li18695793141457: + + On the FusionInsight Manager, choose **O&M** > **Log > Download**. + +26. Select the following nodes in the required cluster from the **Service**: + + - ZooKeeper + - HDFS + - Yarn + - DBService + - Hive + +27. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +28. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417380.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16005_the_heap_memory_usage_of_the_hive_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16005_the_heap_memory_usage_of_the_hive_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..d72348c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16005_the_heap_memory_usage_of_the_hive_process_exceeds_the_threshold.rst @@ -0,0 +1,120 @@ +:original_name: ALM-16005.html + +.. _ALM-16005: + +ALM-16005 The Heap Memory Usage of the Hive Process Exceeds the Threshold +========================================================================= + +Description +----------- + +The system checks the Hive service status every 30 seconds. The alarm is generated when the heap memory usage of an Hive service exceeds the threshold (95% of the maximum memory). + +Users can choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* > **Hive** to change the threshold. + +The alarm is cleared when the heap memory usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +16005 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------+------------------------------------------------------------------+ +| Name | Meaning | ++=============+==================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When the heap memory usage of Hive is overhigh, the performance of Hive task operation is affected. In addition, a memory overflow may occur so that the Hive service is unavailable. 
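Besides the Manager monitoring charts referenced in the procedure below, the heap occupancy of the affected HiveServer or MetaStore process can be sampled directly on the node. This is only a hedged sketch: it assumes the process runs as user **omm**, the JDK's **jps** and **jstat** tools are on the PATH, and the grep pattern actually matches the process name on your node.

.. code-block:: bash

   # Locate the HiveServer JVM; the grep pattern is an assumption, adjust it as needed.
   PID=$(jps -l | grep -i hiveserver | awk '{print $1}')
   # O = old generation occupancy (%), E = eden occupancy (%); 3 samples, 5 seconds apart.
   jstat -gcutil "$PID" 5000 3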
+ +Possible Causes +--------------- + +The heap memory of the Hive instance on the node is overused or the heap memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +**Check heap memory usage.** + +#. On the FusionInsight Manager portal, click **O&M > Alarm > Alarms** and select the alarm whose **Alarm ID** is **16005**. Then check the role name in **Location** and confirm the IP adress of the instance. + + - If the role for which the alarm is generated is HiveServer, go to :ref:`2 `. + - If the role for which the alarm is generated is MetaStore, go to :ref:`3 `. + +#. .. _alm-16005__li2900058143018: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* **> Services** > **Hive** > **Instance** and click the HiveServer for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **CPU and Memory**, and select **HiveServer Memory Usage Statistics** and click **OK**, check whether the used heap memory of the HiveServer service reaches the threshold(default value: 95%) of the maximum heap memory specified for HiveServer. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-16005__li46068501143018: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Instance** and click the MetaStore for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **CPU and Memory**, and select **MetaStore Memory Usage Statistics** and click **OK**, check whether the used heap memory of the MetaStore service reaches the threshold(default value: 95%) of the maximum heap memory specified for MetaStore. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-16005__li39802450143018: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations > All Configurations**. Choose **HiveServer/MetaStore** > **JVM**. Adjust the value of **-Xmx** in **HIVE_GC_OPTS/METASTORE_GC_OPTS** as the following rules. Click **Save**. + + .. note:: + + Suggestions for GC parameter settings for the HiveServer: + + - When the heap memory used by the HiveServer process reaches the threshold (default value: 95%) of the maximum heap memory set by the HiveServer process, change the value of **-Xmx** to twice the default value. For example, if **-Xmx** is set to 2GB by default, change the value of **-Xmx** to 4GB. You are advised to change the value of **-Xms** to set the ratio of **-Xms** and **-Xmx** to 1:2 to avoid performance problems when JVM dynamically. On the FusionInsight Manager home page, choose **O&M**> **Alarm**> **Thresholds** > *Name of the desired cluster* **> Hive** > **CPU and Memory** > **HiveServer Heap Memory Usage Statistics (HiveServer)** to view **Threshold**. + + Suggestions for GC parameter settings for the MetaServer: + + - When the heap memory used by the MetaStore process reaches the threshold (default value: 95%) of the maximum heap memory set by the MetaStore process, change the value of **-Xmx** to twice the default value. For example, if **-Xmx** is set to 2GB by default, change the value of **-Xmx** to 4GB. 
On the FusionInsight Manager home page, choose **O&M**> **Alarm**> **Thresholds** > *Name of the desired cluster* **> Hive** > **CPU and Memory** > **MetaStore Heap Memory Usage Statistics (MetaStore)** to view **Threshold**. + + - You are advised to change the value of **-Xms** to set the ratio of **-Xms** and **-Xmx** to 1:2 to avoid performance problems when JVM dynamically. + +#. Click **More > Restart Service** to restart the service. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-16005__li7710755143018: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **Hive** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417381.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16006_the_direct_memory_usage_of_the_hive_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16006_the_direct_memory_usage_of_the_hive_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..3bcb90e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16006_the_direct_memory_usage_of_the_hive_process_exceeds_the_threshold.rst @@ -0,0 +1,120 @@ +:original_name: ALM-16006.html + +.. _ALM-16006: + +ALM-16006 The Direct Memory Usage of the Hive Process Exceeds the Threshold +=========================================================================== + +Description +----------- + +The system checks the Hive service status every 30 seconds. The alarm is generated when the direct memory usage of an Hive service exceeds the threshold (95% of the maximum memory). + +Users can choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* **> Hive** to change the threshold. + +The alarm is cleared when the direct memory usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +16006 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When the direct memory usage of Hive is overhigh, the performance of Hive task operation is affected. In addition, a memory overflow may occur so that the Hive service is unavailable. + +Possible Causes +--------------- + +The direct memory of the Hive instance on the node is overused or the direct memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +**Check direct memory usage.** + +#. On the FusionInsight Manager portal, click **O&M > Alarm > Alarms** and select the alarm whose **Alarm ID** is **16006**. Then check the role name in **Location** and confirm the IP adress of the instance. + + - If the role for which the alarm is generated is HiveServer, go to :ref:`2 `. + - If the role for which the alarm is generated is MetaStore, go to :ref:`3 `. + +#. .. _alm-16006__li31510133143419: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services > Hive > Instance** and click the HiveServer for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **CPU and Memory**, and select **HiveServer Memory Usage Statistics** and click **OK**, check whether the used direct memory of the HiveServer service reaches the threshold(default value: 95%) of the maximum direct memory specified for HiveServer. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-16006__li39131309143419: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Instance** and click the MetaStore for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **CPU and Memory**, and select **MetaStore Memory Usage Statistics** and click **OK**, check whether the used direct memory of the MetaStore service reaches the threshold(default value: 95%) of the maximum direct memory specified for MetaStore. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-16006__li4911009143419: + + On the FusionInsight Manager portal, choose **Cluster** >\ *Name of the desired cluster* > **Services** > **Hive** > **Configurations > All Configurations**. Choose **HiveServer/MetaStore** > **JVM**. Adjust the value of **-XX:MaxDirectMemorySize** in **HIVE_GC_OPTS/METASTORE_GC_OPTS** as the following rules. Click **Save**. + + .. 
note:: + + Suggestions for GC parameter settings for the HiveServer: + + - It is recommended that you set the value of **-XX:MaxDirectMemorySize** to 1/8 of the value of **-Xmx**. For example, if **-Xmx** is set to 8 GB, **-XX:MaxDirectMemorySize** is set to 1024 MB. If **-Xmx** is set to 4 GB, **-XX:MaxDirectMemorySize** is set to 512 MB. It is recommended that the value of **-XX:MaxDirectMemorySize** be greater than or equal to 512 MB. + + Suggestions for GC parameter settings for the MetaServer: + + - It is recommended that you set the value of **-XX:MaxDirectMemorySize** to 1/8 of the value of **-Xmx**. For example, if **-Xmx** is set to 8 GB, **-XX:MaxDirectMemorySize** is set to 1024 MB. If **-Xmx** is set to 4 GB, **-XX:MaxDirectMemorySize** is set to 512 MB. It is recommended that the value of **-XX:MaxDirectMemorySize** be greater than or equal to 512 MB. + +#. Click **More > Restart Service** to restart the service. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-16006__li32472303143419: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **Hive** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected fault logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417382.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16007_hive_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16007_hive_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..6fbbc50 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16007_hive_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,122 @@ +:original_name: ALM-16007.html + +.. _ALM-16007: + +ALM-16007 Hive GC Time Exceeds the Threshold +============================================ + +Description +----------- + +The system checks the garbage collection (GC) time of the Hive service every 60 seconds. This alarm is generated when the detected GC time exceeds the threshold (exceeds 12 seconds for three consecutive checks.) To change the threshold, choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* > **Hive**. This alarm is cleared when the Hive GC time is shorter than or equal to the threshold. 
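The procedure below enlarges the heap by doubling **-Xmx** in **HIVE_GC_OPTS**/**METASTORE_GC_OPTS** and keeping **-Xms** at half of **-Xmx**. A minimal sketch of that arithmetic, using an assumed current heap size rather than the actual default of your cluster:

.. code-block:: bash

   # Example only: the current -Xmx of 2 GB is an assumption, not a guaranteed default.
   OLD_XMX_GB=2
   NEW_XMX_GB=$((OLD_XMX_GB * 2))   # double -Xmx when the GC time keeps exceeding the threshold
   NEW_XMS_GB=$((NEW_XMX_GB / 2))   # keep -Xms at half of -Xmx (1:2 ratio)
   echo "-Xms${NEW_XMS_GB}G -Xmx${NEW_XMX_GB}G"

Only these two flags are shown; any other options already present in the GC_OPTS fields stay unchanged.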
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +16007 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the GC time exceeds the threshold, Hive data read and write are affected. + +Possible Causes +--------------- + +The memory of Hive instances is overused, the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC time.** + +#. On the FusionInsight Manager portal, click **O&M > Alarm > Alarms** and select the alarm whose **Alarm ID** is **16007**. Then check the role name in **Location** and confirm the IP adress of the instance. + + - If the role for which the alarm is generated is HiveServer, go to :ref:`2 `. + - If the role for which the alarm is generated is MetaStore, go to :ref:`3 `. + +#. .. _alm-16007__li6180447514380: + + On the FusionInsight Manager portal, choose **Cluster** >\ *Name of the desired cluster* > **Services** > **Hive** > **Instance** and click the HiveServer for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **GC**, and select **Garbage Collection (GC) Time of HiveServer** and click **OK** to check whether the GC time is longer than 12 seconds. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-16007__li3832089314380: + + On the FusionInsight Manager portal, choose **Cluster** >\ *Name of the desired* *cluster* > **Services** > **Hive** > **Instance** and click the MetaStore for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **GC**, and select **Garbage Collection (GC) Time of MetaStore** and click **OK** to check whether the GC time is longer than 12 seconds. + + - If yes, go to :ref:`4 `. 
+ - If no, go to :ref:`7 `. + +**Check the current JVM configuration.** + +4. .. _alm-16007__li542936514380: + + On the FusionInsight Manager portal, choose **Cluster** >\ *Name of the desired cluster* > **Services > Hive > Configurations > All Configurations**. Choose **HiveServer/MetaStore** > **JVM**. Adjust the value of **-Xmx** in **HIVE_GC_OPTS/METASTORE_GC_OPTS** as the following rules. Click **Save**. + + .. note:: + + Suggestions for GC parameter settings for the HiveServer: + + - When the Hive GC time exceeds the threshold, change the value of **-Xmx** to twice the default value. For example, if **-Xmx** is set to 2 GB by default, change the value of **-Xmx** to 4 GB. + + - You are advised to change the value of **-Xms** to set the ratio of **-Xms** and **-Xmx** to 1:2 to avoid performance problems when JVM dynamically. + + Suggestions for GC parameter settings for the MetaServer: + + - When the Meta GC time exceeds the threshold, change the value of **-Xmx** to twice the default value. For example, if **-Xmx** is set to 2 GB by default, change the value of **-Xmx** to 4 GB. + + - You are advised to change the value of **-Xms** to set the ratio of **-Xms** and **-Xmx** to 1:2 to avoid performance problems when JVM dynamically. + +5. Click **More > Restart Service** to restart the service. + +6. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-16007__li2731494414380: + + On the FusionInsight Manager portal of active and standby clusters, choose **O&M** > **Log > Download**. + +8. In the **Service**, select **Hive** in the required cluster. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417383.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16008_non-heap_memory_usage_of_the_hive_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16008_non-heap_memory_usage_of_the_hive_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..c8fbc52 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16008_non-heap_memory_usage_of_the_hive_process_exceeds_the_threshold.rst @@ -0,0 +1,122 @@ +:original_name: ALM-16008.html + +.. _ALM-16008: + +ALM-16008 Non-Heap Memory Usage of the Hive Process Exceeds the Threshold +========================================================================= + +Description +----------- + +The system checks the Hive service status every 30 seconds. The alarm is generated when the non-heap memory usage of an Hive service exceeds the threshold (95% of the maximum memory). + +Users can choose **O&M > Alarm > Thresholds >** *Name of the desired cluster* > **Hive** to change the threshold. + +The alarm is cleared when the non-heap memory usage is less than or equal to the threshold. 
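The procedure below sets **-XX:MaxMetaspaceSize** in **HIVE_GC_OPTS**/**METASTORE_GC_OPTS** to one eighth of **-Xmx**. A minimal sketch of that rule with example sizes (the 4 GB heap is an assumption, not a default):

.. code-block:: bash

   XMX_MB=4096                           # e.g. -Xmx4G
   MAX_METASPACE_MB=$((XMX_MB / 8))      # 1/8 of -Xmx, i.e. 512 MB here
   echo "-Xmx${XMX_MB}M -XX:MaxMetaspaceSize=${MAX_METASPACE_MB}M"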
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +16008 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------+------------------------------------------------------------------+ +| Name | Meaning | ++=============+==================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When the non-heap memory usage of Hive is overhigh, the performance of Hive task operation is affected. In addition, a memory overflow may occur so that the Hive service is unavailable. + +Possible Causes +--------------- + +The non-heap memory of the Hive instance on the node is overused or the non-heap memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +**Check non-heap memory usage.** + +#. On the FusionInsight Manager portal, click **O&M > Alarm > Alarms** and select the alarm whose **Alarm ID** is **16008**. Then check the role name in **Location** and confirm the IP adress of the instance. + + - If the role for which the alarm is generated is HiveServer, go to :ref:`2 `. + - If the role for which the alarm is generated is MetaStore, go to :ref:`3 `. + +#. .. _alm-16008__li54453327144225: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Instance** and click the HiveServer for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **CPU and Memory**, and select **HiveServer Memory Usage Statistics** and click **OK**, check whether the used non-heap memory of the HiveServer service reaches the threshold(default value: 95%) of the maximum non-heap memory specified for HiveServer. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-16008__li31617556144225: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired* *cluster* > **Services** > **Hive** > **Instance** and click the MetaStore for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **CPU and Memory**, and select **MetaStore Memory Usage Statistics** and click **OK**, check whether the used non-heap memory of the MetaStore service reaches the threshold(default value: 95%) of the maximum non-heap memory specified for MetaStore. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-16008__li24754013144225: + + On the FusionInsight Manager portal, choose **Cluster** **>** *Name of the desired* *cluster* > **Services** > **Hive** > **Configurations > All Configurations**. Choose **HiveServer/MetaStore** > **JVM**. 
Adjust the value of **-XX:MaxMetaspaceSize** in **HIVE_GC_OPTS/METASTORE_GC_OPTS** as the following rules. Click **Save**. + + .. note:: + + Suggestions for GC parameter settings for the HiveServer: + + - It is recommended that you set the value of **-XX:MaxMetaspaceSize** to 1/8 of the value of **-Xmx**. For example, if **-Xmx** is set to 2 GB, **-XX:** + + **MaxMetaspaceSize** is set to 256 MB. If **-Xmx** is set to 4 GB, **-XX:MaxMetaspaceSize** is set to 512 MB. + + Suggestions for GC parameter settings for the MetaServer: + + - It is recommended that you set the value of **-XX:MaxMetaspaceSize** to 1/8 of the value of **-Xmx**. For example, if **-Xmx** is set to 2 GB, **-XX:** + + **MaxMetaspaceSize** is set to 256 MB. If **-Xmx** is set to 4 GB, **-XX:MaxMetaspaceSize** is set to 512 MB + +#. Click **More > Restart Service** to restart the service. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-16008__li3071924144225: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **Hive** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417384.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16009_map_number_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16009_map_number_exceeds_the_threshold.rst new file mode 100644 index 0000000..59c1d13 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16009_map_number_exceeds_the_threshold.rst @@ -0,0 +1,84 @@ +:original_name: ALM-16009.html + +.. _ALM-16009: + +ALM-16009 Map Number Exceeds the Threshold +========================================== + +Description +----------- + +The system checks the number of HQL maps in every 30 seconds. This alarm is generated if the number exceeds the threshold. By default, **Trigger Count** is set to **3**, and the threshold is 5000. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +16009 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the number of HQL maps executed on Hive is excessively large, the HQL execution speed is slow, and a large number of resources are occupied. + +Possible Causes +--------------- + +The HQL statements are not the optimal. + +Procedure +--------- + +**Check the number of HQL maps.** + +#. On FusionInsight Manager portal, choose **Cluster** >\ *Name of the desired cluster* > **Services** > **Hive** > **Resource**. Check the HQL statements with the excessively large number (5000 or more) of maps in **HQL Map Count**. + +2. Locate the corresponding HQL statements, optimize them and execute them again. +3. Check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-16009__li1284813578408: + + On the FusionInsight Manager, choose **O&M** > **Log > Download**. + +5. Select **Hive** in the required cluster from the **Service**. + +6. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +7. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417385.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16045_hive_data_warehouse_is_deleted.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16045_hive_data_warehouse_is_deleted.rst new file mode 100644 index 0000000..9d2682b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16045_hive_data_warehouse_is_deleted.rst @@ -0,0 +1,93 @@ +:original_name: ALM-16045.html + +.. _ALM-16045: + +ALM-16045 Hive Data Warehouse Is Deleted +======================================== + +Description +----------- + +The system checks the Hive data warehouse in every 60 seconds.This alarm is generated when the Hive data warehouse is deleted. 
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +16045 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The default Hive data warehouse is deleted. As a result, creating databases or tables in the default data warehouse fails, and services are affected. + +Possible Causes +--------------- + +Hive periodically checks the status of the default data warehouse and finds that the default data warehouse is deleted. + +Procedure +--------- + +**Check the default Hive data warehouse.** + +#. Log in to the node where the client is located as user **root**. + +2. Run the following command to check whether the **warehouse** directory exists in **hdfs://hacluster/user/**\ **\ **/.Trash/Current/**. + + **hdfs dfs -ls** **hdfs://hacluster/user/**\ **\ **/.Trash/Current/** + + For example, if **user/hive/warehouse** exists: + + .. code-block:: + + host01:/opt/client # hdfs dfs -ls hdfs://hacluster/user/test/.Trash/Current/ + Found 1 items + drwx------ - test hadoop 0 2019-06-17 19:53 hdfs://hacluster/user/test/.Trash/Current/user + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +3. .. _alm-16045__li260541320316: + + By default, there is an automatic recovery mechanism for the data warehouse. You can wait for 5 ~10s to check whether the default data warehouse is restored. If the data warehouse is not recovered, manually run the following command to restore the data warehouse. + + **hdfs dfs -mv hdfs://hacluster/user/**\ **\ **/.Trash/Current/user/hive/warehouse /user/hive/warehouse** + +4. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Collect fault information**. + +5. .. _alm-16045__li185241657121312: + + Collect related information in the **.Trash/Current/** directory on the client background. + +6. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16046_hive_data_warehouse_permission_is_modified.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16046_hive_data_warehouse_permission_is_modified.rst new file mode 100644 index 0000000..3718420 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16046_hive_data_warehouse_permission_is_modified.rst @@ -0,0 +1,85 @@ +:original_name: ALM-16046.html + +.. _ALM-16046: + +ALM-16046 Hive Data Warehouse Permission Is Modified +==================================================== + +Description +----------- + +The system checks the Hive data warehouse permission in every 60 seconds. 
This alarm is generated if the permission is modified. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +16046 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the permission on the Hive default data warehouse is modified, the permission for users or user groups to create databases or tables in the default data warehouse is changed. + +Possible Causes +--------------- + +Hive periodically checks the status of the default data warehouse and finds that default data warehouse permission is changed. + +Procedure +--------- + +**Check the Hive default data warehouse permission.** + +#. Log in to the node where the client is located as user **root**. + +2. Run the following command to go to the HDFS client installation directory: + + **cd** *Client installation directory* + + **source bigdata_env** + + **kinit** *User who has the supergroup permission* (Skip this step for a common cluster.) + +3. Run the following command to restore the default data warehouse permission: + + - Security mode: **hdfs dfs -chmod 770 hdfs://hacluster/user/hive/warehouse** + - Non-security mode: **hdfs dfs -chmod 777 hdfs://hacluster/user/hive/warehouse** + +4. Check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`5 `. + +**Collect fault information**. + +5. .. _alm-16046__li14150557160: + + Collect related information in the **hdfs://hacluster/user/hive/warehouse** directory on the client background. + +6. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16047_hiveserver_has_been_deregistered_from_zookeeper.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16047_hiveserver_has_been_deregistered_from_zookeeper.rst new file mode 100644 index 0000000..4ed3d86 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16047_hiveserver_has_been_deregistered_from_zookeeper.rst @@ -0,0 +1,79 @@ +:original_name: ALM-16047.html + +.. _ALM-16047: + +ALM-16047 HiveServer Has Been Deregistered from ZooKeeper +========================================================= + +Description +----------- + +The system checks the Hive service every 60 seconds. This alarm is generated when Hive registration information on ZooKeeper is lost or Hive cannot connect to ZooKeeper. 
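+
+.. note::
+
+   To see whether the registration information mentioned above is still present, you can list the HiveServer discovery znode in ZooKeeper. The sketch below is only an illustration: the znode name (**/hiveserver2**) and the ZooKeeper address and client port are assumptions and must be replaced with the values used in your cluster; in a security-mode cluster, authenticate before running the client.
+
+   .. code-block::
+
+      # Hedged sketch: list the HiveServer registration znode.
+      # An empty or missing node is consistent with this alarm.
+      echo "ls /hiveserver2" | zkCli.sh -server <ZooKeeper service IP address>:<client port>
+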
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +16047 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the Hive configuration cannot be read from ZooKeeper, HiveServer will be unavailable. + +Possible Causes +--------------- + +- The network is disconnected. +- The ZooKeeper instance is abnormal. + +Procedure +--------- + +**Restart related instances.** + +#. Log in to FusionInsight Manager. Choose **O&M** > **Alarm** > **Alarms**, click the drop-down list in the row that contains the alarm, and view role and the IP address of the node for which the alarm is generated in **Location**. +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Instance**, select the instance at the IP address for which the alarm is generated, and choose **More** > **Restart Instance**. +#. Wait for 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Collect the fault information.** + +4. .. _alm-16047__li57092876161840: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +5. Expand the **Service** drop-down list, and select **Hive** for the target cluster. + +6. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +7. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895811.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16048_tez_or_spark_library_path_does_not_exist.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16048_tez_or_spark_library_path_does_not_exist.rst new file mode 100644 index 0000000..5a34e0c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-16048_tez_or_spark_library_path_does_not_exist.rst @@ -0,0 +1,93 @@ +:original_name: ALM-16048.html + +.. _ALM-16048: + +ALM-16048 Tez or Spark Library Path Does Not Exist +================================================== + +Description +----------- + +The system checks the Tez and Spark library paths every 180 seconds. This alarm is generated when the Tez or Spark library path does not exist. 
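+
+.. note::
+
+   The condition described above can be confirmed from a client by testing whether the library paths exist in HDFS. The sketch below is illustrative: the version segment (**8.1.0.1**) follows the example used later in this topic and may differ in your environment; authenticate first in a security-mode cluster.
+
+   .. code-block::
+
+      # Hedged sketch: check whether the Tez and Spark library directories exist.
+      for dir in /user/hive/tezlib/8.1.0.1 /user/hive/sparklib/8.1.0.1; do
+          hdfs dfs -test -d "hdfs://hacluster${dir}" && echo "${dir} present" || echo "${dir} missing"
+      done
+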
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +16048 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The Hive on Tez and Hive on Spark functions are affected. + +Possible Causes +--------------- + +The Tez or Spark library path is deleted from the HDFS. + +Procedure +--------- + +**Check the default Hive data warehouse.** + +#. Log in to the node where the client is located as user **root**. + +#. Run the following command to check whether the **tezlib** or **sparklib** directory exists in the **hdfs://hacluster/user/{User name}/.Trash/Current/** director: + + **hdfs dfs -ls hdfs://hacluster/user/**\ **\ **/.Trash/Current/** + + For example, the following information shows that **/user/hive/tezlib/8.1.0.1/** and **/user/hive/sparklib/8.1.0.1/** exist. + + .. code-block:: + + host01:/opt/client # hdfs dfs -ls hdfs://hacluster/user/test/.Trash/Current/ + Found 1 items + drwx------ - test hadoop 0 2019-06-17 19:53 hdfs://hacluster/user/test/.Trash/Current/user + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-16048__li18824736175210: + + Run the following command to restore **tezlib** and **sparklib**. + + **hdfs dfs -mv hdfs://hacluster/user/**\ **\ **/.Trash/Current/user/hive/tezlib/8.1.0.1/tez.tar.gz /user/hive/tezlib/8.1.0.1/tez.tar.gz** + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + + **Collect fault information**. + +#. .. _alm-16048__li1182513369521: + + Collect related information in the **.Trash/Current/** directory on the client background. + +#. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17003_oozie_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17003_oozie_service_unavailable.rst new file mode 100644 index 0000000..4c00006 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17003_oozie_service_unavailable.rst @@ -0,0 +1,223 @@ +:original_name: ALM-17003.html + +.. _ALM-17003: + +ALM-17003 Oozie Service Unavailable +=================================== + +Description +----------- + +The system checks the Oozie service status in every 5 seconds. This alarm is generated when Oozie or a component on which Oozie depends cannot provide services properly. + +This alarm is automatically cleared when the Oozie service recovers. 
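+
+.. note::
+
+   The procedure below reads the Oozie health status code from the service health URL. If you prefer to query it from a shell, the sketch below shows one way to do so; the URL is the example address used later in this topic and must be replaced with the address shown on your Oozie web UI, and the **-k** option (which skips certificate verification) is used here only for illustration.
+
+   .. code-block::
+
+      # Hedged sketch: fetch the Oozie health status code from the service health URL.
+      curl -sk "https://10.10.0.117:20026/Oozie/oozie/130/oozie/servicehealth" | grep -o '"statusCode":[0-9]*'
+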
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +17003 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Oozie cannot be used to submit jobs. + +Possible Causes +--------------- + +- The DBService service is abnormal or the data of Oozie stored in DBService is damaged. +- The HDFS service is abnormal or the data of Oozie stored in HDFS is damaged. +- The Yarn service is abnormal. +- The Nodeagent process is abnormal. + +Procedure +--------- + +**Query the Oozie service health status code.** + +#. On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* >\ **Services** > **Oozie**. Click **oozie** (any one is OK) on the **oozie WebUI**. to go to the Oozie WebUI. + + .. note:: + + By default, the **admin** user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component. + +#. Add **/servicehealth** to the URL in the address box of the browser and access again. The value of **statusCode** is the current Oozie service health status code. + + For example, visit **https://10.10.0.117:20026/Oozie/oozie/130/oozie/servicehealth**. The result is as follows: + + .. code-block:: + + {"beans":[{"name":"serviceStatus","statusCode":0}]} + + If the health status code cannot be displayed or the browser does not respond, the service may be unavailable due to Oozie process fault. See :ref:`13 ` to rectify the fault. + +#. Perform the operations based on the error code. For details, see :ref:`Table 1 `. + + .. _alm-17003__table1418843217821: + + .. table:: **Table 1** Oozie service health status code + + +-------------+------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------+ + | Status Code | Description | Error Cause | Solution | + +=============+====================================+=================================================================================+=============================================+ + | 0 | The service is running properly. | None | None | + +-------------+------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------+ + | 18002 | The DBService service is abnormal. | Oozie fails to connect to DBService or the data stored in DBService is damaged. | See :ref:`4 `. | + +-------------+------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------+ + | 18003 | The HDFS service is abnormal. | Oozie fails to connect to HDFS or the data stored in HDFS is damaged. 
| See :ref:`7 `. | + +-------------+------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------+ + | 18005 | The MapReduce service is abnormal. | The Yarn service is abnormal. | See :ref:`11 `. | + +-------------+------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------+ + +**Check the DBService service.** + +4. .. _alm-17003__li5899993317821: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services**, and check whether the DBService service is running properly. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`5 `. + +5. .. _alm-17003__li6459530417821: + + Resolve the problem of DBService based on the alarm help and check whether the Oozie alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`18 `. + +6. .. _alm-17003__li2491190317821: + + Log in to the Oozie database to check whether the data is complete. + + a. Log in to the active DBService node as user **root**. + + On the FusionInsight Manager page, choose **Cluster** > *Name of the desired cluster* > **Services** > **DBService > Instance** to view the IP address of the active DBservice node. + + b. Run the following command to log in to the Oozie database: + + **su - omm** + + **source ${BIGDATA_HOME}/FusionInsight_BASE\_8.1.0.1/install/FusionInsight-dbservice-2.7.0/.dbservice_profile** + + **gsql -U** *Username* **-W** *Oozie database password* **-p 20051 -d** *Database name* + + c. After the login is successful, enter **\\d** to check whether there are 15 data tables. + + The Oozie service has 15 data tables by default. If these data tables are deleted or the table structure is modified, the Oozie service may be unavailable. Contact the O&M personnel to back up the data and perform restoration. + +**Check the HDFS service.** + +7. .. _alm-17003__li6587172717821: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services**, and check whether the HDFS service is running properly. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`8 `. + +8. .. _alm-17003__li2988812617821: + + Resolve the problem of HDFS based on the alarm help and check whether the Oozie alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`18 `. + +9. .. _alm-17003__li940532017821: + + Log in to HDFS to check whether the Oozie file directory structure is complete. + + a. Download and install an HDFS client.. + + b. Log in to the client node as user **root** and run the following commands to check whether **/user/oozie/share** exists. + + If the cluster uses the security mode, perform security authentication. + + **kinit admin** + + **hdfs dfs -ls /user/oozie/share** + + - If yes, go to :ref:`18 `. + - If no, go to :ref:`10 `. + +10. .. _alm-17003__li367846717821: + + In the Oozie client installation directory, manually upload the share directory to **/user/oozie** in HDFS, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`18 `. + +**Check the Yarn and MapReduce service.** + +11. .. _alm-17003__li6500500117821: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* > **Services**, and check whether the Yarn and MapReduce services are running properly. + + - If yes, go to :ref:`18 `. 
+ - If no, go to :ref:`12 `. + +12. .. _alm-17003__li2196836817821: + + Resolve the problem of Yarn and MapReduce based on the alarm help and check whether the Oozie alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`18 `. + +**Check the Oozie process.** + +13. .. _alm-17003__li3460735817821: + + Log in to each node of Oozie as user **root**. + +14. Run the **ps -ef \| grep oozie** command to check whether the Oozie process exists. + + - If yes, go to :ref:`15 `. + - If no, go to :ref:`18 `. + +15. .. _alm-17003__li1524116517821: + + Collect fault information in **prestartDetail.log**, **oozie.log**, and **catalina.out** in the Oozie log directory **/var/log/Bigdata/oozie**. If the alarm is not caused by manual misoperation, go to :ref:`16 `. + +**Check the Nodeagent process.** + +16. .. _alm-17003__li3722887217821: + + Log in to each node of Oozie as user **root**. Run the **ps -ef \| grep nodeagent** command to check whether the Nodeagent process exists. + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`18 `. + +17. .. _alm-17003__li2866055917821: + + Run the **kill -9** *The process ID of nodeagent* command, wait 10 minutes, and check whether alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`18 `. + +18. .. _alm-17003__li3980393617821: + + Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17004_oozie_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17004_oozie_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..d88a40c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17004_oozie_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-17004.html + +.. _ALM-17004: + +ALM-17004 Oozie Heap Memory Usage Exceeds the Threshold +======================================================= + +Description +----------- + +The system checks the heap memory usage of the Oozie service every 60 seconds. The alarm is generated when the heap memory usage of a Metadata instance exceeds the threshold (95% of the maximum memory). The alarm is cleared when the heap memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +17004 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The heap memory overflow may cause a service breakdown. + +Possible Causes +--------------- + +The heap memory of the Oozie instance is overused or the heap memory is inappropriately allocated. + +Procedure +--------- + +**Check heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm** **> Alarms** > **Oozie Heap Memory Usage Exceeds the Threshold** > **Location**. Check the IP address of the instance involved in this alarm. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* > **Services** > **Oozie** > **Instance**. Click the instance for which the alarm is generated to go to the page for the instance. Click the drop-down menu in the chart area and choose **Customize** > **Memory** > **Oozie Heap Memory Resource Percentage**. Click **OK**. + +#. Check whether the used heap memory of Oozie reaches the threshold (the default value is 95% of the maximum heap memory) specified for Oozie. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-17004__en-us_topic_0070543677_d0e31653: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* > **Services** > **Oozie** > **Configurations > All Configuration\ s**. Set Search **GC_OPTS** in the search box. Increase the value of **-Xmx** as required, and click **Save** > **OK**. + + .. note:: + + Suggestions on GC parameter settings for Oozie: + + You are advised to set **-Xms** and **-Xmx** to the same value to prevent adverse impact on performance when JVM dynamically adjusts the heap memory size. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-17004__en-us_topic_0070543677_d0e31704: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +7. Select **Oozie** in the required cluster from the **Service**. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0270938821.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17005_oozie_non_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17005_oozie_non_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..2201df4 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17005_oozie_non_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,102 @@ +:original_name: ALM-17005.html + +.. _ALM-17005: + +ALM-17005 Oozie Non Heap Memory Usage Exceeds the Threshold +=========================================================== + +Description +----------- + +The system checks the non heap memory usage of Oozie every 30 seconds. This alarm is reported if the non heap memory usage of Oozie exceeds the threshold (80%). This alarm is cleared if the non heap memory usage is lower than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +17005 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Non-heap memory overflow may cause service breakdown. + +Possible Causes +--------------- + +The non-heap memory of the Oozie instance is overused or the non-heap memory is inappropriately allocated. + +Procedure +--------- + +**Check non-heap memory usage.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **Oozie Non Heap Memory Usage Exceeds the Threshold**. On the displayed page, check the location information of the alarm. Check the name of the instance host for which the alarm is generated. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the target cluster* > **Services** > **Oozie** and click the **Instance** tab. On the displayed page, select the role corresponding to the host name for which the alarm is generated and select **Customize** from the drop-down list in the upper right corner of the chart area. Choose **Memory** and select **Oozie Non Heap Memory Resource Percentage**. Click **OK**. + +#. Check whether the non-heap memory used by Oozie reaches the threshold (80% of the maximum non-heap memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. 
_alm-17005__l3672051debd1416aa3b54541a7a480cb: + + On FusionInsight Manager, choose **Cluster** > *Name of the target cluster* > **Services** > **Oozie** and click the **Configurations** and then **All Configurations**. On the displayed page, search for the **GC_OPTS** parameter in the search box and check whether it contains **-XX: MaxMetaspaceSize**. If yes, increase the value of **-XX: MaxMetaspaceSize** based on the site requirements. If no, manually add **-XX: MaxMetaspaceSize** and set its value to 1/8 of the value of **-Xmx**. Click **Save**, and then click **OK** + + .. note:: + + JDK1.8 does not support the **MaxPermSize** parameter. + + Suggestions on GC parameter settings for Oozie: + + Set the value of **-XX:MaxMetaspaceSize** to 1/8 of the value of **-Xmx**. For example, if **-Xmx** is set to 2 GB, **-XX:MaxMetaspaceSize** is set to 256 MB. If **-Xmx** is set to 4 GB, **-XX:MaxMetaspaceSize** is set to 512 MB. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-17005__d0e31729: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Oozie** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895663.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17006_oozie_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17006_oozie_direct_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..c173f7d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17006_oozie_direct_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-17006.html + +.. _ALM-17006: + +ALM-17006 Oozie Direct Memory Usage Exceeds the Threshold +========================================================= + +Description +----------- + +The system checks the direct memory usage of the Oozie service every 30 seconds. The alarm is generated when the direct memory usage of an Oozie instance exceeds the threshold (80% of the maximum memory). The alarm is cleared when the direct memory usage of Oozie is less than or equal to the threshold. 
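+
+.. note::
+
+   The procedure below sizes **-XX:MaxDirectMemorySize** at about one quarter of **-Xmx**. The fragment below is only a worked example of that ratio, not the full option list used by Oozie; your **GC_OPTS** value will normally contain additional flags that must be kept unchanged.
+
+   .. code-block::
+
+      # Hedged example: -Xmx of 4 GB with a direct memory limit of one quarter of that value.
+      GC_OPTS="-Xms4G -Xmx4G -XX:MaxDirectMemorySize=1024M"
+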
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +17006 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The direct memory overflow may cause a service breakdown. + +Possible Causes +--------------- + +The direct memory of the Oozie instance is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +**Check direct memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms** > **Oozie Direct Memory Usage Exceeds the Threshold** > **Location**. Check the IP address of the instance involved in this alarm. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* > **Services** > **Oozie** > **Instance**. Click the instance for which the alarm is generated to go to the page for the instance. Click the drop-down menu in the chart area and choose **Customize** > **Memory** > **Oozie Direct Buffer Resource Percentage**. Click **OK**. + +#. Check whether the used direct memory of Oozie reaches the threshold (the default value is 80% of the maximum direct memory) specified for Oozie. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-17006__la167795d52ba4dad8c39dd9d171ff45a: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* > **Services** > **Oozie** > **Configurations**. Click **All** **Configurations**. Search **GC_OPTS** in the search box. Increase the value of **-XX:MaxDirectMemorySize** as required, and click **Save**. Click **OK**. + + .. note:: + + Suggestions on GC parameter settings for Oozie: + + You are advised to set the value of **-XX:MaxDirectMemorySize** to 1/4 of the value of **-Xmx**. For example, if **-Xmx** is set to 4 GB, **-XX:MaxDirectMemorySize** is set to 1024 MB. If **-Xmx** is set to 2 GB, **-XX:MaxDirectMemorySize** is set to 512 MB. 
It is recommended that the value of **-XX:MaxDirectMemorySize** be greater than or equal to 512 MB. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-17006__en-us_topic_0070543679_d0e32308: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +7. Select **Oozie** in the required cluster from the **Service** drop-down list. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417388.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17007_garbage_collection_gc_time_of_the_oozie_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17007_garbage_collection_gc_time_of_the_oozie_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..ff262e8 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-17007_garbage_collection_gc_time_of_the_oozie_process_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-17007.html + +.. _ALM-17007: + +ALM-17007 Garbage Collection (GC) Time of the Oozie Process Exceeds the Threshold +================================================================================= + +Description +----------- + +The system checks GC time of the Oozie process every 60 seconds. The alarm is generated when GC time of the Oozie process exceeds the threshold (default value: **12 seconds**). The alarm is cleared when GC time is less than the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +17007 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Oozie responds slowly when it is used to submit tasks. + +Possible Causes +--------------- + +The heap memory of the Oozie instance is overused or the heap memory is inappropriately allocated. + +Procedure +--------- + +**Check GC time.** + +#. On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms** > **Garbage Collection (GC) Time of the Oozie Process Exceeds the Threshold** > **Location**. Check the IP address of the instance involved in this alarm. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* > **Services** > **Oozie** > **Instance**. Click the instance for which the alarm is generated to go to the page for the instance. Click the drop-down menu in the chart area and choose **Customize** > **GC** > **Garbage Collection (GC) Time of Oozie**. Click **OK**. + +#. Check whether GC time of the Oozie process every second exceeds the threshold (default value: **12 seconds**). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-17007__l054b8501eb0f42fc8d5e96d6a497ec94: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* > **Services** > **Oozie** > **Configurations**. Click **All Configurations**. Search **GC_OPTS** in the search box. Increase the value of **-Xmx** as required, and click **Save**. Click **OK**. + + .. note:: + + Suggestions on GC parameter settings for Oozie: + + You are advised to set **-Xms** and **-Xmx** to the same value to prevent adverse impact on performance when JVM dynamically adjusts the heap memory size. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-17007__en-us_topic_0070543680_d0e32615: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +7. Select **Oozie** in the required cluster from the **Service** drop-down list. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269417389.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18000_yarn_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18000_yarn_service_unavailable.rst new file mode 100644 index 0000000..cd498c8 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18000_yarn_service_unavailable.rst @@ -0,0 +1,133 @@ +:original_name: ALM-18000.html + +.. _ALM-18000: + +ALM-18000 Yarn Service Unavailable +================================== + +Description +----------- + +This alarm is generated when the Yarn service is unavailable. The alarm module checks the Yarn service status every 60 seconds. + +The alarm is cleared when the Yarn service recovers. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18000 Critical Yes +======== ============== ===================== + +Parameters +---------- + +========== ======================================================= +Name Meaning +========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceNam Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +========== ======================================================= + +Impact on the System +-------------------- + +The cluster cannot provide Yarn services. Users cannot run new applications. Submitted applications cannot be run. + +Possible Causes +--------------- + +- The ZooKeeper service is abnormal. +- The HDFS service is abnormal. +- There is no active ResourceManager instance in the Yarn cluster. +- All the NodeManagers in the Yarn cluster are abnormal. + +Procedure +--------- + +**Check ZooKeeper service status.** + +#. On the FusionInsight Manager, check whether the alarm list contains **ALM-13000 ZooKeeper Service Unavailable**. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`3 `. + +#. .. _alm-18000__li311182174725: + + Rectify the fault by following the steps provided in **ALM-13000 ZooKeeper Service Unavailable**, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +**Check the HDFS service status.** + +3. .. _alm-18000__li19148237174725: + + On the FusionInsight Manager, check whether the alarm list contains the HDFS alarms. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +4. .. _alm-18000__li13219687174725: + + Choose **O&M > Alarm > Alarms**, handle HDFS alarms based on the alarm help, and check whether the Yarn alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check the ResourceManager status in the Yarn cluster.** + +5. .. _alm-18000__li40584762174725: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn**. + +6. In **Dashboard**, check whether there is an active ResourceManager instance in the Yarn cluster. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`10 `. + +**Check the NodeManager node status in the Yarn cluster.** + +7. .. 
_alm-18000__li7454663174725: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Instance**. + +8. Query NodeManager **Running Status**, and check whether there are unhealthy nodes. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`10 `. + +9. .. _alm-18000__li25011012174725: + + Rectify the fault by following the steps provided in **ALM-18002 NodeManager Heartbeat Lost** or **ALM-18003 NodeManager Unhealthy**. After the fault is rectified, check whether the Yarn alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`10 `. + +**Collect fault information.** + +10. .. _alm-18000__li46526163174725: + + On the FusionInsight Manager portal of the active cluster, choose **O&M** > **Log > Download**. + +11. Select **Yarn** in the required cluster from the **Service**. + +12. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +13. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417390.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18002_nodemanager_heartbeat_lost.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18002_nodemanager_heartbeat_lost.rst new file mode 100644 index 0000000..d60d6b5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18002_nodemanager_heartbeat_lost.rst @@ -0,0 +1,163 @@ +:original_name: ALM-18002.html + +.. _ALM-18002: + +ALM-18002 NodeManager Heartbeat Lost +==================================== + +Description +----------- + +The system checks the number of lost NodeManager nodes every 30 seconds, and compares the number with the threshold. The Number of Lost Nodes indicator has a default threshold. The alarm is generated when the value of Number of Lost Nodes exceeds the threshold. + +To change the threshold, on FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn**. On the displayed page, choose **Configurations** > **All Configurations**, and change the value of **yarn.nodemanager.lost.alarm.threshold**. You do not need to restart Yarn to make the change take effect. + +The default threshold is 0. The alarm is generated when the number of lost nodes exceeds the threshold, and is cleared when the number of lost nodes is less than the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18002 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. 
+Lost Host Specifies the list of hosts with lost nodes. +=========== ======================================================= + +Impact on the System +-------------------- + +- The lost NodeManager node cannot provide the Yarn service. +- The number of containers decreases, so the cluster performance deteriorates. + +Possible Causes +--------------- + +- NodeManager is forcibly deleted without decommission. +- All the NodeManager instances are stopped or the NodeManager process is faulty. +- The host where the NodeManager node resides is faulty. +- The network between the NodeManager and ResourceManager is disconnected or busy. + +Procedure +--------- + +**Check the NodeManager status.** + +#. .. _alm-18002__li45221725183746: + + On the FusionInsight Manager, and choose **O&M** > **Alarm > Alarms**. Click |image1| before the alarm and obtain lost nodes in **Additional Information**. + +#. Check whether the lost nodes are hosts that have been manually deleted without decommission. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-18002__li10848276183746: + + After the setting, Choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn**. On the displayed page, choose **Configurations** > **All Configurations**. Search for **yarn.nodemanager.lost.alarm.threshold** and change its value to the number of hosts that are not out of service and proactively deleted. After the setting, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm-18002__li30525620183746: + + Manually clear the alarm. Note that decommission must be performed before deleting hosts. + +#. .. _alm-18002__li58395585183746: + + On the FusionInsight Manager portal, choose **Cluster > Hosts**, and check whether the nodes obtained in :ref:`1 ` are healthy. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`6 `. + +#. .. _alm-18002__li3142189183746: + + Rectify the node fault based on **ALM-12006 Node Fault** and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Check the process status.** + +7. .. _alm-18002__li26422237183746: + + On the FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* **> Services** > **Yarn** > **Instance**, and check whether there are NodeManager instances whose status is not **Good**. + + - If yes, go to :ref:`10 `. + - If no, go to :ref:`8 `. + +8. .. _alm-18002__li1508966183746: + + Check whether the NodeManager instance is deleted. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`11 `. + +9. .. _alm-18002__li42859132183746: + + Restart the active and standby ResourceManager instances, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`13 `. + +**Check the instance status.** + +10. .. _alm-18002__li10762250183746: + + Select NodeManager instances which running state is not **Normal** and restart them. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Check the network status.** + +11. .. _alm-18002__li13800982183746: + + Log in to the management node, **ping** the IP address of the lost NodeManager node to check whether the network is disconnected or busy. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`13 `. + +12. .. _alm-18002__li13119611183746: + + Rectify the network, and check whether the alarm is cleared. + + - If yes, no further action is required. 
+ - If no, go to :ref:`13 `. + +**Collect fault information.** + +13. .. _alm-18002__li34738119183746: + + On the FusionInsight Manager in the active cluster, choose **O&M** > **Log > Download**. + +14. Select **Yarn** in the required cluster from the **Service**. + +15. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +16. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417391.png +.. |image2| image:: /_static/images/en-us_image_0269417392.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18003_nodemanager_unhealthy.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18003_nodemanager_unhealthy.rst new file mode 100644 index 0000000..556415e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18003_nodemanager_unhealthy.rst @@ -0,0 +1,135 @@ +:original_name: ALM-18003.html + +.. _ALM-18003: + +ALM-18003 NodeManager Unhealthy +=============================== + +Description +----------- + +The system checks the number of unhealthy NodeManager nodes every 30 seconds, and compares the number with the threshold. The Unhealthy Nodes indicator has a default threshold. This alarm is generated when the value of the Unhealthy Nodes indicator exceeds the threshold. + +To change the threshold, on FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn**. On the displayed page, choose **Configurations** > **All Configurations**, and change the value of **yarn.nodemanager.unhealthy.alarm.threshold**. You do not need to restart Yarn to make the change take effect. + +The default threshold is 0. The alarm is generated when the number of unhealthy nodes exceeds the threshold, and is cleared when the number of unhealthy nodes is less than the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18003 Major Yes +======== ============== ===================== + +Parameters +---------- + +============== ======================================================= +Name Meaning +============== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +Unhealthy Host Specifies the list of hosts with unhealthy nodes. +============== ======================================================= + +Impact on the System +-------------------- + +- The faulty NodeManager node cannot provide the Yarn service. +- The number of containers decreases, so the cluster performance deteriorates. + +Possible Causes +--------------- + +- The hard disk space of the host where the NodeManager node resides is insufficient. +- User **omm** does not have the permission to access a local directory on the NodeManager node. 
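+
+.. note::
+
+   Both causes above can be checked directly on the affected node. The sketch below illustrates the disk and permission checks that the procedure describes; the directory path is a placeholder and must be replaced with the actual values of **yarn.nodemanager.local-dirs** and **yarn.nodemanager.log-dirs** on that node.
+
+   .. code-block::
+
+      # Hedged sketch: inspect and, if necessary, correct one NodeManager local directory.
+      DIR=<NodeManager localdir or containerlogs directory>   # placeholder
+      df -h "${DIR}"                  # confirm free space on the file system holding it
+      ls -ld "${DIR}"                 # expect mode 755 and owner omm:ficommon
+      chmod 755 "${DIR}"
+      chown omm:ficommon "${DIR}"
+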
+
+Procedure
+---------
+
+**Check the hard disk space of the host.**
+
+#. On FusionInsight Manager, choose **O&M** > **Alarm > Alarms**. Click |image1| before the alarm and obtain the unhealthy nodes from **Additional Information**.
+
+#. .. _alm-18003__li11711432184458:
+
+ Choose **Cluster** > *Name of the desired cluster* **> Services > Yarn** > **Instance**, select the NodeManager instance corresponding to the host, choose **Instance Configurations** > **All Configurations**, and view the disks corresponding to **yarn.nodemanager.local-dirs** and **yarn.nodemanager.log-dirs**.
+
+#. Choose **O&M > Alarm** **> Alarms**. In the alarm list, check whether the related disk has the alarm **ALM-12017 Insufficient Disk Capacity**.
+
+ - If yes, go to :ref:`4 `.
+ - If no, go to :ref:`5 `.
+
+#. .. _alm-18003__li32014413184458:
+
+ Rectify the disk fault based on **ALM-12017 Insufficient Disk Capacity** and check whether the alarm is cleared.
+
+ - If yes, no further action is required.
+ - If no, go to :ref:`7 `.
+
+#. .. _alm-18003__li64222725184458:
+
+ Choose **Hosts** > *Name of the desired host*. On the **Dashboard** page, check the disk usage of the corresponding partition. Check whether the percentage of the used space of the mounted disk exceeds the value of **yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage**.
+
+ - If yes, go to :ref:`6 `.
+ - If no, go to :ref:`7 `.
+
+#. .. _alm-18003__li54043174184458:
+
+ Reduce the disk usage to less than the value of **yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage**, wait for 10 to 20 minutes, and check whether the alarm is cleared.
+
+ - If yes, no further action is required.
+ - If no, go to :ref:`7 `.
+
+**Check the access permission of the local directory on each NodeManager node.**
+
+7. .. _alm-18003__li4571214184458:
+
+ Obtain the NodeManager directory viewed in :ref:`2 `, log in to each NodeManager node as user **root**, and go to the obtained directory.
+
+8. Run the **ll** command to check whether the permission of the **localdir** and **containerlogs** folders is **755** and whether **User:Group** is **omm:ficommon**.
+
+ - If yes, no further action is required.
+ - If no, go to :ref:`9 `.
+
+9. .. _alm-18003__li55292474184458:
+
+ Run the following commands to set the permission to **755** and **User:Group** to **omm:ficommon**:
+
+ **chmod 755** **
+
+ **chown omm:ficommon** **
+
+10. Wait for 10 to 20 minutes and check whether the alarm is cleared.
+
+ - If yes, no further action is required.
+ - If no, go to :ref:`11 `.
+
+**Collect fault information.**
+
+11. .. _alm-18003__li64787675184458:
+
+ On the FusionInsight Manager in the active cluster, choose **O&M** > **Log > Download**.
+
+12. Select **Yarn** in the required cluster from the **Service**.
+
+13. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**.
+
+14. Contact the O&M personnel and send the collected logs.
+
+Alarm Clearing
+--------------
+
+After the fault is rectified, the system automatically clears this alarm.
+
+Related Information
+-------------------
+
+None
+
+.. |image1| image:: /_static/images/en-us_image_0269417393.png
+..
|image2| image:: /_static/images/en-us_image_0269417394.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18008_heap_memory_usage_of_resourcemanager_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18008_heap_memory_usage_of_resourcemanager_exceeds_the_threshold.rst new file mode 100644 index 0000000..9827a42 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18008_heap_memory_usage_of_resourcemanager_exceeds_the_threshold.rst @@ -0,0 +1,114 @@ +:original_name: ALM-18008.html + +.. _ALM-18008: + +ALM-18008 Heap Memory Usage of ResourceManager Exceeds the Threshold +==================================================================== + +Description +----------- + +The system checks the heap memory usage of Yarn ResourceManager every 30 seconds and compares the actual usage with the threshold. The alarm is generated when the heap memory usage of Yarn ResourceManager exceeds the threshold (95% of the maximum memory by default). + +Users can choose **O&M > Alarm >** **Thresholds** > *Name of the desired cluster* > **Yarn** to change the threshold. + +When the **Trigger Count** is 1, this alarm is cleared when the heap memory usage of Yarn ResourceManager is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the heap memory usage of Yarn ResourceManager is less than or equal to 95% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18008 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When the heap memory usage of Yarn ResourceManager is overhigh, the performance of Yarn task submission and operation is affected. In addition, a memory overflow may occur so that the Yarn service is unavailable. + +Possible Causes +--------------- + +The heap memory of the Yarn ResourceManager instance on the node is overused or the heap memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +**Check the heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** > **Heap Memory Usage of Yarn ResourceManager Exceeds the Threshold** > **Location**. Check the HostName of the instance for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* **> Services** > **Yarn** > **Instance** > **ResourceManager** (Indicates the host name of the instance for which the alarm is generated)\ **.** Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **ResourceManager >** **Percentage of Used Memory of the ResourceManager**. Check the heap memory usage. + +#. Check whether the used heap memory of ResourceManager reaches 95% of the maximum heap memory specified for ResourceManager. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-18008__li36040348185157: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* **> Services** > **Yarn** > **Configurations** > **All** **Configurations** > **ResourceManager** > **System**. Increase the value of **GC_OPTS** parameter as required, click **Save**. Restart the role instance. + + .. 
note:: + + The mapping between the number of NodeManager instances in a cluster and the memory size of ResourceManager is as follows: + + - If the number of NodeManager instances in the cluster reaches 100, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms4G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 200, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 500, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms10G -Xmx10G -XX:NewSize=1G -XX:MaxNewSize=2G + - If the number of NodeManager instances in the cluster reaches 1000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms20G -Xmx20G -XX:NewSize=1G -XX:MaxNewSize=2G + - If the number of NodeManager instances in the cluster reaches 2000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms40G -Xmx40G -XX:NewSize=2G -XX:MaxNewSize=4G + - If the number of NodeManager instances in the cluster reaches 3000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms60G -Xmx60G -XX:NewSize=2G -XX:MaxNewSize=4G + - If the number of NodeManager instances in the cluster reaches 4000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms80G -Xmx80G -XX:NewSize=2G -XX:MaxNewSize=4G + - If the number of NodeManager instances in the cluster reaches 5000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms100G -Xmx100G -XX:NewSize=3G -XX:MaxNewSize=6G + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-18008__li63055176185157: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +7. Select the following node in the required cluster from the **Service**. + + - NodeAgent + - Yarn + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417395.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18009_heap_memory_usage_of_jobhistoryserver_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18009_heap_memory_usage_of_jobhistoryserver_exceeds_the_threshold.rst new file mode 100644 index 0000000..c02f1d2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18009_heap_memory_usage_of_jobhistoryserver_exceeds_the_threshold.rst @@ -0,0 +1,107 @@ +:original_name: ALM-18009.html + +.. 
_ALM-18009: + +ALM-18009 Heap Memory Usage of JobHistoryServer Exceeds the Threshold +===================================================================== + +Description +----------- + +The system checks the heap memory usage of Mapreduce JobHistoryServer every 30 seconds and compares the actual usage with the threshold. The alarm is generated when the heap memory usage of Mapreduce JobHistoryServer exceeds the threshold (95% of the maximum memory by default). + +Users can choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Mapreduce** to change the threshold. + +When the **Trigger Count** is 1, this alarm is cleared when the heap memory usage of MapReduce JobHistoryServer is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the heap memory usage of MapReduce JobHistoryServer is less than or equal to 95% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18009 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When the heap memory usage of Mapreduce JobHistoryServer is overhigh, the performance of Mapreduce log archiving is affected. In addition, a memory overflow may occur so that the Yarn service is unavailable. + +Possible Causes +--------------- + +The heap memory of the Mapreduce JobHistoryServer instance on the node is overused or the heap memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +**Check the memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** > **ALM-18009 Heap Memory Usage of MapReduce JobHistoryServer Exceeds the Threshold** > **Location**. Check the HostName of the instance for which the alarm is generated. + +#. 
On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* **> Services** > **Mapreduce** > **Instance** > **JobHistoryServer.** Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **JobHistoryServer heap memory usage statistics**. JobHistoryServer indicates the corresponding HostName of the instance for which the alarm is generated. Check the heap memory usage. + +#. Check whether the used heap memory of JobHistoryServer reaches 95% of the maximum heap memory specified for JobHistoryServer. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-18009__li4972156185337: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* **> Services** > **Mapreduce** > **Configurations** > **All** **Configurations** > **JobHistoryServer** > **System**. Increase the value of **GC_OPTS** parameter as required, click **Save**. Click **OK** and restart the role instance. + + .. note:: + + The mapping between the number of historical tasks (10000) and the memory of JobHistoryServer is as follows: + + -Xms30G -Xmx30G -XX:NewSize=1G -XX:MaxNewSize=2G + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-18009__li30977950185337: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +7. Select the following node in the required cluster from the **Service**. + + - NodeAgent + - Mapreduce + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417396.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18010_resourcemanager_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18010_resourcemanager_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..a81bcf3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18010_resourcemanager_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,111 @@ +:original_name: ALM-18010.html + +.. _ALM-18010: + +ALM-18010 ResourceManager GC Time Exceeds the Threshold +======================================================= + +Description +----------- + +The system checks the garbage collection (GC) duration of the ResourceManager process every 60 seconds. This alarm is generated when the GC duration exceeds the threshold (12 seconds by default). + +This alarm is cleared when the GC duration is less than the threshold. 
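+
+To observe GC behavior directly on the node where the alarmed ResourceManager instance runs (for example, while adjusting **GC_OPTS** as described in the procedure), a manual check such as the following can be used. This is only a sketch and assumes that the JDK tools **jps** and **jstat** are available on the node.
+
+::
+
+   # Find the process ID of the ResourceManager (assumes jps is on the PATH).
+   RM_PID=$(jps | awk '/ResourceManager/ {print $1}')
+
+   # Sample heap occupancy and cumulative GC times (YGCT, FGCT, GCT, in seconds)
+   # every 60 seconds; a GCT that keeps growing by more than the threshold per
+   # sample indicates long GC pauses.
+   jstat -gcutil "$RM_PID" 60000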
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18010 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +A long GC duration of the ResourceManager process may interrupt the services. + +Possible Causes +--------------- + +The heap memory of the ResourceManager instance is overused or the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC duration.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** > **ALM-18010 ResourceManager GC Time Exceeds the Threshold** > **Location** to check the IP address of the instance for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* **> Services** > **Yarn** > **Instance** > **ResourceManager (IP address for which the alarm is generated)**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Garbage Collection (GC) Time of ResourceManager** to check the GC duration statistics of the Broker process collected every minute. + +#. Check whether the GC duration of the ResourceManager process collected every minute exceeds the threshold (12 seconds by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-18010__li52460707185521: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn** > **Configurations** > **All** **Configurations** > **ResourceManager** > **System** to increase the value of **GC_OPTS** parameter as required. + + .. 
note:: + + The mapping between the number of NodeManager instances in a cluster and the memory size of ResourceManager is as follows: + + - If the number of NodeManager instances in the cluster reaches 100, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms4G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 200, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 500, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms10G -Xmx10G -XX:NewSize=1G -XX:MaxNewSize=2G + - If the number of NodeManager instances in the cluster reaches 1000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms20G -Xmx20G -XX:NewSize=1G -XX:MaxNewSize=2G + - If the number of NodeManager instances in the cluster reaches 2000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms40G -Xmx40G -XX:NewSize=2G -XX:MaxNewSize=4G + - If the number of NodeManager instances in the cluster reaches 3000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms60G -Xmx60G -XX:NewSize=2G -XX:MaxNewSize=4G + - If the number of NodeManager instances in the cluster reaches 4000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms80G -Xmx80G -XX:NewSize=2G -XX:MaxNewSize=4G + - If the number of NodeManager instances in the cluster reaches 5000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms100G -Xmx100G -XX:NewSize=3G -XX:MaxNewSize=6G + +#. Save the configuration and restart the ResourceManager instance. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-18010__li2721601185521: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **ResourceManager** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417397.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18011_nodemanager_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18011_nodemanager_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..d3ec4d2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18011_nodemanager_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,106 @@ +:original_name: ALM-18011.html + +.. _ALM-18011: + +ALM-18011 NodeManager GC Time Exceeds the Threshold +=================================================== + +Description +----------- + +The system checks the garbage collection (GC) duration of the NodeManager process every 60 seconds. 
This alarm is generated when the GC duration exceeds the threshold (12 seconds by default). + +This alarm is cleared when the GC duration is less than the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18011 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +A long GC duration of the NodeManager process may interrupt the services. + +Possible Causes +--------------- + +The heap memory of the NodeManager instance is overused or the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC duration.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** > **ALM-18011 NodeManager GC Time Exceeds the Threshold** > **Location** to check the IP address of the instance for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Instance** > **NodeManager (IP address for which the alarm is generated)**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Garbage Collection (GC) Time of NodeManager** to check the GC duration statistics of the Broker process collected every minute. + +#. Check whether the GC duration of the NodeManager process collected every minute exceeds the threshold (12 seconds by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-18011__li47404003185648: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Configurations** > **All** **Configurations** > **NodeManager** > **System** to increase the value of **GC_OPTS** parameter as required. + + .. 
note:: + + The mapping between the number of NodeManager instances in a cluster and the memory size of NodeManager is as follows: + + - If the number of NodeManager instances in the cluster reaches 100, the recommended JVM parameters for NodeManager instances are as follows: -Xms2G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 200, the recommended JVM parameters for NodeManager instances are as follows: -Xms4G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 500, the recommended JVM parameters for NodeManager instances are as follows: -Xms8G -Xmx8G -XX:NewSize=1G -XX:MaxNewSize=2G + +#. Save the configuration and restart the NodeManager instance. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-18011__li14968197185648: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **NodeManager** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417398.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18012_jobhistoryserver_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18012_jobhistoryserver_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..76f8a81 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18012_jobhistoryserver_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,104 @@ +:original_name: ALM-18012.html + +.. _ALM-18012: + +ALM-18012 JobHistoryServer GC Time Exceeds the Threshold +======================================================== + +Description +----------- + +The system checks the garbage collection (GC) duration of the JobHistoryServer process every 60 seconds. This alarm is generated when the GC duration exceeds the threshold (12 seconds by default). + +This alarm is cleared when the GC duration is less than the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18012 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +A long GC duration of the JobHistoryServer process may interrupt the services. + +Possible Causes +--------------- + +The heap memory of the JobHistoryServer instance is overused or the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC duration.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** > **ALM-18012 JobHistoryServer GC Time Exceeds the Threshold** > **Location** to check the IP address of the instance for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **MapReduce** > **Instance** > **JobHistoryServer (IP address for which the alarm is generated)**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Garbage Collection (GC) Time of the JobHistoryServer** to check the GC duration statistics of the Broker process collected every minute. + +#. Check whether the GC duration of the JobHistoryServer process collected every minute exceeds the threshold (12 seconds by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-18012__li29276260185822: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **MapReduce** > **Configurations** > **All** **Configurations** > **JobHistoryServer** > **System** to increase the value of **GC_OPTS** parameter as required. + + .. note:: + + The mapping between the number of historical tasks (10000) and the memory of the JobHistoryServer is as follows: + + -Xms30G -Xmx30G -XX:NewSize=1G -XX:MaxNewSize=2G + +#. Save the configuration and restart the JobHistoryServer instance. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-18012__li36309760185822: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **JobHistoryServer** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269417399.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18013_resourcemanager_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18013_resourcemanager_direct_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..e923802 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18013_resourcemanager_direct_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,114 @@ +:original_name: ALM-18013.html + +.. _ALM-18013: + +ALM-18013 ResourceManager Direct Memory Usage Exceeds the Threshold +=================================================================== + +Description +----------- + +The system checks the direct memory usage of the Yarn service every 30 seconds. This alarm is generated when the direct memory usage of a ResourceManager instance exceeds the threshold (90% of the maximum memory). + +The alarm is cleared when the direct memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18013 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available direct memory of the Yarn service is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct memory of the ResourceManager instance is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +**Check the direct memory usage.** + +#. 
On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** > **ALM-18013 ResourceManager Direct Memory Usage Exceeds the Threshold** > **Location** to check the IP address of the instance for which the alarm is generated.
+
+#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Instance** > **ResourceManager (IP address for which the alarm is generated)**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Memory Usage Status of ResourceManager** to check the direct memory usage.
+
+#. Check whether the used direct memory of ResourceManager reaches 90% of the maximum direct memory specified for ResourceManager by default.
+
+ - If yes, go to :ref:`4 `.
+ - If no, go to :ref:`9 `.
+
+#. .. _alm-18013__li3060052619112:
+
+ On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Configurations** > **All** **Configurations** > **ResourceManager** > **System**, and check whether **-XX:MaxDirectMemorySize** exists in the **GC_OPTS** parameter.
+
+ - If yes, go to :ref:`5 `.
+ - If no, go to :ref:`7 `.
+
+#. .. _alm-18013__li28439618491:
+
+ In the **GC_OPTS** parameter, delete **-XX:MaxDirectMemorySize**.
+
+#. Save the configuration and restart the ResourceManager instance.
+
+#. .. _alm-18013__li558791074715:
+
+ Check whether the **ALM-18008 Heap Memory Usage of ResourceManager Exceeds the Threshold** exists.
+
+ - If yes, handle the alarm by referring to **ALM-18008 Heap Memory Usage of ResourceManager Exceeds the Threshold**.
+ - If no, go to :ref:`8 `.
+
+#. .. _alm-18013__li2441573219112:
+
+ Check whether the alarm is cleared.
+
+ - If yes, no further action is required.
+ - If no, go to :ref:`9 `.
+
+**Collect fault information.**
+
+9. .. _alm-18013__li1521968019112:
+
+ On the FusionInsight Manager portal, choose **O&M** > **Log > Download**.
+
+10. Select **ResourceManager** in the required cluster from the **Service**.
+
+11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**.
+
+12. Contact the O&M personnel and send the collected logs.
+
+Alarm Clearing
+--------------
+
+After the fault is rectified, the system automatically clears this alarm.
+
+Related Information
+-------------------
+
+None
+
+.. |image1| image:: /_static/images/en-us_image_0269417400.png
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18014_nodemanager_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18014_nodemanager_direct_memory_usage_exceeds_the_threshold.rst
new file mode 100644
index 0000000..d5296a9
--- /dev/null
+++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18014_nodemanager_direct_memory_usage_exceeds_the_threshold.rst
@@ -0,0 +1,114 @@
+:original_name: ALM-18014.html
+
+.. _ALM-18014:
+
+ALM-18014 NodeManager Direct Memory Usage Exceeds the Threshold
+===============================================================
+
+Description
+-----------
+
+The system checks the direct memory usage of the Yarn service every 30 seconds.
This alarm is generated when the direct memory usage of a NodeManager instance exceeds the threshold (90% of the maximum memory). + +The alarm is cleared when the direct memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18014 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available direct memory of the Yarn service is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct memory of the NodeManager instance is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +**Check the direct memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M** > **Alarm > Alarms** > **ALM-18014 NodeManager Direct Memory Usage Exceeds the Threshold** > **Location** to check the IP address of the instance for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Instance** > **NodeManager (IP address for which the alarm is generated)**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Resource** > **Percentage of** **Used** **Memory of the NodeManager** to check the direct memory usage. + +#. Check whether the used direct memory of NodeManager reaches 90% of the maximum direct memory specified for NodeManager by default. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`9 `. + +#. .. _alm-18014__li787981191259: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Configurations** > **All** **Configurations**> **NodeManager** > **System** to check whether "-XX:MaxDirectMemorySize" exists in the **GC_OPTS** parameter. + + - If yes, go to :ref:`5 `. 
+ - If no, go to :ref:`7 `. + +#. .. _alm-18014__li66301833195114: + + In the **GC_OPTS** parameter, delete "-XX:MaxDirectMemorySize". + +#. Save the configuration and restart the NodeManager instance. + +#. .. _alm-18014__li735165905117: + + Check whether the **ALM-18018 NodeManager Heap Memory Usage Exceeds the Threshold** exists. + + - If yes, handle the alarm by referring to **ALM-18018 NodeManager Heap Memory Usage Exceeds the Threshold**. + - If no, go to :ref:`8 `. + +#. .. _alm-18014__li56845771191259: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-18014__li34398621191259: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +10. Select **NodeManager** in the required cluster from the **Service**. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417401.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18015_jobhistoryserver_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18015_jobhistoryserver_direct_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..522c286 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18015_jobhistoryserver_direct_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,114 @@ +:original_name: ALM-18015.html + +.. _ALM-18015: + +ALM-18015 JobHistoryServer Direct Memory Usage Exceeds the Threshold +==================================================================== + +Description +----------- + +The system checks the direct memory usage of the MapReduce service every 30 seconds. This alarm is generated when the direct memory usage of a JobHistoryServer instance exceeds the threshold (90% of the maximum memory). + +The alarm is cleared when the direct memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18015 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available direct memory of the MapReduce service is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct memory of the JobHistoryServer instance is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +**Check the direct memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** > **ALM-18015 JobHistoryServer Direct Memory Usage Exceeds the Threshold** > **Location** to check the IP address of the instance for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **MapReduce** > **Instance** > **JobHistoryServer (IP address for which the alarm is generated).** Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Memory Usage Status of JobHistoryServer** to check the direct memory usage. + +#. Check whether the used direct memory of JobHistoryServer reaches 90% of the maximum direct memory specified for JobHistoryServer by default. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`9 `. + +#. .. _alm-18015__li7519563191459: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **MapReduce** > **Configurations** > **All** **Configurations** > **JobHistoryServer** > **System** to check whether "-XX:MaxDirectMemorySize" exists in the **GC_OPTS** parameter. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`7 `. + +#. .. _alm-18015__li16830456145416: + + In the **GC_OPTS** parameter, delete "-XX:MaxDirectMemorySize". + +#. Save the configuration and restart the JobHistoryServer instance. + +#. .. _alm-18015__li195912241558: + + Check whether the **ALM-18009 Heap Memory Usage of JobHistoryServer Exceeds the Threshold** exists. + + - If yes, handle the alarm by referring to **ALM-18009 Heap Memory Usage of JobHistoryServer Exceeds the Threshold**. + - If no, go to :ref:`8 `. + +#. .. _alm-18015__li53290472191459: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-18015__li59831061191459: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +10. Select **JobHistoryServer** in the required cluster from the **Service**. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. 
+ +12. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417402.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18016_non_heap_memory_usage_of_resourcemanager_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18016_non_heap_memory_usage_of_resourcemanager_exceeds_the_threshold.rst new file mode 100644 index 0000000..d1e5da4 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18016_non_heap_memory_usage_of_resourcemanager_exceeds_the_threshold.rst @@ -0,0 +1,114 @@ +:original_name: ALM-18016.html + +.. _ALM-18016: + +ALM-18016 Non Heap Memory Usage of ResourceManager Exceeds the Threshold +======================================================================== + +Description +----------- + +The system checks the Non Heap memory usage of Yarn ResourceManager every 30 seconds and compares the actual usage with the threshold. The alarm is generated when the Non Heap memory usage of Yarn ResourceManager exceeds the threshold (90% of the maximum memory by default). + +Users can choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Yarn** to change the threshold. + +The alarm is cleared when the Non Heap memory usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18016 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When the Non Heap memory usage of Yarn ResourceManager is overhigh, the performance of Yarn task submission and operation is affected. In addition, a memory overflow may occur so that the Yarn service is unavailable. + +Possible Causes +--------------- + +The Non Heap memory of the Yarn ResourceManager instance on the node is overused or the Non Heap memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +**Check the Non Heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** > **ALM-18016 Non Heap Memory Usage of Yarn ResourceManager Exceeds the Threshold** > **Location**. Check the HostName of the instance for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Instance** > **ResourceManager**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Percentage of Used Memory of the ResourceManager**. ResourceManager indicates the corresponding HostName of the instance for which the alarm is generated. Check the Non Heap memory usage. + +#. Check whether the used Non Heap memory of ResourceManager reaches 90% of the maximum Non Heap memory specified for ResourceManager by default. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-18016__li66250658191713: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Configurations** > **All** **Configurations** > **ResourceManager** > **System**. Adjust the **GC_OPTS** memory parameter of ResourceManager. Save the configuration and restart the ResourceManager instance. + + .. 
note:: + + The mapping between the number of NodeManager instances in a cluster and the memory size of ResourceManager is as follows: + + - If the number of NodeManager instances in the cluster reaches 100, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms4G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 200, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 500, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms10G -Xmx10G -XX:NewSize=1G -XX:MaxNewSize=2G + - If the number of NodeManager instances in the cluster reaches 1000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms20G -Xmx20G -XX:NewSize=1G -XX:MaxNewSize=2G + - If the number of NodeManager instances in the cluster reaches 2000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms40G -Xmx40G -XX:NewSize=2G -XX:MaxNewSize=4G + - If the number of NodeManager instances in the cluster reaches 3000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms60G -Xmx60G -XX:NewSize=2G -XX:MaxNewSize=4G + - If the number of NodeManager instances in the cluster reaches 4000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms80G -Xmx80G -XX:NewSize=2G -XX:MaxNewSize=4G + - If the number of NodeManager instances in the cluster reaches 5000, the recommended JVM parameters of the ResourceManager instance are as follows: -Xms100G -Xmx100G -XX:NewSize=3G -XX:MaxNewSize=6G + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-18016__li66959012191713: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +7. Select the following node in the required cluster from the **Service**. + + - NodeAgent + - Yarn + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417403.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18017_non_heap_memory_usage_of_nodemanager_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18017_non_heap_memory_usage_of_nodemanager_exceeds_the_threshold.rst new file mode 100644 index 0000000..4c0c9cc --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18017_non_heap_memory_usage_of_nodemanager_exceeds_the_threshold.rst @@ -0,0 +1,109 @@ +:original_name: ALM-18017.html + +.. 
_ALM-18017: + +ALM-18017 Non Heap Memory Usage of NodeManager Exceeds the Threshold +==================================================================== + +Description +----------- + +The system checks the Non Heap memory usage of Yarn NodeManager every 30 seconds and compares the actual usage with the threshold. The alarm is generated when the Non Heap memory usage of Yarn NodeManager exceeds the threshold (90% of the maximum memory by default). + +Users can choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Yarn** to change the threshold. + +The alarm is cleared when the Non Heap memory usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18017 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When the Non Heap memory usage of Yarn NodeManager is overhigh, the performance of Yarn task submission and operation is affected. In addition, a memory overflow may occur so that the Yarn service is unavailable. + +Possible Causes +--------------- + +The Non Heap memory of the Yarn NodeManager instance on the node is overused or the Non Heap memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +**Check the Non Heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** > **ALM-18017 Non Heap Memory Usage of Yarn NodeManager Exceeds the Threshold** > **Location**. Check the HostName of the instance for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Instance** > **NodeManager**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Resource** > **Percentage of Used Memory of the NodeManager**. 
NodeManager indicates the corresponding HostName of the instance for which the alarm is generated. Check the Non Heap memory usage. + +#. Check whether the used Non Heap memory of NodeManager reaches 90% of the maximum Non Heap memory specified for NodeManager by default. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-18017__li59571663191848: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Configurations** > **All** **Configurations** > **NodeManager** > **System**. Adjust the **GC_OPTS** memory parameter of NodeManager, click **Save**, and click **OK,** and restart the role instance. + + .. note:: + + The mapping between the number of NodeManager instances in a cluster and the memory size of NodeManager is as follows: + + - If the number of NodeManager instances in the cluster reaches 100, the recommended JVM parameters for NodeManager instances are as follows: -Xms2G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 200, the recommended JVM parameters for NodeManager instances are as follows: -Xms4G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 500, the recommended JVM parameters for NodeManager instances are as follows: -Xms8G -Xmx8G -XX:NewSize=1G -XX:MaxNewSize=2G + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-18017__li42501753191848: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +7. Select the following node in the required cluster from the **Service**. + + - NodeAgent + - Yarn + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417404.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18018_nodemanager_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18018_nodemanager_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..7d42be2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18018_nodemanager_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,107 @@ +:original_name: ALM-18018.html + +.. _ALM-18018: + +ALM-18018 NodeManager Heap Memory Usage Exceeds the Threshold +============================================================= + +Description +----------- + +The system checks the heap memory usage of Yarn NodeManager every 30 seconds and compares the actual usage with the threshold. The alarm is generated when the heap memory usage of Yarn NodeManager exceeds the threshold (95% of the maximum memory by default). + +The alarm is cleared when the heap memory usage is less than or equal to the threshold. 
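+
+The heap usage evaluated by this alarm can also be read directly from the NodeManager JMX interface, which is convenient for scripted spot checks. The following is a minimal sketch only: it assumes the default NodeManager web port (8042), plain HTTP access without SPNEGO authentication, an available **python3**, and a placeholder host name.
+
+.. code-block:: bash
+
+   # Read the NodeManager heap usage from the standard Hadoop JMX servlet and
+   # compare it with the default 95% alarm threshold.
+   # The host name and port are assumptions; use the alarmed instance instead.
+   NM_HOST=nodemanager-host.example.com
+   curl -s "http://${NM_HOST}:8042/jmx?qry=java.lang:type=Memory" | python3 -c '
+   import json, sys
+   heap = json.load(sys.stdin)["beans"][0]["HeapMemoryUsage"]
+   print("Heap usage: %.1f%% of -Xmx (alarm threshold: 95%%)"
+         % (100.0 * heap["used"] / heap["max"]))
+   '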
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18018 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When the heap memory usage of Yarn NodeManager is overhigh, the performance of Yarn task submission and operation is affected. In addition, a memory overflow may occur so that the Yarn service is unavailable. + +Possible Causes +--------------- + +The heap memory of the Yarn NodeManager instance on the node is overused or the heap memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +**Check the heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** > **ALM-18018 NodeManager Heap Memory Usage Exceeds the Threshold** > **Location**. Check the HostName of the instance for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Instance** > **NodeManager**. Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **Resource** > **Percentage of** **Used** **Memory of the NodeManager** to check the heap memory usage. + +#. Check whether the used heap memory of NodeManager reaches 95% of the maximum heap memory specified for NodeManager. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-18018__li2965405192027: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Yarn** > **Configurations** > **All** **Configurations** > **NodeManager** > **System**. Increase the value of **GC_OPTS** parameter as required, click **Save**, and click **OK**, and restart the role instance. + + .. 
note:: + + The mapping between the number of NodeManager instances in a cluster and the memory size of NodeManager is as follows: + + - If the number of NodeManager instances in the cluster reaches 100, the recommended JVM parameters for NodeManager instances are as follows: -Xms2G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 200, the recommended JVM parameters for NodeManager instances are as follows: -Xms4G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=1G + - If the number of NodeManager instances in the cluster reaches 500, the recommended JVM parameters for NodeManager instances are as follows: -Xms8G -Xmx8G -XX:NewSize=1G -XX:MaxNewSize=2G + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-18018__li37737004192027: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +7. Select the following node in the required cluster from the **Service**. + + - NodeAgent + - Yarn + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417405.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18019_non_heap_memory_usage_of_jobhistoryserver_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18019_non_heap_memory_usage_of_jobhistoryserver_exceeds_the_threshold.rst new file mode 100644 index 0000000..41d6af1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18019_non_heap_memory_usage_of_jobhistoryserver_exceeds_the_threshold.rst @@ -0,0 +1,107 @@ +:original_name: ALM-18019.html + +.. _ALM-18019: + +ALM-18019 Non Heap Memory Usage of JobHistoryServer Exceeds the Threshold +========================================================================= + +Description +----------- + +The system checks the Non Heap memory usage of MapReduce JobHistoryServer every 30 seconds and compares the actual usage with the threshold. The alarm is generated when the Non Heap memory usage of MapReduce JobHistoryServer exceeds the threshold (90% of the maximum memory by default). + +Users can choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **MapReduce** to change the threshold. + +The alarm is cleared when the Non Heap memory usage is less than or equal to the threshold. 
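+
+For a quick manual check, the same Non Heap figures can be read from the JobHistoryServer JMX interface. The sketch below is illustrative only: it assumes the default JobHistoryServer web port (19888), plain HTTP access without SPNEGO authentication, an available **python3**, and a placeholder host name. Note that the JVM reports -1 as the Non Heap maximum when no explicit limit is configured.
+
+.. code-block:: bash
+
+   # Print the JobHistoryServer Non Heap usage reported by the Hadoop JMX servlet.
+   # The host name and port are assumptions; use the alarmed instance instead.
+   JHS_HOST=jobhistoryserver-host.example.com
+   curl -s "http://${JHS_HOST}:19888/jmx?qry=java.lang:type=Memory" | python3 -c '
+   import json, sys
+   nonheap = json.load(sys.stdin)["beans"][0]["NonHeapMemoryUsage"]
+   used_mb = nonheap["used"] / 1024.0 / 1024.0
+   if nonheap["max"] > 0:
+       print("Non Heap: %.0f MB used, %.1f%% of the maximum"
+             % (used_mb, 100.0 * nonheap["used"] / nonheap["max"]))
+   else:
+       print("Non Heap: %.0f MB used (no explicit maximum configured)" % used_mb)
+   '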
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18019 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When the Non Heap memory usage of MapReduce JobHistoryServer is overhigh, the performance of MapReduce task submission and operation is affected. In addition, a memory overflow may occur so that the MapReduce service is unavailable. + +Possible Causes +--------------- + +The Non Heap memory of the MapReduce JobHistoryServer instance on the node is overused or the Non Heap memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +**Check the Non Heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** > **ALM-18019 Non Heap Memory Usage of MapReduce JobHistoryServer Exceeds the Threshold** > **Location**. Check the HostName of the instance for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **MapReduce** > **Instance** > **JobHistoryServer.** Click the drop-down menu in the upper right corner of **Chart**, choose **Customize** > **JobHistoryServer Non Heap memory usage statistics**. JobHistoryServer indicates the corresponding HostName of the instance for which the alarm is generated. Check the Non Heap memory usage. + +#. Check whether the used Non Heap memory of JobHistoryServer reaches 90% of the maximum Non Heap memory specified for JobHistoryServer. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-18019__li46897838192216: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **MapReduce** > **Configurations** > **All** **Configurations** > **JobHistoryServer** > **System**. 
Adjust the **GC_OPTS** memory parameter of the NodeManager, click **Save**, and click **OK,** and restart the role instance. + + .. note:: + + The mapping between the number of historical tasks (10000) and the memory of the JobHistoryServer is as follows: + + -Xms30G -Xmx30G -XX:NewSize=1G -XX:MaxNewSize=2G + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-18019__li43891349192216: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +7. Select the following node in the required cluster from the **Service**. + + - NodeAgent + - MapReduce + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417406.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18020_yarn_task_execution_timeout.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18020_yarn_task_execution_timeout.rst new file mode 100644 index 0000000..10f8772 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18020_yarn_task_execution_timeout.rst @@ -0,0 +1,122 @@ +:original_name: ALM-18020.html + +.. _ALM-18020: + +ALM-18020 Yarn Task Execution Timeout +===================================== + +Description +----------- + +The system checks MapReduce and Spark tasks (except for permanent JDBC tasks) submitted to Yarn every 15 minutes. This alarm is generated when the task execution time exceeds the timeout duration specified by the user. However, the task can be properly executed. The client timeout parameter of MapReduce is mapreduce.application.timeout.alarm and that of Spark is spark.application.timeout.alarm. The unit is ms. + +This alarm is cleared when the task is finished or terminated. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18020 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------+ +| ApplicationName | Specifies the object (application ID) for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+-------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The alarm persists after task execution times out. However, the task can still be properly executed, so this alarm does not exert any impact on the system. + +Possible Causes +--------------- + +- The specified timeout duration is shorter than the required execution time. +- The queue resources for task running are insufficient. +- Task data skew occurs. As a result, some tasks process a large amount of data and take a long time to execute. + +Procedure +--------- + +**Check whether the timeout interval is correctly set.** + +#. On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. The **Alarms** page is displayed. +#. Select the alarm whose ID is **18020**. In the alarm details, view **Location** to obtain the timeout task name and timeout duration. +#. Based on the task name and timeout interval, choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn** > **ResourceManager (Active)** to log in to the native Yarn page. Then find the task on the native page, check its **StartTime** and calculate the task execution time based on the current system time. Check whether the task execution time exceeds the timeout duration. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`10 `. + +#. Evaluate the expected task execution time based on the service and compare it with the task timeout interval. If the timeout interval is too short, set the timeout interval (**mapreduce.application.timeout.alarm** or **spark.application.timeout.alarm**) of the client to the task expected execution time. Run the task again and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check whether the queue resources are sufficient.** + +5. .. _alm-18020__li9996125375313: + + Find the task on the native page and view the queue name of the task. Click **Scheduler** on the left of the native page. On the **Applications Queues** page, find the corresponding queue name and expand the queue details, as shown in the following figure. + + |image1| + +6. Check whether the value of **Used Resources** in the queue details is approximately equal to the value of **Max Resources**, which indicates that the resources in the queue submitted by the task have been used up. If the queue resources are insufficient, choose **Tenant Resources** > **Dynamic Resource Plan** > **Resource Distribution Policy** on FusionInsight Manager and increase the value of **Max Resources** for the queue. Run the task again and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Check whether data skew occurs.** + +7. .. _alm-18020__li5526143185420: + + On the native Yarn page, click *task ID* (for example, **application_1565337919723_0002**) > **Tracking URL:ApplicationMaster** > **job_1565337919723_0002**. The following page is displayed. + + |image2| + +8. Choose **Job** > **Map tasks** or **Job** > **Reduce tasks** on the left and check whether the execution time of each Map or Reduce task differs greatly. If yes, task data skew occurs. In this case, you need to balance the task data. + +9. 
Rectify the fault based on the preceding causes and perform the tasks again. Then, check whether the alarm persists. + + - If yes, go to :ref:`10 `. + - If no, no further action is required. + +**Collect the fault information.** + +10. .. _alm-18020__li6394993485922: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +11. Expand the **Service** drop-down list, and select **Yarn** for the target cluster. + +12. Click |image3| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +13. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001087171010.png +.. |image2| image:: /_static/images/en-us_image_0000001439562477.png +.. |image3| image:: /_static/images/en-us_image_0263895445.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18021_mapreduce_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18021_mapreduce_service_unavailable.rst new file mode 100644 index 0000000..caec951 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18021_mapreduce_service_unavailable.rst @@ -0,0 +1,147 @@ +:original_name: ALM-18021.html + +.. _ALM-18021: + +ALM-18021 Mapreduce Service Unavailable +======================================= + +Description +----------- + +The alarm module checks the MapReduce service status every 60 seconds. This alarm is generated when the system detects that the MapReduce service is unavailable. + +The alarm is cleared when the MapReduce service recovers. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18021 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The cluster cannot provide the MapReduce service. For example, MapReduce cannot be used to view task logs or the log archive function is unavailable. + +Possible Causes +--------------- + +- The JobHistoryServer instance is abnormal. +- The KrbServer service is abnormal. +- The ZooKeeper service abnormal. +- The HDFS service abnormal. +- The Yarn service is abnormal. + +Procedure +--------- + +**Check** **MapReduce service JobHistoryServer instance status.** + +#. On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **MapReduce** > **Instance**. +#. Check whether the running status of JobHistoryServer is **Normal**. + + - If yes, go to :ref:`11 `. 
+ - If no, go to :ref:`3 `. + +**Check the KrbServer service status.** + +3. .. _alm-18021__li2896399895811: + + In the alarm list on FusionInsight Manager, check whether **ALM-25500 KrbServer Service Unavailable** exists. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +4. .. _alm-18021__li6544511395811: + + Rectify the fault by following the steps provided in **ALM-25500 KrbServer Service Unavailable**, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check the ZooKeeper service.** + +5. .. _alm-18021__li4793762895811: + + In the alarm list on FusionInsight Manager, check whether **ALM-13000 ZooKeeper Service Unavailable** exists. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`7 `. + +6. .. _alm-18021__li4474654695811: + + Rectify the fault by following the steps provided in **ALM-13000 ZooKeeper Service Unavailable**, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Check the HDFS service status.** + +7. .. _alm-18021__li247717695811: + + In the alarm list on FusionInsight Manager, check whether **ALM-14000 HDFS Service Unavailable** exists. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +8. .. _alm-18021__li5261529695811: + + Rectify the fault by following the steps provided in **ALM-14000 HDFS Service Unavailable**, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Check the Yarn service status.** + +9. .. _alm-18021__li19148237174725: + + In the alarm list on FusionInsight Manager, check whether **ALM-18000 Yarn Service Unavailable** exists. + + - If yes, go to :ref:`10 ` + - If no, go to :ref:`11 `. + +10. .. _alm-18021__li13219687174725: + + Rectify the fault by following the steps provided in **ALM-18000 Yarn Service Unavailable**, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Collect fault information.** + +11. .. _alm-18021__li795116716116: + + On the FusionInsight Manager home page of the active cluster, choose **O&M** > **Log > Download.** + +12. Select **MapReduce** in the required cluster from the **Service.** + +13. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +14. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417409.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18022_insufficient_yarn_queue_resources.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18022_insufficient_yarn_queue_resources.rst new file mode 100644 index 0000000..e78c9ff --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18022_insufficient_yarn_queue_resources.rst @@ -0,0 +1,139 @@ +:original_name: ALM-18022.html + +.. 
_ALM-18022: + +ALM-18022 Insufficient Yarn Queue Resources +=========================================== + +Description +----------- + +The alarm module checks Yarn queue resources every 60 seconds. This alarm is generated when available resources or ApplicationMaster (AM) resources of a queue are insufficient. + +This alarm is cleared when available resources are sufficient. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18022 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Parameter Name | Description | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| QueueName | Specifies the queue for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| QueueMetric | Specifies the metric of the queue for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- An application being executed takes longer time. +- An application fails to be executed for a long time after being submitted. + +Possible Causes +--------------- + +- NodeManager node resources are insufficient. +- The configured maximum resource capacity of the queue is excessively small. +- The configured maximum AM resource percentage is excessively small. + +Procedure +--------- + +**View alarm details.** + +#. On the FusionInsight Manager, choose **O&M** > **Alarm > Alarms**. +#. View location information of this alarm and check whether **QueueName** is **root** and **QueueMetric** is **Memory** or **QueueName** is **root** and **QueueMetric** is **vCores**. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +3. .. _alm-18022__li1118842213014: + + The memory or CPU of the Yarn cluster is insufficient. In this case, log in to the node where NodeManager resides and run the **free -g** and **cat /proc/cpuinfo** commands to query the available memory and available CPU of the node, respectively. On FusionInsight Manager, increase the values of **yarn.nodemanager.resource.memory-mb** and **yarn.nodemanager.resource.cpu-vcores** for the Yarn NodeManager based on the query results. Then, restart the NodeManager instance. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +4. .. 
_alm-18022__li550716317319: + + View location information of this alarm and check whether **QueueName** is **** and **QueueMetric** is **Memory**, or **QueueName** is **** and **QueueMetric** is **vCores** in **Location**, check whether **available Memory =** or **available vCores =** are included in **Additional Information**. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`7 `. + +5. .. _alm-18022__li11123116735: + + The memory or CPU of the tenant queue is insufficient. In this case, choose **Tenant Resources** > **Dynamic Resource Plan > Resource Distribution Policy** and increase the value of **Maximum Capacity**. Then, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +6. .. _alm-18022__li109354114148: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn** > **Configurations** > **All Configurations**. Enter the keyword "threshold" and click **ResourceManager**. Adjust the threshold values of the following parameters: + + If **Additional Information** contains **available Memory =**, change the value of **yarn.queue.memory.alarm.threshold** to a value smaller than that of **available Memory =** in **Additional Information**. + + If **Additional Information** contains **available vCores =**, change the value of **yarn.queue.vcore.alarm.threshold** to a value smaller than that of **available vCores =** in **Additional Information**. + + Wait for five minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +7. .. _alm-18022__li1189109935: + + If **available AmMemory =** or **available AmvCores =** is included in **Additional Information**, ApplicationMaster memory or CPU of the tenant queue is insufficient. In this case, choose **Tenant Resources** > **Dynamic Resource Plan** > **Queue Configuration** and increase the value of **Maximum Am Resource Percent**. Then, check whether this alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +8. .. _alm-18022__li1382974791617: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn** > **Configurations** > **All Configurations**. Enter the keyword "threshold" and click **ResourceManager**. Adjust the threshold values of the following parameters: + + If **Additional Information** contains **available AmMemory =**, change the value of **yarn.queue.memory.alarm.threshold** to a value smaller than that of **available AmMemory =** in **Additional Information**. + + If **Additional Information** contains **available AmvCores =**, change the value of **yarn.queue.vcore.alarm.threshold** to a value smaller than that of **available AmvCores =** in **Additional Information**. + + Wait for five minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-18022__li1973131339: + + Log in to FusionInsight Manager of the active cluster, and choose **O&M** > **Log** > **Download**. + +10. Select **Yarn** in the required cluster from the **Service**. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. 
+ +Reference +--------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417410.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18023_number_of_pending_yarn_tasks_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18023_number_of_pending_yarn_tasks_exceeds_the_threshold.rst new file mode 100644 index 0000000..20b81af --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18023_number_of_pending_yarn_tasks_exceeds_the_threshold.rst @@ -0,0 +1,116 @@ +:original_name: ALM-18023.html + +.. _ALM-18023: + +ALM-18023 Number of Pending Yarn Tasks Exceeds the Threshold +============================================================ + +Description +----------- + +The alarm module checks the number of pending applications in the Yarn root queue every 60 seconds. The alarm is generated when the number exceeds 60. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18023 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+------------------------------------------------------------------+ +| Name | Meaning | ++=============+==================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| QueueName | Identifies the queue for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| QueueMetric | Identifies the queue indicator for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- It takes long time to end an application. +- A new application cannot run after submission. + +Possible Causes +--------------- + +- NodeManager node resources are insufficient. +- The maximum resource capacity of the queue and the maximum AM resource percentage are too small. +- The monitoring threshold is too small. + +Procedure +--------- + +**Check NodeManager resources.** + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn** > **ResourceManager (Active)** to access the ResourceManager web UI. + +#. Click **Scheduler** and check whether the root queue resources are used up in **Application Queues**. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +#. .. _alm-18023__li1894618168247: + + Expand the capacity of the NodeManager instance of the Yarn service. After the capacity expansion, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Check the maximum queue resource capacity and the maximum AM resource percentage.** + +4. .. _alm-18023__li156321342274: + + Check whether the resources of the queue corresponding to the pending task are used up. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +5. .. _alm-18023__li1663218419278: + + On FusionInsight Manager, choose **Tenant Resources** > **Dynamic Resource Plan** and add resources as required. Check whether the alarms are cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. 
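+
+The pending-application and free-resource figures inspected in the preceding steps can also be read from the ResourceManager REST API, which is convenient for repeated checks while capacity is being added. This is only a sketch: it assumes the default ResourceManager web port (8088), plain HTTP access without SPNEGO authentication, an available **python3**, and a placeholder host name.
+
+.. code-block:: bash
+
+   # Show how many applications are pending and how much memory and how many
+   # vCores remain, using the ResourceManager cluster metrics REST API.
+   # By default this alarm is raised when pending applications exceed 60.
+   RM_HOST=resourcemanager-host.example.com
+   curl -s "http://${RM_HOST}:8088/ws/v1/cluster/metrics" | python3 -c '
+   import json, sys
+   m = json.load(sys.stdin)["clusterMetrics"]
+   print("Pending applications:", m["appsPending"])
+   print("Available memory (MB):", m["availableMB"])
+   print("Available vCores:", m["availableVirtualCores"])
+   '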
+ +**Adjust the monitoring thresholds.** + +6. .. _alm-18023__li15314143611285: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Yarn** > **Applications** > **Pending Applications**, and increase the thresholds as required. + +7. Check whether the alarm is cleared 5 minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Collect the fault information.** + +8. .. _alm-18023__li76841314475: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +9. Expand the **Service** drop-down list, and select **Yarn** for the target cluster. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895802.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18024_pending_yarn_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18024_pending_yarn_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..29108b5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18024_pending_yarn_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,116 @@ +:original_name: ALM-18024.html + +.. _ALM-18024: + +ALM-18024 Pending Yarn Memory Usage Exceeds the Threshold +========================================================= + +Description +----------- + +The alarm module checks the pending memory of Yarn every 60 seconds. The alarm is generated when the pending memory exceeds the threshold. Pending memory indicates the total memory that is not allocated to submitted Yarn applications. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18024 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+------------------------------------------------------------------+ +| Name | Meaning | ++=============+==================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| QueueName | Identifies the queue for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| QueueMetric | Identifies the queue indicator for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- It takes long time to end an application. +- A new application cannot run after submission. + +Possible Causes +--------------- + +- NodeManager node resources are insufficient. +- The maximum resource capacity of the queue and the maximum AM resource percentage are too small. +- The monitoring threshold is too small. + +Procedure +--------- + +**Check NodeManager resources.** + +#. 
On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn** > **ResourceManager (Active)** to access the ResourceManager web UI. + +#. Click **Scheduler** and check whether the root queue resources are used up in **Application Queues**. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +#. .. _alm-18024__li1894618168247: + + Expand the capacity of the NodeManager instance of the Yarn service. After the capacity expansion, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Check the maximum queue resource capacity and the maximum AM resource percentage.** + +4. .. _alm-18024__li156321342274: + + Check whether the resources of the queue corresponding to the pending task are used up. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +5. .. _alm-18024__li1663218419278: + + On FusionInsight Manager, choose **Tenant Resources** > **Dynamic Resource Plan** and add resources as required. Check whether the alarms are cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Adjust the monitoring thresholds.** + +6. .. _alm-18024__li15314143611285: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Yarn** > **CPU and Memory** > **Pending Memory**, and increase the threshold as required. + +7. Check whether the alarm is cleared 5 minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Collect the fault information.** + +8. .. _alm-18024__li76841314475: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +9. Expand the **Service** drop-down list, and select **Yarn** for the target cluster. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895617.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18025_number_of_terminated_yarn_tasks_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18025_number_of_terminated_yarn_tasks_exceeds_the_threshold.rst new file mode 100644 index 0000000..5f3c367 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18025_number_of_terminated_yarn_tasks_exceeds_the_threshold.rst @@ -0,0 +1,101 @@ +:original_name: ALM-18025.html + +.. _ALM-18025: + +ALM-18025 Number of Terminated Yarn Tasks Exceeds the Threshold +=============================================================== + +Description +----------- + +The alarm module checks the number of terminated applications in the Yarn root queue every 60 seconds. The alarm is generated when the number exceeds 50 for three consecutive times. 
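+
+In addition to the ResourceManager web UI, the terminated (killed) applications that this alarm counts can be listed through the ResourceManager REST API. The following is a minimal sketch: it assumes the default ResourceManager web port (8088), plain HTTP access without SPNEGO authentication, an available **python3**, and a placeholder host name.
+
+.. code-block:: bash
+
+   # Count the applications killed in the last hour via the ResourceManager REST
+   # API and print their IDs, users, and names, which helps to judge whether the
+   # tasks were terminated intentionally.
+   RM_HOST=resourcemanager-host.example.com
+   SINCE_MS=$(( ( $(date +%s) - 3600 ) * 1000 ))
+   curl -s "http://${RM_HOST}:8088/ws/v1/cluster/apps?states=KILLED&finishedTimeBegin=${SINCE_MS}" |
+   python3 -c '
+   import json, sys
+   apps = (json.load(sys.stdin).get("apps") or {}).get("app", [])
+   print("Applications killed in the last hour:", len(apps))
+   for a in apps:
+       print(a["id"], a.get("user", ""), a.get("name", ""))
+   '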
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18025 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Cluster Name | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Service Name | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Role Name | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +A large number of application tasks are forcibly terminated. + +Possible Causes +--------------- + +- The user forcibly terminates a large number of tasks. +- The system terminates tasks due to some error. + +Procedure +--------- + +**Check the alarm details.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** to go to the alarm page. + +#. View **Additional Information** in the alarm details to check whether the alarm threshold is too small. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +#. .. _alm-18025__li20880184714436: + + Choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Yarn** > **Other** > **Terminated Applications of root queue** to modify the threshold. Go to :ref:`6 `. + +#. .. _alm-18025__li2088019471430: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn** > **ResourceManager(Active)** to access the ResourceManager web UI. + +#. Click **KILLED** in **Applications** and click the task on the top. View the description of **Diagnostics** and rectify the fault based on the task termination details (for example, the task is terminated by a user). + +#. .. _alm-18025__li4883154713439: + + Wait for 3 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect the fault information.** + +7. .. _alm-18025__li4879124718434: + + On the FusionInsight Manager, choose **O&M > Log > Download**. + +8. Expand the **Service** drop-down list, and select **Yarn** for the target cluster. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. 
Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417413.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18026_number_of_failed_yarn_tasks_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18026_number_of_failed_yarn_tasks_exceeds_the_threshold.rst new file mode 100644 index 0000000..35dffe3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-18026_number_of_failed_yarn_tasks_exceeds_the_threshold.rst @@ -0,0 +1,101 @@ +:original_name: ALM-18026.html + +.. _ALM-18026: + +ALM-18026 Number of Failed Yarn Tasks Exceeds the Threshold +=========================================================== + +Description +----------- + +The alarm module checks the number of failed applications in the Yarn root queue every 60 seconds. The alarm is generated when the number exceeds 50 for three consecutive times. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18026 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Cluster Name | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Service Name | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Role Name | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- A large number of application tasks fail to be executed. +- Failed tasks need to be submitted again. + +Possible Causes +--------------- + +The task fails to be executed due to some error. + +Procedure +--------- + +**Check the alarm details.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** to go to the alarm page. + +#. View **Additional Information** in the alarm details to check whether the alarm threshold is too small. 
+ + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +#. .. _alm-18026__li7914580481: + + Choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Yarn** > **Other** > **Failed Applications of root queue** to modify the threshold. Go to :ref:`6 `. + +#. .. _alm-18026__li691418874816: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn** > **ResourceManager(Active)** to access the ResourceManager web UI. + +#. Click **FAILED** in **Applications** and click the task on the top. View the description of **Diagnostics** and rectify the fault based on the task failure causes. + +#. .. _alm-18026__li291712812484: + + Wait for 3 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect the fault information.** + +7. .. _alm-18026__li189131680485: + + On the FusionInsight Manager, choose O&M > Log > Download. + +8. Expand the **Service** drop-down list, and select **Yarn** for the target cluster. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417414.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19000_hbase_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19000_hbase_service_unavailable.rst new file mode 100644 index 0000000..b384137 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19000_hbase_service_unavailable.rst @@ -0,0 +1,280 @@ +:original_name: ALM-19000.html + +.. _ALM-19000: + +ALM-19000 HBase Service Unavailable +=================================== + +Description +----------- + +This alarm is generated when the HBase service is unavailable. The alarm module checks the HBase service status every 120 seconds. + +This alarm is cleared when the HBase service recovers. + +.. note:: + + If the multi-instance function is enabled in the cluster and multiple HBase service instances are installed, you need to determine the HBase service instance where the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the HBase1 service is unavailable, ServiceName=HBase1 is displayed in **Location**, and the operation object in the procedure needs to be changed from HBase to HBase1. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +19000 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. 
+=========== ======================================================= + +Impact on the System +-------------------- + +Operations, such as reading or writing data and creating tables, cannot be performed. + +Possible Causes +--------------- + +- The ZooKeeper service is abnormal. +- The HDFS service is abnormal. +- The HBase service is abnormal. +- The network is abnormal. + +Procedure +--------- + +**Check the ZooKeeper service status.** + +#. On the FusionInsight Manager, check whether the running status of ZooKeeper is **Normal** on service list. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`2 `. + +#. .. _alm-19000__li42710393192610: + + In the alarm list, check whether **ALM-13000 ZooKeeper Service Unavailable** exists. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-19000__li36989843192610: + + Rectify the fault by following the steps provided in **ALM-13000 ZooKeeper Service Unavailable**. + +#. Wait several minutes, and check whether alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check the HDFS service status.** + +5. .. _alm-19000__li31549687192610: + + In the alarm list, check whether **ALM-14000 HDFS Service Unavailable** exists. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`8 `. + +6. .. _alm-19000__li5387888192610: + + Rectify the fault by following the steps provided in **ALM-14000 HDFS Service Unavailable**. + +7. Wait several minutes, and check whether alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +8. .. _alm-19000__li7395880192610: + + On the FusionInsight Manager portal, choose **Cluster** *> Name of the desired cluster* > **Services** > **HDFS**. Check whether **Safe Mode** is **ON**. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`12 `. + +9. .. _alm-19000__li42432199192610: + + Log in to the HDFS client as user **root**. Run **cd** to switch to the client installation directory, and run **source bigdata_env**. + + If the cluster uses the security mode, perform security authentication. Obtain the password of user hdfs from the administrator, run the **kinit hdfs** command and enter the password as prompted. + +10. Run the following command to manually exit the safe mode: + + **hdfs dfsadmin -safemode leave** + +11. Wait several minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`12 `. + +**Check the HBase service status.** + +12. .. _alm-19000__li3109192610: + + On the FusionInsight Manager portal, click **Cluster** > *Name of the desired cluster* > **Services** > **HBase**. + +13. Check whether there is one active HMaster and one standby HMaster. + + - If yes, go to :ref:`15 `. + - If no, go to :ref:`14 `. + +14. .. _alm-19000__li51944053192610: + + Click **Instances**, select the HMaster whose status is not **Active**, click **More**, and select **Restart Instance** to restart the HMaster. Check whether there is one active HMaster and one standby HMaster again. + + - If yes, go to :ref:`15 `. + - If no, go to :ref:`21 `. + +15. .. _alm-19000__li26121173192610: + + Choose **Cluster** >\ *Name of the desired cluster* > **Services** > **HBase** > **HMaster(Active)** to go to the HMaster WebUI. + + .. note:: + + By default, the **admin** user does not have the permissions to manage other components. 
If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component. + +16. Check whether at least one RegionServer exists under **Region Servers**. + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`21 `. + +17. .. _alm-19000__li52728456192610: + + Check **Tables** > **System Tables**, as shown in :ref:`Figure 1 `. Check whether **hbase:meta**, **hbase:namespace**, and **hbase:acl** exist in the **Table Name** column. + + - If yes, go to :ref:`18 `. + - If no, go to :ref:`19 `. + + .. _alm-19000__fig13078536192610: + + .. figure:: /_static/images/en-us_image_0269417415.png + :alt: **Figure 1** HBase system table + + **Figure 1** HBase system table + +18. .. _alm-19000__li52774331192610: + + As shown in :ref:`Figure 1 `, click the **hbase:meta**, **hbase:namespace**, and **hbase:acl** hyperlinks and check whether the pages are properly displayed. If the pages are properly displayed, the tables are normal. + + If they are, go to :ref:`19 `. + + If they are not, go to :ref:`23 `. + + .. note:: + + In normal mode, **ACL** is enabled for HBase by default. The **hbase:acl** table is generated only when **ACL** is manually enabled. In this case, check this table. In other scenarios, this table does not need to be checked. + +19. .. _alm-19000__li2123961192610: + + View the HMaster startup status. + + In :ref:`Figure 2 `, if the **RUNNING** state exists in **Tasks**, HMaster is being started. In the **State** column, you can view the time when HMaster is in the **RUNNING** state. In :ref:`Figure 3 `, if the state is **COMPLETE**, HMaster is started. + + Check whether HMaster is in the **RUNNING** state for a long time. + + .. _alm-19000__fig2133867192610: + + .. figure:: /_static/images/en-us_image_0269417416.png + :alt: **Figure 2** HMaster is being started + + **Figure 2** HMaster is being started + + .. _alm-19000__fig41660353192610: + + .. figure:: /_static/images/en-us_image_0269417417.png + :alt: **Figure 3** HMaster is started + + **Figure 3** HMaster is started + + - If yes, go to :ref:`20 `. + - If no, go to :ref:`21 `. + +20. .. _alm-19000__li34107122192610: + + On the HMaster WebUI, check whether any hbase:meta is in the **Region in Transition** state for a long time. + + + .. figure:: /_static/images/en-us_image_0269417418.png + :alt: **Figure 4** Region in Transition + + **Figure 4** Region in Transition + + - If yes, go to :ref:`21 `. + - If no, go to :ref:`22 `. + +21. .. _alm-19000__li23797537192610: + + In the precondition that services are not affected, log in to the FusionInsight Manager portal and choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **More** > **Restart Service**. Enter the administrator password and click **OK**. + + - If yes, go to :ref:`22 `. + - If no, go to :ref:`23 `. + +22. .. _alm-19000__li53096940192610: + + Wait several minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`23 `. + +**Check the network connection between HMaster and dependent components.** + +23. .. _alm-19000__li52963882192610: + + On the FusionInsight Manager, choose **Cluster** >\ *Name of the desired cluster* > **Services** > **HBase**. + +24. .. _alm-19000__li6333253192610: + + Click **Instance** and the HMaster instance list is displayed. Record the **management IP Address** in the row of **HMaster(Active)**. + +25. 
Use the IP address obtained in :ref:`24 ` to log in to the host where the active HMaster runs as user **omm**. + +26. Run the **ping** command to check whether the host that runs the active HMaster can communicate with the hosts that run the dependent components. (The dependent components include ZooKeeper, HDFS, and Yarn. Obtain the IP addresses of the hosts that run these services in the same way as that for obtaining the IP address of the active HMaster.) + + - If yes, go to :ref:`29 `. - If no, go to :ref:`27 `. + +27. .. _alm-19000__li11937281192610: + + Contact the administrator to restore the network. + +28. In the alarm list, check whether the **HBase Service Unavailable** alarm is cleared. + + - If yes, no further action is required. - If no, go to :ref:`29 `. + +**Collect fault information.** + +29. .. _alm-19000__li5658542192610: + + On the FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +30. Select the following services in the required cluster from the **Service** drop-down list: + + - ZooKeeper - HDFS - HBase + +31. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +32. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417419.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19006_hbase_replication_sync_failed.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19006_hbase_replication_sync_failed.rst new file mode 100644 index 0000000..24181f0 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19006_hbase_replication_sync_failed.rst @@ -0,0 +1,188 @@ +:original_name: ALM-19006.html + +.. _ALM-19006: + +ALM-19006 HBase Replication Sync Failed +======================================= + +Description +----------- + +The alarm module checks the HBase DR data synchronization status every 30 seconds. When disaster recovery (DR) data fails to be synchronized to a standby cluster, the alarm is triggered. + +When DR data synchronization succeeds, the alarm is cleared. + +.. note:: + + If the multi-instance function is enabled in the cluster and multiple HBase service instances are installed, you need to determine the HBase service instance where the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the HBase1 service is unavailable, **ServiceName=HBase1** is displayed in **Location**, and the operation object in the procedure needs to be changed from HBase to HBase1.
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +19006 Critical Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +HBase data in a cluster fails to be synchronized to the standby cluster, causing data inconsistency between active and standby clusters. + +Possible Causes +--------------- + +- The HBase service on the standby cluster is abnormal. +- A network exception occurs. + +Procedure +--------- + +**Observe whether the system automatically clears the alarm.** + +#. On the FusionInsight Manager portal of the active cluster, click **O&M** > **Alarm** > **Alarms.** + +#. In the alarm list, click the alarm to obtain alarm generation time from **Generated** of the alarm. Check whether the alarm has existed for five minutes. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`3 `. + +#. .. _alm-19006__li5327925819413: + + Wait five minutes and check whether the system automatically clears the alarm. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check the HBase service status of the standby cluster.** + +4. .. _alm-19006__li2065263819413: + + Log in to the FusionInsight Manager portal of the active cluster, and click **O&M** > **Alarm** > **Alarms.** + +5. In the alarm list, click the alarm to obtain **HostName** from **Location**. + +6. Access the node where the HBase client of the active cluster resides as user **omm**. + + If the cluster uses a security mode, perform security authentication first and then access the **hbase shell** interface as user **hbase**. + + **cd /opt/client** + + **source ./bigdata_env** + + **kinit** *hbaseuser* + +7. Run the **status 'replication', 'source'** command to check the DR synchronization status of the faulty node. + + The DR synchronization status of a node is as follows. + + .. 
code-block:: + + 10-10-10-153: + SOURCE: PeerID=abc, SizeOfLogQueue=0, ShippedBatches=2, ShippedOps=2, ShippedBytes=320, LogReadInBytes=1636, LogEditsRead=5, LogEditsFiltered=3, SizeOfLogToReplicate=0, TimeForLogToReplicate=0, ShippedHFiles=0, SizeOfHFileRefsQueue=0, AgeOfLastShippedOp=0, TimeStampsOfLastShippedOp=Mon Jul 18 09:53:28 CST 2016, Replication Lag=0, FailedReplicationAttempts=0 + SOURCE: PeerID=abc1, SizeOfLogQueue=0, ShippedBatches=1, ShippedOps=1, ShippedBytes=160, LogReadInBytes=1636, LogEditsRead=5, LogEditsFiltered=3, SizeOfLogToReplicate=0, TimeForLogToReplicate=0, ShippedHFiles=0, SizeOfHFileRefsQueue=0, AgeOfLastShippedOp=16788, TimeStampsOfLastShippedOp=Sat Jul 16 13:19:00 CST 2016, Replication Lag=16788, FailedReplicationAttempts=5 + +8. Obtain **PeerID** corresponding to a record whose **FailedReplicationAttempts** value is greater than 0. + + In the preceding step, data on the faulty node 10-10-10-153 fails to be synchronized to a standby cluster whose **PeerID** is **abc1**. + +9. .. _alm-19006__li6555881219413: + + Run the **list_peers** command to find the cluster and the HBase instance corresponding to the **PeerID** value. + + .. code-block:: + + PEER_ID CLUSTER_KEY STATE TABLE_CFS + abc1 10.10.10.110,10.10.10.119,10.10.10.133:2181:/hbase2 ENABLED + abc 10.10.10.110,10.10.10.119,10.10.10.133:2181:/hbase ENABLED + + In the preceding information, **/hbase2** indicates that data is synchronized to the HBase2 instance of the standby cluster. + +10. In the service list of FusionInsight Manager of the standby cluster, check whether the running status of the HBase instance obtained by using :ref:`9 ` is **Normal**. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`11 `. + +11. .. _alm-19006__li448244019413: + + In the alarm list, check whether the **ALM-19000 HBase Service Unavailable** alarm is generated. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`14 `. + +12. .. _alm-19006__li2753337519413: + + Follow troubleshooting procedures in **ALM-19000 HBase Service Unavailable** to rectify the fault. + +13. Wait for a few minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`14 `. + +**Check network connections between RegionServers on active and standby clusters.** + +14. .. _alm-19006__li2284519319413: + + Log in to the FusionInsight Manager portal of the active cluster, and click **O&M** > **Alarm** > **Alarms.** + +15. .. _alm-19006__li3322104919413: + + In the alarm list, click the alarm to obtain **HostName** from **Location**. + +16. Use the IP address obtained in :ref:`15 ` to log in to a faulty RegionServer node as user **omm**. + +17. Run the **ping** command to check whether network connections between the faulty RegionServer node and the host where RegionServer of the standby cluster resides are in the normal state. + + - If yes, go to :ref:`20 `. + - If no, go to :ref:`18 `. + +18. .. _alm-19006__li5820706019413: + + Contact the network administrator to restore the network. + +19. After the network is running properly, check whether the alarm is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`20 `. + +**Collect fault information.** + +20. .. _alm-19006__li342888619413: + + On the FusionInsight Manager interface of active and standby clusters, choose **O&M** > **Log** > **Download**. + +21. In the **Service** drop-down list box, select **HBase** in the required cluster. + +22. 
Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +23. Contact the O&M personnel and send the collected fault logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417420.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19007_hbase_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19007_hbase_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..72f438e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19007_hbase_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,124 @@ +:original_name: ALM-19007.html + +.. _ALM-19007: + +ALM-19007 HBase GC Time Exceeds the Threshold +============================================= + +Description +----------- + +The system checks the old generation garbage collection (GC) time of the HBase service every 60 seconds. This alarm is generated when the detected old generation GC time exceeds the threshold (exceeds 5 seconds for three consecutive checks by default). To change the threshold, on the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **HBase > GC >** **GC time for old generation**. This alarm is cleared when the old generation GC time of the HBase service is shorter than or equal to the threshold. + +.. note:: + + If the multi-instance function is enabled in the cluster and multiple HBase service instances are installed, you need to determine the HBase service instance where the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the HBase1 service is unavailable, **ServiceName=HBase1** is displayed in **Location**, and the operation object in the procedure needs to be changed from HBase to HBase1. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +19007 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------+------------------------------------------------------------------+ +| Name | Meaning | ++=============+==================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the old generation GC time exceeds the threshold, HBase data read and write are affected. 
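+ +In addition to the Manager charts used in the Procedure below, the old generation GC time can be cross-checked directly on the node for which the alarm is generated. The following commands are only an illustrative sketch: they assume that you have logged in to the node as user **omm**, that the JDK tools delivered with the cluster are in the PATH, and that the process ID is taken from the **jps** output; the sampling interval and count are arbitrary examples. + +.. code-block:: + +   # List the HBase JVM processes and note the PID of the alarmed role (HMaster or HRegionServer). +   jps | grep -E 'HMaster|HRegionServer' + +   # Print GC statistics three times at 5-second intervals; the FGCT column is the +   # accumulated full (old generation) GC time in seconds. +   jstat -gcutil <pid> 5000 3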
+ +Possible Causes +--------------- + +The memory of HBase instances is overused, the heap memory is inappropriately allocated, or a large number of I/O operations exist in HBase. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC time.** + +#. On the FusionInsight Manager portal, click **O&M** > **Alarm** > **Alarms** and select the alarm whose **ID** is **19007**. Then check the role name in **Location** and confirm the IP address of the instance. + + - If the role for which the alarm is generated is HMaster, go to :ref:`2 `. - If the role for which the alarm is generated is RegionServer, go to :ref:`3 `. + +#. .. _alm-19007__li56776013194753: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Instance** and click the HMaster for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **GC** > **Garbage Collection (GC) Time of HMaster** and click **OK** to check whether the value of **GC time** **for old generation** is greater than the threshold (exceeds 5 seconds for three consecutive check periods by default). + + - If yes, go to :ref:`4 `. - If no, go to :ref:`6 `. + +#. .. _alm-19007__li29806005194753: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Instance** and click the RegionServer for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **GC** > **Garbage Collection (GC) Time of RegionServer** and click **OK** to check whether the value of **GC time** **for old generation** is greater than the threshold (exceeds 5 seconds for three consecutive check periods by default). + + - If yes, go to :ref:`4 `. - If no, go to :ref:`6 `. + +**Check the current JVM configuration.** + +4. .. _alm-19007__li25514146194753: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Configurations**, and click **All Configurations**. In the search box, enter **GC_OPTS** to check the **GC_OPTS** memory parameters of the HMaster (**HBase** > **HMaster**) and RegionServer (**HBase** > **RegionServer**) roles. Adjust the values of **-Xmx** and **-XX:CMSInitiatingOccupancyFraction** of the GC_OPTS parameter by referring to the Note. + + .. note:: + + a. Suggestions on GC parameter configurations for HMaster + + - Set **-Xms** and **-Xmx** to the same value to prevent JVM from dynamically adjusting the heap memory size and affecting performance. + - Set **-XX:NewSize** to the value of **-XX:MaxNewSize**, which is one eighth of **-Xmx**. + - For large-scale HBase clusters with a large number of regions, increase the values of the **GC_OPTS** parameters for HMaster. Specifically, set **-Xmx** to 4 GB if the number of regions is less than 100,000. If the number of regions is more than 100,000, set **-Xmx** to be greater than or equal to 6 GB. For every additional 35,000 regions, increase the value of **-Xmx** by 2 GB. The maximum value of **-Xmx** is 32 GB. + + b. Suggestions on GC parameter configurations for RegionServer + + - Set **-Xms** and **-Xmx** to the same value to prevent JVM from dynamically adjusting the heap memory size and affecting performance. + - Set **-XX:NewSize** to one eighth of **-Xmx**. + - Set the memory for RegionServer to be greater than that for HMaster. If sufficient memory is available, increase the heap memory.
+ - Set **-Xmx** based on the machine memory size. Specifically, set **-Xmx** to 32 GB if the machine memory is greater than 200 GB, to 16 GB if the machine memory is greater than 128 GB and less than 200 GB, and to 8 GB if the machine memory is less than 128 GB. When **-Xmx** is set to 32 GB, a RegionServer node supports 2000 regions and 200 hotspot regions. + - Set **-XX:CMSInitiatingOccupancyFraction** to be less than or equal to **85**; it is calculated as follows: 100 x (hfile.block.cache.size + hbase.regionserver.global.memstore.size) + +5. Check whether the alarm is cleared. + + - If yes, no further action is required. - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-19007__li55997378194753: + + On the FusionInsight Manager interface of active and standby clusters, choose **O&M** > **Log** > **Download**. + +7. In the **Service** drop-down list box, select **HBase** in the required cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected fault logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417421.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19008_heap_memory_usage_of_the_hbase_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19008_heap_memory_usage_of_the_hbase_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..a4ec2a2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19008_heap_memory_usage_of_the_hbase_process_exceeds_the_threshold.rst @@ -0,0 +1,121 @@ +:original_name: ALM-19008.html + +.. _ALM-19008: + +ALM-19008 Heap Memory Usage of the HBase Process Exceeds the Threshold +====================================================================== + +Description +----------- + +The system checks the HBase service status every 30 seconds. The alarm is generated when the heap memory usage of an HBase service exceeds the threshold (90% of the maximum memory). + +.. note:: + + If the multi-instance function is enabled in the cluster and multiple HBase service instances are installed, you need to determine the HBase service instance where the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the HBase1 service is unavailable, **ServiceName=HBase1** is displayed in **Location**, and the operation object in the procedure needs to be changed from HBase to HBase1. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +19008 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------+------------------------------------------------------------------+ +| Name | Meaning | ++=============+==================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available HBase heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The heap memory of the HBase service is overused or the heap memory is inappropriately allocated. + +Procedure +--------- + +**Check heap memory usage.** + +#. On the FusionInsight Manager portal, click **O&M** > **Alarm** > **Alarms** and select the alarm whose **ID** is **19008**. Then check the role name in **Location** and confirm the IP address of the instance. + + - If the role for which the alarm is generated is HMaster, go to :ref:`2 `. - If the role for which the alarm is generated is RegionServer, go to :ref:`3 `. + +#. .. _alm-19008__li11316139195110: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Instance** and click the HMaster for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **CPU and Memory** > **HMaster Heap Memory Usage and Direct Memory Usage Statistics** and click **OK**. Then check whether the used heap memory of the HBase service reaches 90% of the maximum heap memory specified for HBase. + + - If yes, go to :ref:`4 `. - If no, go to :ref:`6 `. + +#. .. _alm-19008__li44755158195110: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Instance** and click the RegionServer for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **CPU and Memory** > **RegionServer Heap Memory Usage and Direct Memory Usage Statistics** and click **OK**. Then check whether the used heap memory of the HBase service reaches 90% of the maximum heap memory specified for HBase. + + - If yes, go to :ref:`4 `. - If no, go to :ref:`6 `. + +#. .. _alm-19008__li27009410195110: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Configurations**, and click **All Configurations**. Choose **HMaster/RegionServer** > **System**. Increase the value of **-Xmx** in **GC_OPTS** by referring to the Note. + + .. note:: + + a. Suggestions on GC parameter configurations for HMaster + + - Set **-Xms** and **-Xmx** to the same value to prevent JVM from dynamically adjusting the heap memory size and affecting performance. + - Set **-XX:NewSize** to the value of **-XX:MaxNewSize**, which is one eighth of **-Xmx**. + - For large-scale HBase clusters with a large number of regions, increase the values of the **GC_OPTS** parameters for HMaster. Specifically, set **-Xmx** to 4 GB if the number of regions is less than 100,000. If the number of regions is more than 100,000, set **-Xmx** to be greater than or equal to 6 GB. For every additional 35,000 regions, increase the value of **-Xmx** by 2 GB.
The maximum value of **-Xmx** is 32 GB. + + b. Suggestions on GC parameter configurations for RegionServer + + - Set **-Xms** and **-Xmx** to the same value to prevent JVM from dynamically adjusting the heap memory size and affecting performance. + - Set **-XX:NewSize** to one eighth of **-Xmx**. + - Set the memory for RegionServer to be greater than that for HMaster. If sufficient memory is available, increase the heap memory. + - Set **-Xmx** based on the machine memory size. Specifically, set **-Xmx** to 32 GB if the machine memory is greater than 200 GB, to 16 GB if the machine memory is greater than 128 GB and less than 200 GB, and to 8 GB if the machine memory is less than 128 GB. When **-Xmx** is set to 32 GB, a RegionServer node supports 2000 regions and 200 hotspot regions. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-19008__li56360562195110: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +7. Select **HBase** in the required cluster from the **Service** drop-down list. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected fault logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417422.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19009_direct_memory_usage_of_the_hbase_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19009_direct_memory_usage_of_the_hbase_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..ddc324d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19009_direct_memory_usage_of_the_hbase_process_exceeds_the_threshold.rst @@ -0,0 +1,123 @@ +:original_name: ALM-19009.html + +.. _ALM-19009: + +ALM-19009 Direct Memory Usage of the HBase Process Exceeds the Threshold +======================================================================== + +Description +----------- + +The system checks the HBase service status every 30 seconds. The alarm is generated when the direct memory usage of an HBase service exceeds the threshold (90% of the maximum memory). + +The alarm is cleared when the direct memory usage is less than the threshold. + +.. note:: + + If the multi-instance function is enabled in the cluster and multiple HBase service instances are installed, you need to determine the HBase service instance where the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the HBase1 service is unavailable, **ServiceName=HBase1** is displayed in **Location**, and the operation object in the procedure needs to be changed from HBase to HBase1. 
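+ +The **-XX:MaxDirectMemorySize** option that the Procedure below checks for is part of the **GC_OPTS** parameter of the HMaster or RegionServer role. The following excerpt is a hypothetical example used only to show where the option appears; the actual **GC_OPTS** value in a cluster is longer and depends on the version and sizing. If the option is removed, the JVM typically defaults the direct memory limit to the maximum heap size set by **-Xmx**. + +.. code-block:: + +   # Hypothetical GC_OPTS excerpt (example values only): +   -Xms6G -Xmx6G -XX:NewSize=768M -XX:MaxNewSize=768M -XX:MaxDirectMemorySize=512M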
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +19009 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------+------------------------------------------------------------------+ +| Name | Meaning | ++=============+==================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available HBase direct memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct memory of the HBase service is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +**Check direct memory usage.** + +#. On the FusionInsight Manager portal, click **O&M** > **Alarm** > **Alarms** and select the alarm whose **ID** is **19009**. Check the **RoleName** in **Location** and confirm the IP address of **HostName**. + + - If the role for which the alarm is generated is HMaster, go to :ref:`2 `. + - If the role for which the alarm is generated is RegionServer, go to :ref:`3 `. + +#. .. _alm-19009__li51947016195425: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Instance** and click the HMaster for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **CPU and Memory** > **HMaster Heap Memory Usage and Direct Memory Usage Statistics** and click **OK** to check whether the used direct memory of the HBase service reaches 90% of the maximum direct memory specified for HBase. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`8 `. + +#. .. _alm-19009__li17000576195425: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Instance** and click the RegionServer for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the **Chart** area and choose **Customize** > **CPU and Memory** > **RegionServer Heap Memory Usage and Direct Memory Usage Statistics** and click **OK** to check whether the used direct memory of the HBase service reaches 90% of the maximum direct memory specified for HBase. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`8 `. + +#. .. _alm-19009__li30000576195425: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Configurations**, and click **All Configurations**. Choose **HMaster/RegionServer** > **System** and check whether **XX:MaxDirectMemorySize** exists in **GC_OPTS**. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. 
_alm-19009__li131714294313: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Configurations**, and click **All Configurations**. Choose **HMaster/RegionServer** > **System** and delete **XX:MaxDirectMemorySize** from **GC_OPTS**. + +#. .. _alm-19009__li336333834: + + Check whether the **ALM-19008 Heap Memory Usage of the HBase Process Exceeds the Threshold** alarm is generated. + + If yes, handle the alarm by referring to **ALM-19008 Heap Memory Usage of the HBase Process Exceeds the Threshold**. + + If no, go to :ref:`8 `. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. - If no, go to :ref:`8 `. + +**Collect fault information.** + +8. .. _alm-19009__li62317418195425: + + On the FusionInsight Manager interface of active and standby clusters, choose **O&M** > **Log** > **Download**. + +9. In the **Service** drop-down list box, select **HBase** in the required cluster. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact the O&M personnel and send the collected fault logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417423.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19011_regionserver_region_number_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19011_regionserver_region_number_exceeds_the_threshold.rst new file mode 100644 index 0000000..570e199 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19011_regionserver_region_number_exceeds_the_threshold.rst @@ -0,0 +1,223 @@ +:original_name: ALM-19011.html + +.. _ALM-19011: + +ALM-19011 RegionServer Region Number Exceeds the Threshold +========================================================== + +Description +----------- + +The system checks the number of regions on each RegionServer in each HBase service instance every 30 seconds. The region number is displayed on the HBase service monitoring page and RegionServer role monitoring page. This alarm is generated when the number of regions on a RegionServer exceeds the threshold (default value: 2000) for 20 consecutive times. The threshold can be changed by choosing **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **HBase**. This alarm is cleared when the number of regions is less than or equal to the threshold. + +.. note:: + + If the multi-instance function is enabled in the cluster and multiple HBase service instances are installed, you need to determine the HBase service instance where the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the HBase1 service is unavailable, **ServiceName=HBase1** is displayed in **Location**, and the operation object in the procedure needs to be changed from HBase to HBase1.
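+ +Besides the HMaster web UI used in the Procedure below, the number of regions on each RegionServer can also be checked from the HBase shell. This is only a sketch: it assumes that the client is installed in **/opt/client** and that security authentication has already been performed in a security-mode cluster; the output format varies with the HBase version. + +.. code-block:: + +   cd /opt/client +   source bigdata_env +   hbase shell + +   # In the shell, 'status' with the 'simple' option prints one line per live RegionServer, +   # including a numberOfOnlineRegions field that can be compared with the alarm threshold. +   status 'simple'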
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +19011 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The data read/write performance of HBase is affected when the number of regions on a RegionServer exceeds the threshold. + +Possible Causes +--------------- + +- The RegionServer region distribution is unbalanced. +- The HBase cluster scale is too small. + +Procedure +--------- + +**View alarm location information.** + +#. On the FusionInsight Manager home page, choose **O&M** > **Alarm** > **Alarms**, select this alarm, and view the service instance and host name in **Location**. + +#. On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services**, click the HBase service instance for which the alarm is generated, and click **HMaster(Active)**. On the displayed WebUI of the HBase instance, check whether the region distribution on the RegionServer is balanced. + + .. note:: + + By default, the **admin** user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`3 `. + + + .. figure:: /_static/images/en-us_image_0276801805.png + :alt: **Figure 1** WebUI of HBase instance + + **Figure 1** WebUI of HBase instance + +**Enable load balancing.** + +3. .. _alm-19011__li1094914149149: + + Log in to the node where the HBase client is located as user **root**. Go to the client installation directory, and set environment variables. + + **cd** *client installation directory* + + **source bigdata_env** + + If the cluster adopts the security mode, perform security authentication. Specifically, run the **kinit hbase** command and enter the password as prompted (obtain the password from the administrator). + +4. Run the following commands to go to the HBase shell command window and check whether the load balancing function is enabled. + + **hbase shell** + + **balancer_enabled** + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`5 `. + +5. .. _alm-19011__li169541814111411: + + On the HBase shell command window, run the following commands to enable the load balancing function and check whether the function is enabled. + + **balance_switch true** + + **balancer_enabled** + +6. .. _alm-19011__li10954151421410: + + On the HBase shell command window, run the **balancer** command to manually trigger the load balancing function. + + .. note:: + + You are advised to enable and manually trigger the load balancing function during off-peak hours. + +7. On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase**, and click **HMaster(Active)**. 
On the displayed WebUI of the HBase instance, refresh the page and check whether the region distribution is balanced. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`21 `. + +8. .. _alm-19011__li09541614181420: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Delete unwanted HBase tables.** + +.. note:: + + Exercise caution when deleting data to ensure data is deleted correctly. + +9. .. _alm-19011__li1035192916147: + + On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase**, and click **HMaster(Active)**. On the displayed WebUI of the HBase instance, view tables stored in the HBase service instance and record unwanted tables that can be deleted. + +10. On the HBase shell command window, run the **disable** command and **drop** command to delete the table to decrease the number of regions. + + **disable '**\ *name of the table to be deleted'* + + **drop '**\ *name of the table to be deleted'* + +11. On the HBase shell command window, run the following command to check whether the load balancing function is enabled. + + **balancer_enabled** + + - If yes, go to :ref:`13 `. + - If no, go to :ref:`12 `. + +12. .. _alm-19011__li33682961411: + + On the HBase shell command window, run the following commands to enable the load balancing function and confirm that the function is enabled. + + **balance_switch true** + + **balancer_enabled** + +13. .. _alm-19011__li236102961418: + + On the HBase shell command window, run the **balancer** command to manually trigger the load balancing function. + +14. On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase**, and click **HMaster(Active)**. On the displayed WebUI of the HBase instance, refresh the page and check whether the region distribution is balanced. + + - If yes, go to :ref:`15 `. + - If no, go to :ref:`21 `. + +15. .. _alm-19011__li113716297149: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`16 `. + +**Adjust the threshold.** + +16. .. _alm-19011__li3975164521415: + + On the FusionInsight Manager home page, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **HBase** > **Regions(RegionServer)**, select the applied rule, and click **Modify** to check whether the threshold is proper. + + - If it is excessively small, increase the threshold as required and go to :ref:`17 `. + - If it is proper, go to :ref:`18 `. + +17. .. _alm-19011__li14975174511413: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`18 `. + + **Perform system capacity expansion.** + +18. .. _alm-19011__li4975174511141: + + Add nodes to the HBase cluster and add RegionServer instances to the nodes. Then enable and manually trigger the load balancing function. + +19. On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services**, click the HBase service instance for which the alarm is generated, and click **HMaster(Active)**. On the displayed WebUI of the HBase instance, refresh the page and check whether the region distribution is balanced. + + - If yes, go to :ref:`20 `. + - If no, go to :ref:`21 `. + +20. .. _alm-19011__li119761945121413: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`21 `. + + **Collect fault information.** + +21. .. 
_alm-19011__li697624581415: + + On the FusionInsight Manager home page of the active and standby clusters, choose **O&M**> **Log** > **Download**. + +22. Select **HBase** in the required cluster from the **Service**. + +23. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +24. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417427.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19012_hbase_system_table_directory_or_file_lost.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19012_hbase_system_table_directory_or_file_lost.rst new file mode 100644 index 0000000..ac49eb5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19012_hbase_system_table_directory_or_file_lost.rst @@ -0,0 +1,100 @@ +:original_name: ALM-19012.html + +.. _ALM-19012: + +ALM-19012 HBase System Table Directory or File Lost +=================================================== + +Description +----------- + +The system checks whether HBase directories and files exist on the HDFS every 120 seconds. This alarm is generated when the system detects that the files or directories do not exist. This alarm is cleared when the files or directories are restored. + +The HBase directories and files are as follows: + +- Directory of the namespace **hbase** on the HDFS +- **hbase.version** file +- Directory of the table **hbase:meta** on the HDFS, .tableinfo file, and .regioninfo file +- Directory of the table **hbase:namespace** on the HDFS, .tableinfo file, and .regioninfo file +- Directory of the table **hbase:hindex** on the HDFS, .tableinfo file, and .regioninfo file +- Directory of the **hbase:acl** table on the HDFS, .tableinfo, and .regioninfo file (This table does not exist in the common mode cluster by default.) + + .. note:: + + If the multi-instance function is enabled in the cluster and multiple HBase service instances are installed, you need to determine the HBase service instance where the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the HBase1 service is unavailable, **ServiceName=HBase1** is displayed in **Location**, and the operation object in the procedure needs to be changed from HBase to HBase1. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +19012 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. 
+=========== ======================================================= + +Impact on the System +-------------------- + +The HBase service fails to restart or start. + +Possible Causes +--------------- + +Files or directories on the HDFS are missing. + +Procedure +--------- + +**Locate the alarm cause.** + +#. On the FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. Click this alarm and check whether **Alarm Cause** indicates unknown errors. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 ` + +#. .. _alm-19012__li5458941113112: + + On the FusionInsight Manager home page, choose **O&M** > **Backup and Restoration** > **Backup Management**. Check whether there are success records of the backup task named **default** or other HBase metadata backup tasks that have been successfully executed. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +#. .. _alm-19012__li164581941183116: + + Use the latest backup metadata to restore the metadata of the HBase service. + +**Collect fault information.** + +4. .. _alm-19012__li11456104183119: + + On the FusionInsight Manager page of the active and standby clusters, choose **O&M** > **Log** > **Download**. + +5. In the **Service** area, select faulty HBase services in the required cluster. + +6. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +7. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417428.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19013_duration_of_regions_in_transaction_state_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19013_duration_of_regions_in_transaction_state_exceeds_the_threshold.rst new file mode 100644 index 0000000..2576f0b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19013_duration_of_regions_in_transaction_state_exceeds_the_threshold.rst @@ -0,0 +1,122 @@ +:original_name: ALM-19013.html + +.. _ALM-19013: + +ALM-19013 Duration of Regions in transaction State Exceeds the Threshold +======================================================================== + +Description +----------- + +The system checks the number of regions in transaction state on HBase every 300 seconds. This alarm is generated when the system detects that the duration of regions in transaction state exceeds the threshold for two consecutive times. This alarm is cleared when all timeout regions are restored. + +.. note:: + + If the multi-instance function is enabled in the cluster and multiple HBase service instances are installed, you need to determine the HBase service instance where the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the HBase1 service is unavailable, **ServiceName=HBase1** is displayed in **Location**, and the operation object in the procedure needs to be changed from HBase to HBase1. 
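+ +For convenience, the client-side commands referenced in the Procedure below are gathered here as a single sketch. The client installation directory, procedure ID, and region name are placeholders to be replaced with the values from your environment; run the commands only at the points described in the Procedure. + +.. code-block:: + +   cd <client installation directory> +   source bigdata_env +   kinit hbase    # only required in security mode + +   # Check whether hbck reports a missing table descriptor. +   hbase hbck + +   # Release a procedure lock that stays in the Waiting state. +   hbase hbck -j <client installation directory>/HBase/hbase/tools/hbase-hbck2-*.jar bypass -o <pid> + +   # Bring the affected region online again after an active/standby switchover. +   hbase hbck -j <client installation directory>/HBase/hbase/tools/hbase-hbck2-*.jar assigns -o <regionName>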
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +19013 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Some data in the table gets lost or becomes unavailable. + +Possible Causes +--------------- + +- Compaction is permanently blocked. +- The HDFS files are abnormal. + +Procedure +--------- + +**Locate the alarm cause.** + +#. On the FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**, select this alarm, and view the **HostName** and **RoleName** in **Location**. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase**. Click the drop-down menu in the chart area and choose **Customize** > **Service** > **Region in transaction count** to view **Region in transaction count over threshold**. Check whether the monitoring item detects a value in three consecutive detection periods. (The default threshold is 60 seconds.) + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`7 `. + +#. .. _alm-19013__li0444398318: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **HMaster (Active)** > **Tables** to check whether the regions in transaction state that time out belong to only one table. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-19013__li1318724573113: + + Run the **hbase hbck** command on the client and check whether the error message "No table descriptor file under hdfs://hacluster/hbase/data/default/table" is displayed. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`7 `. + +#. .. _alm-19013__li417435203115: + + Log in to the client as user **root**. Run the following command: + + **cd** *client installation directory* + + **source bigdata_env** + + If the cluster is in security mode, run the **kinit hbase** command. + + Log in to the HMaster WebUI, choose **Procedure & Locks** in the navigation tree, and check whether any process ID is in the **Waiting** state in **Procedures**. If yes, run the following command to release the procedure lock: + + **hbase hbck -j** *client installation directory*\ **/HBase/hbase/tools/hbase-hbck2-*.jar bypass -o** *pid* + + Check whether the procedure is in the **Bypass** state. If the procedure on the UI is always in the **RUNNABLE(Bypass)** state, perform an active/standby switchover. Run the **assigns** command to bring the region online again. + + **hbase hbck -j** *client installation directory*\ **/HBase/hbase/tools/hbase-hbck2-*.jar assigns -o** *regionName* + +#. Repeat :ref:`4 `. Run the **hbase hbck** command on the client and check whether the error message "No table descriptor file under hdfs://hacluster/hbase/data/default/table" is displayed. + + - If yes, go to :ref:`7 `. + - If no, no further action is required. + +**Collect fault information.** + +7. ..
_alm-19013__li11456104183119: + + On the FusionInsight Manager page of the active and standby clusters, choose **O&M** > **Log** > **Download**. + +8. In the **Service** area, select faulty HBase services in the required cluster. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417429.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19014_capacity_quota_usage_on_zookeeper_exceeds_the_threshold_severely.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19014_capacity_quota_usage_on_zookeeper_exceeds_the_threshold_severely.rst new file mode 100644 index 0000000..329a43a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19014_capacity_quota_usage_on_zookeeper_exceeds_the_threshold_severely.rst @@ -0,0 +1,133 @@ +:original_name: ALM-19014.html + +.. _ALM-19014: + +ALM-19014 Capacity Quota Usage on ZooKeeper Exceeds the Threshold Severely +========================================================================== + +Description +----------- + +The system checks the ZNode usage of the HBase service every 120 seconds. This alarm is generated when the ZNode capacity usage of the HBase service exceeds the critical alarm threshold (90% by default). + +This alarm is cleared when the ZNode capacity usage is less than the critical alarm threshold. + +.. note:: + + If the multi-instance function has been enabled in the cluster and multiple HBase services have been installed, determine the HBase service for which the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the value of **ServiceName** is **HBase-1**, change the operation object in the procedure from **HBase** to **HBase-1**. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +19014 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ========================================================= +Name Meaning +=========== ========================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +Threshold Specifies the threshold for which the alarm is generated. +=========== ========================================================= + +Impact on the System +-------------------- + +This alarm indicates that the capacity usage of the ZNode of HBase has exceeded the threshold severely. As a result, the write request of the HBase service fails. + +Possible Causes +--------------- + +- DR is configured for HBase, and data synchronization fails or is slow in DR. +- A large number of WAL files are being split in the HBase cluster. 
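The procedure below inspects the HBase ZNode quota and usage with the ZooKeeper client. The following is only a sketch of that interaction, assuming the default root ZNode **/hbase** (set by **zookeeper.znode.parent**), a hypothetical client directory **/opt/client**, and a security-mode cluster.

.. code-block:: bash

   # Sketch only; /opt/client and the /hbase root ZNode are assumptions.
   cd /opt/client
   source bigdata_env
   kinit hbase                        # security-mode clusters only

   hbase zkcli                        # opens the ZooKeeper CLI; then run:
   #   listquota /hbase               #   capacity and quantity quota of the root ZNode
   #   getusage  /hbase/splitWAL      #   "Data size" / "Node count" for WAL splitting
   #   getusage  /hbase/replication   #   "Data size" / "Node count" for replication
   #   quit
   # Compare "Data size" with the capacity quota; a ratio close to the alarm
   # threshold (90% by default for this alarm) points to the matching branch
   # of the procedure below.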
+ +Procedure +--------- + +**Check the capacity configuration and usage of ZNodes.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**, select the alarm whose ID is **19014**, and view the threshold in **Additional Information**. + +#. Log in to the HBase client as user **root**. Run the following command to go to the client installation directory: + + **cd** *Client installation directory* + + Run the following command to set environment variables: + + **source bigdata_env** + + If the cluster uses the security mode, run the following command to perform security authentication: + + **kinit hbase** + + Enter the password as prompted (obtain the password from the MRS cluster administrator). + +#. Run the **hbase zkcli** command to log in to the ZooKeeper client and run the **listquota /hbase** command to check the ZNode capacity quota of the HBase service. The ZNode root directory in the command is specified by the **zookeeper.znode.parent** parameter of the HBase service. The marked area in the following figure shows the capacity configuration of the root ZNode of the HBase service. + + |image1| + +#. Run the **getusage /hbase/splitWAL** command to check the capacity usage of the ZNode. Check whether the ratio of **Data size** to the ZNode capacity quota is close to the alarm threshold. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-19014__li6339011093624: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. Check whether the alarm whose ID is **12007**, **19000**, or **19013** and the **ServiceName** in **Location** is the current HBase service exists. + + - If yes, click **View Help** next to the alarm and rectify the fault by referring to the help document. Then, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +#. .. _alm-19014__li62018222616: + + Run the **getusage /hbase/replication** command to check the capacity usage of the ZNode. Check whether the ratio of **Data size** to the ZNode capacity quota is close to the alarm threshold. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`9 `. + +#. .. _alm-19014__li17555915687: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. Check whether the alarm whose ID is **19006** and **ServiceName** in **Location** is the current HBase service exists. + + - If yes, click **View Help** next to the alarm and rectify the fault by referring to the help document. Then, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +#. .. _alm-19014__li5814142393624: + + Check whether the alarm is cleared five minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect the fault information.** + +9. .. _alm-19014__li5351076393624: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +10. Expand the drop-down list next to the **Service** field. In the **Services** dialog box that is displayed, select **HBase** for the target cluster. + +11. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001390938104.png +.. 
|image2| image:: /_static/images/en-us_image_0263895386.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19015_quantity_quota_usage_on_zookeeper_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19015_quantity_quota_usage_on_zookeeper_exceeds_the_threshold.rst new file mode 100644 index 0000000..09d6218 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19015_quantity_quota_usage_on_zookeeper_exceeds_the_threshold.rst @@ -0,0 +1,133 @@ +:original_name: ALM-19015.html + +.. _ALM-19015: + +ALM-19015 Quantity Quota Usage on ZooKeeper Exceeds the Threshold +================================================================= + +Description +----------- + +The system checks the ZNode usage of the HBase service every 120 seconds. This alarm is generated when the system detects that the ZNode quantity usage of the HBase service exceeds the alarm threshold (75% by default). + +This alarm is cleared when the ZNode quantity usage is less than the alarm threshold. + +.. note:: + + If the multi-instance function has been enabled in the cluster and multiple HBase services have been installed, determine the HBase service for which the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the value of **ServiceName** is **HBase-1**, change the operation object in the procedure from **HBase** to **HBase-1**. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +19015 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ========================================================= +Name Meaning +=========== ========================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +Threshold Specifies the threshold for which the alarm is generated. +=========== ========================================================= + +Impact on the System +-------------------- + +This alarm indicates that the ZNode quantity usage in the HBase service has exceeded the threshold. If this alarm is not handled in a timely manner, the problem severity may be escalated to **Critical**, affecting data writing. + +Possible Causes +--------------- + +- DR is configured for HBase, and data synchronization fails or is slow in DR. +- A large number of WAL files are being split in the HBase cluster. + +Procedure +--------- + +**Check the quantity quota and usage of ZNodes.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**, select the alarm whose ID is **19015**, and view the threshold in **Additional Information**. + +#. Log in to the HBase client as user **root**. Run the following command to go to the client installation directory: + + **cd** *Client installation directory* + + Run the following command to set environment variables: + + **source bigdata_env** + + If the cluster uses the security mode, run the following command to perform security authentication: + + **kinit hbase** + + Enter the password as prompted (obtain the password from the MRS cluster administrator). + +#. 
Run the **hbase zkcli** command to log in to the ZooKeeper client and run the **listquota /hbase** command to check the ZNode quantity quota of the HBase service. The ZNode root directory in the command is specified by the **zookeeper.znode.parent** parameter of the HBase service. The marked area in the following figure shows the quantity quota configuration of the root ZNode of the HBase service. + + |image1| + +#. Run the **getusage /hbase/splitWAL** command to check the ZNode quantity usage and check whether the ratio of **Node count** in the command output to the ZNode quantity quota is close to the alarm threshold. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-19015__li1269371131112: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. Check whether the alarm whose ID is **12007**, **19000**, or **19013** and the **ServiceName** in **Location** is the current HBase service exists. + + - If yes, click **View Help** next to the alarm and rectify the fault by referring to the help document. Then, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +#. .. _alm-19015__li62018222616: + + Run the **getusage /hbase/replication** command to check the ZNode quantity usage and check whether the ratio of **Node count** in the command output to the ZNode quantity quota is close to the alarm threshold. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`9 `. + +#. .. _alm-19015__li17555915687: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. Check whether the alarm whose ID is **19006** and **ServiceName** in **Location** is the current HBase service exists. + + - If yes, click **View Help** next to the alarm and rectify the fault by referring to the help document. Then, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +#. .. _alm-19015__li9693191171112: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect the fault information.** + +9. .. _alm-19015__li146938101112: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +10. Expand the **Service** drop-down list, and select **HBase** for the target cluster. + +11. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001441097977.png +.. |image2| image:: /_static/images/en-us_image_0263895577.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19016_quantity_quota_usage_on_zookeeper_exceeds_the_threshold_severely.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19016_quantity_quota_usage_on_zookeeper_exceeds_the_threshold_severely.rst new file mode 100644 index 0000000..7eb73fb --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19016_quantity_quota_usage_on_zookeeper_exceeds_the_threshold_severely.rst @@ -0,0 +1,133 @@ +:original_name: ALM-19016.html + +.. 
_ALM-19016: + +ALM-19016 Quantity Quota Usage on ZooKeeper Exceeds the Threshold Severely +========================================================================== + +Description +----------- + +The system checks the ZNode usage of the HBase service every 120 seconds. This alarm is generated when the znode usage of the HBase service exceeds the critical alarm threshold (90% by default). + +This alarm is cleared when the quantity usage of the ZNode is less than the critical alarm threshold. + +.. note:: + + If the multi-instance function has been enabled in the cluster and multiple HBase services have been installed, determine the HBase service for which the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the value of **ServiceName** is **HBase-1**, change the operation object in the procedure from **HBase** to **HBase-1**. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +19016 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ========================================================= +Name Meaning +=========== ========================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +Threshold Specifies the threshold for which the alarm is generated. +=========== ========================================================= + +Impact on the System +-------------------- + +This alarm indicates that the quantity usage of the ZNode of HBase has exceeded the threshold severely. As a result, the write request of the HBase service fails. + +Possible Causes +--------------- + +- DR is configured for HBase, and data synchronization fails or is slow in DR. +- A large number of WAL files are being split in the HBase cluster. + +Procedure +--------- + +**Check the quantity quota and usage of ZNodes.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**, select the alarm whose ID is **19016**, and view the threshold in **Additional Information**. + +#. Log in to the HBase client as user **root**. Run the following command to go to the client installation directory: + + **cd** *Client installation directory* + + Run the following command to set environment variables: + + **source bigdata_env** + + If the cluster uses the security mode, run the following command to perform security authentication: + + **kinit hbase** + + Enter the password as prompted (obtain the password from the MRS cluster administrator). + +#. Run the **hbase zkcli** command to log in to the ZooKeeper client and run the **listquota /hbase** command to check the ZNode quantity quota of the HBase service. The ZNode root directory in the command is specified by the **zookeeper.znode.parent** parameter of the HBase service. The marked area in the following figure shows the quantity configuration of the root ZNode of the HBase service. + + |image1| + +#. Run the **getusage /hbase/splitWAL** command to check the ZNode usage and check whether the ratio of **Node count** in the command output to the znode quantity quota is close to the alarm threshold. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-19016__li6339011093624: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. 
Check whether the alarm whose ID is **12007**, **19000**, or **19013** and the **ServiceName** in **Location** is the current HBase service exists. + + - If yes, click **View Help** next to the alarm and rectify the fault by referring to the help document. Then, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +#. .. _alm-19016__li62018222616: + + Run the **getusage /hbase/replication** command to check the ZNode usage and check whether the ratio of **Node count** in the command output to the ZNode quantity quota is close to the alarm threshold. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`9 `. + +#. .. _alm-19016__li17555915687: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. Check whether the alarm whose ID is **19006** and **ServiceName** in **Location** is the current HBase service exists. + + - If yes, click **View Help** next to the alarm and rectify the fault by referring to the help document. Then, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +#. .. _alm-19016__li5814142393624: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect the fault information.** + +9. .. _alm-19016__li5351076393624: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +10. Expand the **Service** drop-down list, and select **HBase** for the target cluster. + +11. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001390618824.png +.. |image2| image:: /_static/images/en-us_image_0263895751.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19017_capacity_quota_usage_on_zookeeper_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19017_capacity_quota_usage_on_zookeeper_exceeds_the_threshold.rst new file mode 100644 index 0000000..e53801b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19017_capacity_quota_usage_on_zookeeper_exceeds_the_threshold.rst @@ -0,0 +1,133 @@ +:original_name: ALM-19017.html + +.. _ALM-19017: + +ALM-19017 Capacity Quota Usage on ZooKeeper Exceeds the Threshold +================================================================= + +Description +----------- + +The system checks the ZNode usage of the HBase service every 120 seconds. This alarm is generated when the system detects that the ZNodes capacity usage of the HBase service exceeds the alarm threshold (75% by default). + +This alarm is cleared when the capacity usage of the ZNode capacity is less than the threshold. + +.. note:: + + If the multi-instance function has been enabled in the cluster and multiple HBase services have been installed, determine the HBase service for which the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the value of **ServiceName** is **HBase-1**, change the operation object in the procedure from **HBase** to **HBase-1**. 
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +19017 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ========================================================= +Name Meaning +=========== ========================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +Threshold Specifies the threshold for which the alarm is generated. +=========== ========================================================= + +Impact on the System +-------------------- + +This alarm indicates that the ZNodes capacity usage in the HBase service has exceeded the threshold. If this alarm is not handled in a timely manner, the problem severity may be escalated to **Critical**, affecting data writing. + +Possible Causes +--------------- + +- DR is configured for HBase, and data synchronization fails or is slow in DR. +- A large number of WAL files are being split in the HBase cluster. + +Procedure +--------- + +**Check the capacity configuration and usage of ZNodes.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**, select the alarm whose ID is **19017**, and view the threshold in **Additional Information**. + +#. Log in to the HBase client as user **root**. Run the following command to go to the client installation directory: + + **cd** *Client installation directory* + + Run the following command to set environment variables: + + **source bigdata_env** + + If the cluster uses the security mode, run the following command to perform security authentication: + + **kinit hbase** + + Enter the password as prompted (obtain the password from the MRS cluster administrator). + +#. Run the **hbase zkcli** command to log in to the ZooKeeper client and run the **listquota /hbase** command to check the ZNode quantity quota of the HBase service. The ZNode root directory in the command is specified by the **zookeeper.znode.parent** parameter of the HBase service. The marked area in the following figure shows the quantity configuration of the root ZNode of the HBase service. + + |image1| + +#. Run the **getusage /hbase/splitWAL** command to check the capacity usage of the ZNode. Check whether the ratio of **Data size** to the ZNode capacity quota is close to the alarm threshold. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-19017__li6339011093624: + + On FusionInsight Manager, check whether the alarm whose ID is **12007**, **19000**, or **19013** and **ServiceName** in **Location** is the current HBase service exists. + + - If yes, click **View Help** next to the alarm and rectify the fault by referring to the help document. Then, go to :ref:`8 `. + - If no, go to :ref:`7 `. + +#. .. _alm-19017__li62018222616: + + Run the **getusage /hbase/replication** command to check the capacity usage of the ZNode. Check whether the ratio of **Data size** to the ZNode capacity quota is close to the alarm threshold. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`9 `. + +#. .. _alm-19017__li17555915687: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. Check whether the alarm whose ID is **19006** and **ServiceName** in **Location** is the current HBase service exists. 
+ + - If yes, click **View Help** next to the alarm and rectify the fault by referring to the help document. Then, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +#. .. _alm-19017__li5814142393624: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect the fault information.** + +9. .. _alm-19017__li5351076393624: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +10. Expand the **Service** drop-down list, and select **HBase** for the target cluster. + +11. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001440858625.png +.. |image2| image:: /_static/images/en-us_image_0263895513.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19018_hbase_compaction_queue_size_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19018_hbase_compaction_queue_size_exceeds_the_threshold.rst new file mode 100644 index 0000000..c578083 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19018_hbase_compaction_queue_size_exceeds_the_threshold.rst @@ -0,0 +1,95 @@ +:original_name: ALM-19018.html + +.. _ALM-19018: + +ALM-19018 HBase Compaction Queue Size Exceeds the Threshold +=========================================================== + +Description +----------- + +The system checks the HBase compaction queue size every 300 seconds. This alarm is generated when the compaction queue size exceeds the alarm threshold (**100** by default). This alarm is cleared when the compaction queue size is less than the threshold. + +.. note:: + + If the multi-instance function has been enabled in the cluster and multiple HBase services have been installed, determine the HBase service for which the alarm is generated based on the value of **ServiceName** in **Location**. For example, if the value of **ServiceName** is **HBase-1**, change the operation object in the procedure from **HBase** to **HBase-1**. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +19018 Minor Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The cluster performance may deteriorate, affecting data read and write. + +Possible Causes +--------------- + +- The number of HBase RegionServers is too small. 
+- There are excessive regions on a single RegionServer of HBase. +- The HBase RegionServer heap size is small. +- Resources are insufficient. +- Related parameters are not configured properly. + +Procedure +--------- + +**Check whether related parameters are properly configured.** + +#. Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Alarms**. On the page that is displayed, check whether the alarm whose **Alarm ID** is **19008** or **19011** exists. + + - If yes, click **View Help** next to the alarm and rectify the fault by referring to the help document. Then, go to :ref:`3 `. + - If no, go to :ref:`2 `. + +#. .. _alm-19018__li1681162011376: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase**. On the page that is displayed, click the **Configuration** tab and then the **All Configurations** sub-tab, search for **hbase.hstore.compaction.min**, **hbase.hstore.compaction.max**, **hbase.regionserver.thread.compaction.small**, and **hbase.regionserver.thread.compaction.throttle**, and set them to larger values. + +#. .. _alm-19018__li5814142393624: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Collect the fault information.** + +4. .. _alm-19018__li5351076393624: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +5. Expand the **Service** drop-down list, and select **HBase** for the target cluster. + +6. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +7. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895551.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19019_number_of_hbase_hfiles_to_be_synchronized_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19019_number_of_hbase_hfiles_to_be_synchronized_exceeds_the_threshold.rst new file mode 100644 index 0000000..3cb5a23 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19019_number_of_hbase_hfiles_to_be_synchronized_exceeds_the_threshold.rst @@ -0,0 +1,162 @@ +:original_name: ALM-19019.html + +.. _ALM-19019: + +ALM-19019 Number of HBase HFiles to Be Synchronized Exceeds the Threshold +========================================================================= + +Description +----------- + +The system checks the number of HFiles to be synchronized by the RegionServer of each HBase service instance every 30 seconds. This indicator can be viewed on the RegionServer role monitoring page. This alarm is generated when the number of HFiles to be synchronized on a RegionServer exceeds the threshold (exceeding 128 for 20 consecutive times by default). To change the threshold, choose **O&M** > **Alarm** > **Threshold Configuration** > *Name of the desired cluster* > **HBase**. This alarm is cleared when the number of HFiles to be synchronized is less than or equal to the threshold.
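The handling procedure later in this topic rebalances regions from the HBase shell. As a quick, non-authoritative sketch of those steps (assuming a hypothetical client directory **/opt/client** and a security-mode cluster):

.. code-block:: bash

   # Sketch only; /opt/client is an assumed client installation directory.
   cd /opt/client
   source bigdata_env
   kinit hbase                  # security-mode clusters only

   hbase shell                  # then, inside the shell:
   #   balancer_enabled         #   check whether the balancer is enabled
   #   balance_switch true      #   enable it if it is disabled
   #   balancer                 #   manually trigger a balancing round
   # As noted in the procedure, trigger the balancer during off-peak hours.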
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +19019 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the number of HFiles to be synchronized by a RegionServer exceeds the threshold, the number of ZNodes used by HBase exceeds the threshold, affecting the HBase service status. + +Possible Causes +--------------- + +- The network is abnormal. +- The RegionServer region distribution is unbalanced. +- The HBase service scale of the standby cluster is too small. + +Procedure +--------- + +View alarm location information. + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, locate the row containing the alarm whose **Alarm ID** is **19019**, and view the service instance and host name in **Location**. + +Check the network connection between RegionServers on active and standby clusters. + +2. Run the **ping** command to check whether the network connection between the faulty RegionServer node and the host where RegionServer of the standby cluster resides is normal. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`3 `. + +3. .. _alm-19019__li1946854011118: + + Contact the network administrator to restore the network. + +4. After the network recovers, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +Check the RegionServer region distribution in the active cluster. + +5. .. _alm-19019__li1686411529911: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase**. Click **HMaster(Active)** to go to the web UI of the HBase instance and check whether regions are evenly distributed on the Region Server. + +6. .. _alm-19019__li277716529115: + + Log in to the faulty RegionServer node as user **omm**. + +7. Run the following commands to go to the client installation directory and set the environment variable: + + **cd** *Client installation directory* + + **source bigdata_env** + + If the cluster uses the security mode, perform security authentication. Run the **kinit hbase** command and enter the password as prompted (obtain the password from the MRS cluster administrator). + +8. Run the following commands to check whether the load balancing function is enabled. + + **hbase shell** + + **balancer_enabled** + + - If yes, go to :ref:`10 `. + - If no, go to :ref:`9 `. + +9. .. 
_alm-19019__li8778145241118: + + Run the following commands in HBase Shell to enable the load balancing function and check whether the function is enabled. + + **balance_switch true** + + **balancer_enabled** + +10. .. _alm-19019__li127781952161113: + + Run the **balancer** command to manually trigger the load balancing function. + + .. note:: + + You are advised to enable and manually trigger the load balancing function during off-peak hours. + +11. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`12 `. + +Check the HBase service scale of the standby cluster. + +12. .. _alm-19019__li14354010126: + + Expand the HBase cluster, add a node, and add a RegionServer instance on the node. Then, perform :ref:`6 ` to :ref:`10 ` to enable the load balancing function and manually trigger it. + +13. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase**. Click **HMaster(Active)** to go to the web UI of the HBase instance, refresh the page, and check whether regions are evenly distributed. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`15 `. + +14. .. _alm-19019__li435514181217: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`15 `. + +**Collect the fault information.** + +15. .. _alm-19019__li193977212510: + + On FusionInsight Manager of the standby cluster, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +16. Expand the **Service** drop-down list, and select **HBase** for the target cluster. + +17. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +18. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001159690571.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19020_number_of_hbase_wal_files_to_be_synchronized_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19020_number_of_hbase_wal_files_to_be_synchronized_exceeds_the_threshold.rst new file mode 100644 index 0000000..d780a19 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-19020_number_of_hbase_wal_files_to_be_synchronized_exceeds_the_threshold.rst @@ -0,0 +1,162 @@ +:original_name: ALM-19020.html + +.. _ALM-19020: + +ALM-19020 Number of HBase WAL Files to Be Synchronized Exceeds the Threshold +============================================================================ + +Description +----------- + +The system checks the number of WAL files to be synchronized by the RegionServer of each HBase service instance every 30 seconds. This indicator can be viewed on the RegionServer role monitoring page. This alarm is generated when the number of WAL files to be synchronized on a RegionServer exceeds the threshold (exceeding 128 for 20 consecutive times by default). To change the threshold, choose **O&M** > **Alarm** > **Threshold Configuration** > *Name of the desired cluster* > **HBase** . 
This alarm is cleared when the number of WAL files to be synchronized is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +19020 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the number of WAL files to be synchronized by a RegionServer exceeds the threshold, the number of ZNodes used by HBase exceeds the threshold, affecting the HBase service status. + +Possible Causes +--------------- + +- The network is abnormal. +- The RegionServer region distribution is unbalanced. +- The HBase service scale of the standby cluster is too small. + +Procedure +--------- + +View alarm location information. + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, locate the row containing the alarm whose **Alarm ID** is **19020**, and view the service instance and host name in **Location**. + +Check the network connection between RegionServers on active and standby clusters. + +2. Run the **ping** command to check whether the network connection between the faulty RegionServer node and the host where RegionServer of the standby cluster resides is normal. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`3 `. + +3. .. _alm-19020__li1946854011118: + + Contact the network administrator to restore the network. + +4. After the network recovers, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +Check the RegionServer region distribution in the active cluster. + +5. .. _alm-19020__li11347192371020: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase**. Click **HMaster(Active)** to go to the web UI of the HBase instance and check whether regions are evenly distributed on the Region Server. + +6. .. _alm-19020__li277716529115: + + Log in to the faulty RegionServer node as user **omm**. + +7. Run the following commands to go to the client installation directory and set the environment variable: + + **cd** *Client installation directory* + + **source bigdata_env** + + If the cluster uses the security mode, perform security authentication. Run the **kinit hbase** command and enter the password as prompted (obtain the password from the MRS cluster administrator). + +8. Run the following commands to check whether the load balancing function is enabled. 
+ + **hbase shell** + + **balancer_enabled** + + - If yes, go to :ref:`10 `. + - If no, go to :ref:`9 `. + +9. .. _alm-19020__li8778145241118: + + Run the following commands in HBase Shell to enable the load balancing function and check whether the function is enabled. + + **balance_switch true** + + **balancer_enabled** + +10. .. _alm-19020__li127781952161113: + + Run the **balancer** command to manually trigger the load balancing function. + + .. note:: + + You are advised to enable and manually trigger the load balancing function during off-peak hours. + +11. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`12 `. + +Check the HBase service scale of the standby cluster. + +12. .. _alm-19020__li14354010126: + + Expand the HBase cluster, add a node, and add a RegionServer instance on the node. Then, perform :ref:`6 ` to :ref:`10 ` to enable the load balancing function and manually trigger it. + +13. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase**. Click **HMaster(Active)** to go to the web UI of the HBase instance, refresh the page, and check whether regions are evenly distributed. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`15 `. + +14. .. _alm-19020__li435514181217: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`15 `. + +**Collect the fault information.** + +15. .. _alm-19020__li193977212510: + + On FusionInsight Manager of the standby cluster, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +16. Expand the **Service** drop-down list, and select **HBase** for the target cluster. + +17. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +18. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001159847251.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-20002_hue_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-20002_hue_service_unavailable.rst new file mode 100644 index 0000000..fc209d1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-20002_hue_service_unavailable.rst @@ -0,0 +1,149 @@ +:original_name: ALM-20002.html + +.. _ALM-20002: + +ALM-20002 Hue Service Unavailable +================================= + +Description +----------- + +This alarm is generated when the Hue service is unavailable. The system checks the Hue service status every 60 seconds. + +This alarm is cleared when the Hue service is normal. 
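One branch of the procedure below verifies network connectivity from the active Hue node to the DBService hosts. A minimal example of that check, with a placeholder address:

.. code-block:: bash

   # 192.0.2.10 is a placeholder; use the DBService host IPs recorded from
   # the Instance page, and run this from the host of the active Hue instance.
   ping -c 3 192.0.2.10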
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +20002 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The system cannot provide data loading, query, and extraction services. + +Possible Causes +--------------- + +- The internal KrbServer service on which the Hue service depends is abnormal. +- The internal DBService service on which the Hue service depends is abnormal. +- The network connection to the DBService is abnormal. + +Procedure +--------- + +**Check whether the KrbServer is abnormal.** + +#. On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services**. In the service list, check whether the **KrbServer** running status is **Normal**. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 `. + +#. .. _alm-20002__li11299456195739: + + Restart the KrbServer service. + +#. Wait several minutes, and check whether **Hue Service Unavailable** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the DBService is abnormal.** + +4. .. _alm-20002__li5904191195739: + + On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services**. + +5. In the service list, check whether the **DBService** running status is **Normal**. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`6 `. + +6. .. _alm-20002__li3868925195739: + + Restart the DBService. + + .. note:: + + To restart the service, enter the FusionInsight Manager administrator password. + +7. Wait several minutes, and check whether **Hue Service Unavailable** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Check whether the network connection to the DBService is normal.** + +8. .. _alm-20002__li6429027195739: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Hue** > **Instance**, record the IP address of the active Hue. + +9. Log in to the active Hue. + +10. Run the **ping** command to check whether communication between the host that runs the active Hue and the hosts that run the DBService is normal. (Obtain the IP addresses of the hosts that run the DBService in the same way as that for obtaining the IP address of the active Hue.) + + - If yes, go to :ref:`13 `. + - If no, go to :ref:`11 `. + +11. .. _alm-20002__li44855049195739: + + Contact the administrator to restore the network. + +12. Wait several minutes, and check whether **Hue Service Unavailable** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`13 `. + +**Collect fault information.** + +13. .. _alm-20002__li66042197195739: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +14. Select the following nodes in the required cluster from the **Service** drop-down list: + + - Hue + - Controller + +15. 
Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +16. On the FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Hue**. + +17. Choose **More** > **Restart Service**, and click **OK**. + +18. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`19 `. + +19. .. _alm-20002__li39514705195739: + + Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417439.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24000_flume_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24000_flume_service_unavailable.rst new file mode 100644 index 0000000..90da762 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24000_flume_service_unavailable.rst @@ -0,0 +1,83 @@ +:original_name: ALM-24000.html + +.. _ALM-24000: + +ALM-24000 Flume Service Unavailable +=================================== + +Description +----------- + +The alarm module checks the Flume service status every 180 seconds. This alarm is generated if the Flume service is abnormal. + +This alarm is automatically cleared after the Flume service recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24000 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Flume cannot work and data transmission is interrupted. + +Possible Causes +--------------- + +All Flume instances are faulty. + +Procedure +--------- + +#. Log in to a Flume node as user **omm** and run the **ps -ef|grep "flume.role=server"** command to check whether the Flume process exists on the node. + + - If yes, go to :ref:`3 `. + - If no, restart the faulty Flume node or Flume service and go to :ref:`2 `. + +#. .. _alm-24000__li62139541105055: + + In the alarm list, check whether alarm "Flume Service Unavailable" is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +**Collect the fault information.** + +3. .. _alm-24000__li22384958105055: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +4. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +5. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. 
+ +6. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895532.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24001_flume_agent_exception.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24001_flume_agent_exception.rst new file mode 100644 index 0000000..03f9f49 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24001_flume_agent_exception.rst @@ -0,0 +1,227 @@ +:original_name: ALM-24001.html + +.. _ALM-24001: + +ALM-24001 Flume Agent Exception +=============================== + +Description +----------- + +The Flume agent instance for which the alarm is generated cannot be started. This alarm is generated when the Flume agent process is faulty (the system checks every 5 seconds) or the Flume agent fails to start (the alarm is reported immediately). + +This alarm is cleared when the Flume agent process recovers, the Flume agent starts successfully, and the alarm handling is complete. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24001 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-----------------------------------------------------------------+ +| Name | Meaning | ++=============+=================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------+-----------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-----------------------------------------------------------------+ +| AgentId | Specifies the ID of the agent for which the alarm is generated. | ++-------------+-----------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-----------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-----------------------------------------------------------------+ + +Impact on the System +-------------------- + +The Flume agent instance for which the alarm is generated cannot provide services properly, and the data transmission tasks of the instance are temporarily interrupted. Real-time data is lost during real-time data transmission. + +Possible Causes +--------------- + +- The JAVA_HOME directory does not exist or the Java permission is incorrect. +- The Flume agent directory permission is incorrect. +- The Flume agent fails to start. + +Procedure +--------- + +**Check whether the** **JAVA_HOME** **directory exists or whether the Java permission is correct.** + +#. Log in to the host for which the alarm is generated as user **root**. + +#. Run the following command to obtain the installation directory of the Flume client for which the alarm is generated: (The value of **AgentId** can be obtained from **Location** of the alarm.) + + **ps -ef|grep** *AgentId* **\| grep -v grep \| awk -F 'conf-file ' '{print $2}' \| awk -F 'fusioninsight' '{print $1}'** + +#.
Run the **su -** *Flume installation user* command to switch to the Flume installation user and run the **cd** *Flume client installation directory*\ **/fusioninsight-flume-1.9.0/conf/** command to go to the Flume configuration directory. + +#. .. _alm-24001__li62419311105615: + + Run the **cat ENV_VARS \| grep JAVA_HOME** command. + +#. Check whether the **JAVA_HOME** directory exists. If the command output in :ref:`4 ` is not empty and **ll $JAVA_HOME/** is not empty, the **JAVA_HOME** directory exists. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`6 `. + +#. .. _alm-24001__li28531951910: + + Specify a correct **JAVA_HOME** directory. + +#. .. _alm-24001__li1404949105615: + + Run the **$JAVA_HOME/bin/java -version** command to check whether the Flume agent running user has the Java execution permission. If the Java version is displayed in the command output, the Java permission meets the requirement. Otherwise, the Java permission does not meet the requirement. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`8 `. + + .. note:: + + **JAVA_HOME** is the environment variable exported during Flume client installation. You can also go to *Flume client installation directory*\ **/fusioninsight-flume-1.9.0/conf** and run the **cat ENV_VARS \| grep JAVA_HOME** command to view the variable value. + +#. .. _alm-24001__li17575375105615: + + Run the **chmod 750 $JAVA_HOME/bin/java** command to grant the Java execution permission to the Flume agent running user. + +**Check the directory permission of the Flume agent.** + +9. .. _alm-24001__li12942052171720: + + Log in to the host for which the alarm is generated as user **root**. + +10. Run the following command to switch to the Flume agent installation directory: + + **cd** *Flume client installation directory*\ **/fusioninsight-flume-1.9.0/conf/** + +11. Run the **ls -al \* -R** command to check whether any file owner is the user who running the Flume agent. + + - If yes, go to :ref:`12 `. + - If no, run the **chown** command to change the file owner to the user who runs the Flume agent. + +**Check the Flume agent configuration.** + +12. .. _alm-24001__li62476305536: + + Run the **cat properties.properties \| grep spooldir** and **cat properties.properties \| grep TAILDIR** commands to check whether the Flume source type is spoolDir or tailDir. If any command output is displayed, the Flume source type is spoolDir or tailDir. + + - If yes, go to :ref:`13 `. + - If no, go to :ref:`17 `. + +13. .. _alm-24001__li124343613141: + + Check whether the data monitoring directory exists. + + - If yes, go to :ref:`15 `. + - If no, go to :ref:`14 `. + + .. note:: + + Run the **cat properties.properties \| grep spoolDir** command to view the spoolDir monitoring directory. + + |image1| + + Run the **cat properties.properties \| grep parentDir** command to view the tailDir monitoring directory. + + |image2| + +14. .. _alm-24001__li17447826131411: + + Specify a correct data monitoring directory. + +15. .. _alm-24001__li155813021512: + + Check whether the Flume agent user has the read, write, and execute permissions on the monitoring directory specified in :ref:`13 `. + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`16 `. + + .. note:: + + Go to the monitoring directory as the Flume running user. If files can be created, the Flume running user has the read, write, and execute permissions on the monitoring directory. + +16. .. 
_alm-24001__li64671529111412: + + Run the **chmod 777** *Flume monitoring directory* command to grant the Flume agent running user the read, write, and execute permissions on the monitoring directory specified in :ref:`13 `. + +17. .. _alm-24001__li7261720101519: + + Check whether the components connected to the Flume sink are in safe mode. + + - If yes, go to :ref:`18 `. + - If no, go to :ref:`23 `. + + .. note:: + + If the sinks in the **properties.properties** configuration file are the HDFS sink and HBase sink, and the configuration file contains a keytab file, the components connected to the Flume sink are in safe mode. + + If the sink in the **properties.properties** configuration file is the kafka sink and **\*.security.protocol** is set to **SASL_PLAINTEXT** or **SASL_SSL**, Kafka connected to the Flume sink is in safe mode. + +18. .. _alm-24001__li922422181513: + + Run the **ll** *ketab path* command to check whether the keytab authentication path specified by the **\*.kerberosKeytab** parameter in the configuration file exists. + + - If yes, go to :ref:`20 `. + - If no, go to :ref:`19 `. + + .. note:: + + To view the ketab path, run the **cat properties.properties \| grep keytab** command. + + |image3| + +19. .. _alm-24001__li13851168209: + + Change the value of **kerberosKeytab** in :ref:`18 ` to the custom keytab path and go to :ref:`21 `. + +20. .. _alm-24001__li485841172212: + + Perform :ref:`18 ` to check whether the Flume agent running user has the permission to access the keytab authentication file. If the keytab path is returned, the user has the permission. Otherwise, the user does not have the permission. + + - If yes, go to :ref:`22 `. + - If no, go to :ref:`21 `. + +21. .. _alm-24001__li12245192314156: + + Run the **chmod 755** *ketab file* command to grant the read permission on the keytab file specified in :ref:`19 `, and restart the Flume process. + +22. .. _alm-24001__li8869032152012: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`23 `. + +**Collect the fault information.** + +23. .. _alm-24001__li28033577105615: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +24. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +25. Click |image4| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +26. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001088608580.png +.. |image2| image:: /_static/images/en-us_image_0000001135745627.png +.. |image3| image:: /_static/images/en-us_image_0000001135452179.png +.. 
|image4| image:: /_static/images/en-us_image_0269417447.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24003_flume_client_connection_interrupted.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24003_flume_client_connection_interrupted.rst new file mode 100644 index 0000000..4250de3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24003_flume_client_connection_interrupted.rst @@ -0,0 +1,123 @@ +:original_name: ALM-24003.html + +.. _ALM-24003: + +ALM-24003 Flume Client Connection Interrupted +============================================= + +Description +----------- + +The alarm module monitors the port connection status on the Flume server. This alarm is generated if the Flume server fails to receive a connection message from the Flume client in three consecutive minutes. + +This alarm is cleared after the Flume server receives a connection message from the Flume client. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24003 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Client IP Address | Specifies the IP address of the Flume client. | ++-------------------+---------------------------------------------------------+ +| Client Name | Specifies the agent name of the Flume client. | ++-------------------+---------------------------------------------------------+ +| Sink Name | Specifies the sink name of Flume Agent. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +The communication between the Flume client and the server fails. The Flume client cannot send data to the Flume server. + +Possible Causes +--------------- + +- The network connection between the Flume client and the server is faulty. +- The Flume client's process is abnormal. +- The Flume client is incorrectly configured. + +Procedure +--------- + +**Check the network connection between the Flume client and the server.** + +#. Log in to the host whose IP address is specified by **Flume ClientIP** in the alarm information as user **root**. +#. Run the **ping** *Flume server IP address* command to check whether the network connection between the Flume client and the server is normal. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`11 `. + +**Check whether the Flume client's process is normal.** + +3. .. _alm-24003__li331072891100: + + Log in to the host whose IP address is specified by **Flume ClientIP** in the alarm information as user **root**. + +4. Run the **ps -ef|grep flume \|grep client** command to check whether the Flume client process exists. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`11 `. + +**Check the Flume client configuration.** + +5. .. _alm-24003__li540399381100: + + Log in to the host whose IP address is specified by **Flume ClientIP** in the alarm information as user **root**. + +6. 
Run the **cd** *Flume client installation directory*\ **/fusioninsight-flume-1.9.0/conf/** command to go to Flume's configuration directory. + +7. Run the **cat properties.properties** command to query the current configuration file of the Flume client. + +8. Check whether the **properties.properties** file is correctly configured according to the configuration description of the Flume agent. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`11 `. + +9. .. _alm-24003__li636773281100: + + Modify the **properties.properties** configuration file. + +**Check whether the alarm is cleared.** + +10. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Collect the fault information.** + +11. .. _alm-24003__li228207951100: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +12. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +13. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +14. Collect logs in the **/var/log/Bigdata/flume-client** directory on the Flume client using a transmission tool. + +15. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895532.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24004_exception_occurs_when_flume_reads_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24004_exception_occurs_when_flume_reads_data.rst new file mode 100644 index 0000000..772534a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24004_exception_occurs_when_flume_reads_data.rst @@ -0,0 +1,147 @@ +:original_name: ALM-24004.html + +.. _ALM-24004: + +ALM-24004 Exception Occurs When Flume Reads Data +================================================ + +Description +----------- + +The alarm module monitors the status of Flume Source. This alarm is generated immediately when the duration in which Source fails to read the data exceeds the threshold. + +The default threshold is **0**, indicating that the threshold is disabled. You can change the threshold by modifying the **properties.properties** file in the **conf** directory. Specifically, modify the **NoDatatime** parameter of required the source. + +The alarm is cleared when Source reads the data and the alarm handling is complete. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24004 Major Yes +======== ============== ========== + +Parameters +---------- + ++---------------+-----------------------------------------------------------------+ +| Name | Meaning | ++===============+=================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++---------------+-----------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++---------------+-----------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++---------------+-----------------------------------------------------------------+ +| AgentId | Specifies the ID of the agent for which the alarm is generated. | ++---------------+-----------------------------------------------------------------+ +| ComponentType | Specifies the component type for which the alarm is generated. | ++---------------+-----------------------------------------------------------------+ +| ComponentName | Specifies the component name for which the alarm is generated. | ++---------------+-----------------------------------------------------------------+ + +Impact on the System +-------------------- + +If data is found in the data source and Flume Source continuously fails to read data, the data collection is stopped. + +Possible Causes +--------------- + +- Flume Source is faulty, so data cannot be sent. +- The network is faulty, so the data cannot be sent. + +Procedure +--------- + +**Check whether Flume Source is faulty.** + +#. Open the **properties.properties** configuration file on the local PC, search for **keyword type = spooldir** in the file, and check whether the Flume source type is spoolDir. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`3 `. + +#. .. _alm-24004__li3561010711655: + + View the spoolDir directory to check whether all files are already transferred. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + + .. note:: + + The monitoring directory of spooDir is specified by the **.spoolDir** parameter in the **properties.properties** configuration file. If all files in the monitoring directory have been transferred, the file name extension of all files in the monitoring directory is **.COMPLETED**. + +#. .. _alm-24004__li3862672011655: + + Open the **properties.properties** configuration file on the local PC, search for **org.apache.flume.source.kafka.KafkaSource** in the file, and check whether the Flume source type is Kafka. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +#. .. _alm-24004__li4027383611655: + + Check whether the topic data configured by Kafka Source has been used up. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. .. _alm-24004__li2692021011655: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Instance**. + +#. Go to the Flume instance page of the faulty node to check whether the indicator **Source Speed Metrics** in the alarm is 0. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`7 `. + +**Check the network connection between the faulty node and the node that corresponds to the Flume Source IP address.** + +7. .. _alm-24004__li5944850711655: + + Open the **properties.properties** configuration file on the local PC, search for **type = avro** in the file, and check whether the Flume source type is Avro. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`11 `. + +8. .. _alm-24004__li6550564111655: + + Log in to the faulty node as user **root**, and run the **ping** *IP address of the Flume source* command to check whether the peer host can be pinged successfully. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`9 `. + +9. .. _alm-24004__li5267986211655: + + Contact the network administrator to restore the network. + +10. In the alarm list, check whether the alarm is cleared after a period. 
+ + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +**Collect the fault information.** + +11. .. _alm-24004__li1313046711655: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +12. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +13. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +14. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417449.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24005_exception_occurs_when_flume_transmits_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24005_exception_occurs_when_flume_transmits_data.rst new file mode 100644 index 0000000..f4b1896 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24005_exception_occurs_when_flume_transmits_data.rst @@ -0,0 +1,157 @@ +:original_name: ALM-24005.html + +.. _ALM-24005: + +ALM-24005 Exception Occurs When Flume Transmits Data +==================================================== + +Description +----------- + +The alarm module monitors the capacity status of Flume Channel. The alarm is generated immediately when the duration that Channel is fully occupied exceeds the threshold or the number of times that Source fails to send data to Channel exceeds the threshold. + +The default threshold is **10**. You can change the threshold by modifying the **channelfullcount** parameter of the related channel in the **properties.properties** configuration file in the **conf** directory. + +The alarm is cleared when the space of Flume Channel is released and the alarm handling is complete. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24005 Major Yes +======== ============== ========== + +Parameters +---------- + ++---------------+-----------------------------------------------------------------------+ +| Name | Meaning | ++===============+=======================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++---------------+-----------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++---------------+-----------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++---------------+-----------------------------------------------------------------------+ +| AgentId | Specifies the ID of the agent for which the alarm is generated. | ++---------------+-----------------------------------------------------------------------+ +| ComponentType | Specifies the type of the component for which the alarm is generated. | ++---------------+-----------------------------------------------------------------------+ +| ComponentName | Specifies the component for which the alarm is generated. 
| ++---------------+-----------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the disk usage of Flume Channel increases continuously, the time required for importing data to a specified destination prolongs. When the disk usage of Flume Channel reaches 100%, the Flume agent process pauses. + +Possible Causes +--------------- + +- Flume Sink is faulty, so the data cannot be sent. +- The network is faulty, so the data cannot be sent. + +Procedure +--------- + +**Check whether Flume Sink is faulty.** + +#. Open the **properties.properties** configuration file on the local PC, search for **type = hdfs** in the file, and check whether the Flume sink type is HDFS. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`3 `. + +#. .. _alm-24005__li893062611148: + + On FusionInsight Manager, check whether **HDFS Service Unavailable** alarm is generated in the alarm list and whether the HDFS service is stopped in the service list. + + - If the alarm is reported, clear it according to the handling suggestions of ALM-14000 HDFS Service Unavailable; if the HDFS service is stopped, start it. Then, go to :ref:`7 `. + - If no, go to :ref:`7 `. + +#. .. _alm-24005__li2804053511148: + + Open the **properties.properties** configuration file on the local PC, search for **type = hbase** in the file, and check whether the Flume sink type is HBase. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +#. .. _alm-24005__li5423421711148: + + On FusionInsight Manager, check whether **HBase Service Unavailable** alarm is generated in the alarm list and whether the HBase service is stopped in the service list. + + - If the alarm is reported, clear it according to the handling suggestions of ALM-19000 HBase Service Unavailable; if the HBase service is stopped, start it. Then, go to :ref:`7 `. + - If no, go to :ref:`7 `. + +#. .. _alm-24005__li3655261711148: + + Open the **properties.properties** configuration file on the local PC, search for **org.apache.flume.sink.kafka.KafkaSink** in the file, and check whether the Flume sink type is Kafka. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`9 `. + +#. .. _alm-24005__li5047900111148: + + On FusionInsight Manager, check whether **Kafka Service Unavailable** alarm is generated in the alarm list and whether the Kafka service is stopped in the service list. + + - If the alarm is reported, clear it according to the handling suggestions of ALM-38000 Kafka Service Unavailable; if the Kafka service is stopped, start it. Then, go to :ref:`7 `. + - If no, go to :ref:`7 `. + +#. .. _alm-24005__li5165783111148: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Instance**. + +#. Go to the Flume instance page of the faulty node to check whether the indicator **Sink Speed Metrics** is 0. + + - If yes, go to :ref:`13 `. + - If no, go to :ref:`9 `. + +**Check the network connection between the faulty node and the node that corresponds to the Flume Sink IP address.** + +9. .. _alm-24005__li3789323111148: + + Open the **properties.properties** configuration file on the local PC, search for **type = avro** in the file, and check whether the Flume sink type is Avro. + + - If yes, go to :ref:`10 `. + - If no, go to :ref:`13 `. + +10. .. _alm-24005__li3657487511148: + + Log in to the faulty node as user **root**, and run the **ping** *IP address of the Flume sink* command to check whether the peer host can be pinged successfully. 
+ + - If yes, go to :ref:`13 `. + - If no, go to :ref:`11 `. + +11. .. _alm-24005__li6073842411148: + + Contact the network administrator to restore the network. + +12. In the alarm list, check whether the alarm is cleared after a period. + + - If yes, no further action is required. + - If no, go to :ref:`13 `. + +**Collect the fault information.** + +13. .. _alm-24005__li2555818811148: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +14. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +15. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +16. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895532.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24006_heap_memory_usage_of_flume_server_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24006_heap_memory_usage_of_flume_server_exceeds_the_threshold.rst new file mode 100644 index 0000000..d527510 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24006_heap_memory_usage_of_flume_server_exceeds_the_threshold.rst @@ -0,0 +1,98 @@ +:original_name: ALM-24006.html + +.. _ALM-24006: + +ALM-24006 Heap Memory Usage of Flume Server Exceeds the Threshold +================================================================= + +Description +----------- + +The system checks the heap memory usage of the Flume service every 60 seconds. This alarm is generated when the heap memory usage of the Flume instance exceeds the threshold (95% of the maximum memory) for 10 consecutive times. This alarm is cleared when the heap memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24006 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Heap memory overflow may cause service breakdown. 
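+
+Whether the process is actually approaching this state can be spot-checked on the node itself in addition to the Manager monitoring chart. The following is a supplementary sketch only; it assumes the JDK tools are available under **$JAVA_HOME/bin** and reuses the **flume.role=server** process keyword from the ALM-24000 procedure.
+
+.. code-block:: bash
+
+   # Locate the Flume server process (keyword taken from the ALM-24000 procedure).
+   FLUME_PID=$(ps -ef | grep "flume.role=server" | grep -v grep | awk '{print $2}' | head -n 1)
+   # Print heap and GC utilization five times, once per second; a persistently full old generation
+   # (O column close to 100) indicates the heap is under pressure.
+   "$JAVA_HOME/bin/jstat" -gcutil "${FLUME_PID}" 1000 5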
+ +Possible Causes +--------------- + +The heap memory of the Flume instance is overused or the heap memory is inappropriately allocated. + +Procedure +--------- + +**Check the heap memory usage.** + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, locate the row containing **Flume Heap Memory Usage Exceeds the Threshold**, and view the **Location** information. Check the name of the host for which the alarm is generated. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the target cluster* > **Services** > **Flume**. On the page that is displayed, click the **Instance** tab. On the displayed tab page, select the role corresponding to the host name for which the alarm is generated and select **Customize** from the drop-down list in the upper right corner of the chart area. Choose **Agent** and select **Flume Heap Memory Resource Percentage**. Then, click **OK**. + +#. Check whether the heap memory used by Flume reaches the threshold (95% of the maximum heap memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-24006__li11521246145513: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Service** > **Flume** > **Configuration**. On the page that is displayed, click **All Configurations** and choose **Flume** > **System**. Set **-Xmx** in the **GC_OPTS** parameter to a larger value based on site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the heap memory configured for the Flume server is insufficient for data transmission. You are advised to change the heap memory to: Channel capacity x Maximum size of a single data record x Number of channels. Note that the value of **xmx** cannot exceed the remaining memory of the node. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-24006__li42224042151734: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895532.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24007_flume_server_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24007_flume_server_direct_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..fb54f89 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24007_flume_server_direct_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,98 @@ +:original_name: ALM-24007.html + +.. 
_ALM-24007: + +ALM-24007 Flume Server Direct Memory Usage Exceeds the Threshold +================================================================ + +Description +----------- + +The system checks the direct memory usage of the Flume service every 60 seconds. This alarm is generated when the direct memory usage of the Flume instance exceeds the threshold (80% of the maximum memory) for five consecutive times. This alarm is cleared when the Flume direct memory usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24007 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Direct memory overflow may cause service breakdown. + +Possible Causes +--------------- + +The direct memory of the Flume process is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +**Check the direct memory usage.** + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, locate the row containing **Flume Direct Memory Usage Exceeds the Threshold**, and view the **Location** information. Check the name of the host for which the alarm is generated. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the target cluster* > **Services** > **Flume**. On the page that is displayed, click the **Instance** tab. On the displayed tab page, select the role corresponding to the host name for which the alarm is generated and select **Customize** from the drop-down list in the upper right corner of the chart area. Choose **Agent** and select **Flume Direct Memory Resource Percentage**. Then, click **OK**. + +#. Check whether the direct memory used by Flume reaches the threshold (80% of the maximum direct memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-24007__li10450762161055: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Service** > **Flume** > **Configuration**. On the page that is displayed, click **All Configurations** and choose **Flume** > **System**. Set **-XX:MaxDirectMemorySize** in the **GC_OPTS** parameter to a larger value based on site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the direct memory size configured for the Flume server instance cannot meet service requirements. 
You are advised to change the value of **-XX:MaxDirectMemorySize** to twice the current direct memory size or change the value based on site requirements. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-24007__d0e43963: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895532.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24008_flume_server_non_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24008_flume_server_non_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..ce09299 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24008_flume_server_non_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,98 @@ +:original_name: ALM-24008.html + +.. _ALM-24008: + +ALM-24008 Flume Server Non Heap Memory Usage Exceeds the Threshold +================================================================== + +Description +----------- + +The system checks the non-heap memory usage of the Flume service every 60 seconds. This alarm is generated when the non-heap memory usage of the Flume instance exceeds the threshold (80% of the maximum memory) for five consecutive times. This alarm is cleared when the non-heap memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24008 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Non-heap memory overflow may cause service breakdown. 
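+
+Before adjusting any configuration, the actual non-heap consumption of the Flume server process can be sampled on the node. The following is a supplementary sketch only; it assumes a JDK 8 environment (where the non-heap area is the metaspace reported by **jstat**) and reuses the **flume.role=server** process keyword from the ALM-24000 procedure.
+
+.. code-block:: bash
+
+   # Locate the Flume server process.
+   FLUME_PID=$(ps -ef | grep "flume.role=server" | grep -v grep | awk '{print $2}' | head -n 1)
+   # MC/MU are the metaspace capacity and usage in KB; sample three times, two seconds apart.
+   "$JAVA_HOME/bin/jstat" -gc "${FLUME_PID}" 2000 3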
+ +Possible Causes +--------------- + +The non-heap memory of the Flume instance is overused or the non-heap memory is inappropriately allocated. + +Procedure +--------- + +**Check non-heap memory usage.** + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, locate the row containing **Flume Non Heap Memory Usage Exceeds the Threshold**, and view the **Location** information. Check the name of the host for which the alarm is generated. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the target cluster* > **Services** > **Flume**. On the page that is displayed, click the **Instance** tab. On the displayed tab page, select the role corresponding to the host name for which the alarm is generated and select **Customize** from the drop-down list in the upper right corner of the chart area. Choose **Agent** and select **Flume Non Heap Memory Resource Percentage**. Then, click **OK**. + +#. Check whether the non-heap memory used by Flume reaches the threshold (80% of the maximum non-heap memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-24008__li29985659161559: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Service** > **Flume** > **Configuration**. On the page that is displayed, click **All Configurations** and choose **Flume** > **System**. Set **-XX: MaxPermSize** in the **GC_OPTS** parameter to a larger value based on site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the non-heap memory size configured for the Flume server instance cannot meet service requirements. You are advised to change the value of **-XX:MaxPermSize** to twice the current non-heap memory size or change the value based on site requirements. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-24008__d0e44186: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895532.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24009_flume_server_garbage_collection_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24009_flume_server_garbage_collection_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..9f912da --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24009_flume_server_garbage_collection_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,98 @@ +:original_name: ALM-24009.html + +.. 
_ALM-24009: + +ALM-24009 Flume Server Garbage Collection (GC) Time Exceeds the Threshold +========================================================================= + +Description +----------- + +The system checks the GC duration of the Flume process every 60 seconds. This alarm is generated when the GC duration of the Flume process exceeds the threshold (12 seconds by default) for five consecutive times. This alarm is cleared when the GC duration is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24009 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Flume data transmission efficiency decreases. + +Possible Causes +--------------- + +The heap memory of the Flume process is overused or inappropriately allocated, causing frequent occurrence of the GC process. + +Procedure +--------- + +**Check the GC duration.** + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, locate the row containing **GC Duration Exceeds the Threshold**, and view the **Location** information. Check the name of the host for which the alarm is generated. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the target cluster* > **Services** > **Flume**. On the page that is displayed, click the **Instance** tab. On the displayed tab page, select the role corresponding to the host name for which the alarm is generated and select **Customize** from the drop-down list in the upper right corner of the chart area. Choose **Agent** and select **Garbage Collection (GC) Duration of Flume**. Then, click **OK**. + +#. Check whether the GC duration of the Flume process collected every minute exceeds the threshold (12 seconds by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-24009__d0e44388: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Service** > **Flume** > **Configuration**. On the page that is displayed, click **All Configurations** and choose **Flume** > **System**. Set **-Xmx** in the **GC_OPTS** parameter to a larger value based on site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the heap memory configured for the Flume server is insufficient for data transmission. You are advised to change the heap memory to: Channel capacity x Maximum size of a single data record x Number of channels. 
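For example, under purely illustrative values of a 10,000-event channel capacity, an average event size of 2 KB, and three channels, the formula gives roughly 10,000 x 2 KB x 3 ≈ 60 MB, so an **-Xmx** of at least several hundred MB would typically be chosen to leave headroom.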
Note that the value of **xmx** cannot exceed the remaining memory of the node. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-24009__d0e44409: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895532.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24010_flume_certificate_file_is_invalid_or_damaged.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24010_flume_certificate_file_is_invalid_or_damaged.rst new file mode 100644 index 0000000..421608f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24010_flume_certificate_file_is_invalid_or_damaged.rst @@ -0,0 +1,113 @@ +:original_name: ALM-24010.html + +.. _ALM-24010: + +ALM-24010 Flume Certificate File Is Invalid or Damaged +====================================================== + +Description +----------- + +Flume checks whether the Flume certificate file is valid (whether the certificate exists and whether the certificate format is correct) every hour. This alarm is generated when the certificate file is invalid or damaged. This alarm is automatically cleared when the certificate file becomes valid again. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24010 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The Flume client cannot access the Flume server. + +Possible Causes +--------------- + +The Flume certificate file is invalid or damaged. + +Procedure +--------- + +**View alarm information.** + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, locate the row containing **ALM-24010 Flume Certificate File Is Invalid or Damaged**, and view the **Location** information. View the IP address of the instance for which the alarm is generated. + +**Check whether the certificate file in the system is valid. If it is not, generate a new one.** + +2. 
Log in to the node for which the alarm is generated as user **root** and run the **su - omm** command to switch to user **omm**. + +3. Run the following command to go to the Flume service certificate directory: + + **cd** **${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/conf** + +4. Run the **ls -l** command to check whether the **flume_sChat.crt** file exists. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +5. .. _alm-24010__li224142155111: + + Run the **openssl x509 -in flume_sChat.crt -text -noout** command to check whether certificate details are displayed properly. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`6 `. + +6. .. _alm-24010__li1317765214610: + + Run the following command to go to the Flume script directory: + + **cd** **${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/bin** + +7. Run the following command to generate a new certificate file. Then check whether the alarm is automatically cleared one hour later. + + **sh geneJKS.sh -f** *sNetty12@* **-g** *cNetty12@* + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +8. .. _alm-24010__li57811511185514: + + Check whether this alarm is generated again during periodic system check. + + - If yes, go to :ref:`9 `. + - If no, no further action is required. + +**Collect the fault information.** + +9. .. _alm-24010__li593632253716: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +10. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001214152530.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24011_flume_certificate_file_is_about_to_expire.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24011_flume_certificate_file_is_about_to_expire.rst new file mode 100644 index 0000000..f4ca0c7 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24011_flume_certificate_file_is_about_to_expire.rst @@ -0,0 +1,116 @@ +:original_name: ALM-24011.html + +.. _ALM-24011: + +ALM-24011 Flume Certificate File Is About to Expire +=================================================== + +Description +----------- + +Flume checks whether the Flume certificate file is about to expire every hour. This alarm is generated when the remaining validity period is at most 30 days. This alarm is automatically cleared when the remaining validity period is greater than 30 days. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24011 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. 
+ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Currently, there is no impact on the system. + +Possible Causes +--------------- + +The Flume certificate file is about to expire. + +Procedure +--------- + +**View alarm information.** + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, locate the row containing **ALM-24011 Flume Certificate Is About to Expire**, and view the **Location** information. View the IP address of the instance for which the alarm is generated. + +**Check whether the certificate file in the system is valid. If it is not, generate a new one.** + +2. Log in to the node for which the alarm is generated as user **root** and run the **su - omm** command to switch to user **omm**. + +3. Run the following command to go to the Flume service certificate directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/conf** + +4. Run the following command to check the effective time and expiration time of the Flume user certificate: + + **openssl x509 -noout -text -in flume_sChat.crt** + +5. Perform :ref:`6 ` to :ref:`7 ` during off-peak hours to update the certificate file as needed. + +6. .. _alm-24011__li1778591515212: + + Run the following command to go to the Flume script directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/bin** + +7. .. _alm-24011__li1599711402218: + + Run the following command to generate a new certificate file. Then check whether the alarm is automatically cleared one hour later. + + **sh geneJKS.sh -f** *sNetty12@* **-g** *cNetty12@* + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`8 `. + +8. .. _alm-24011__li6673192244411: + + Log in to the Flume node for which the alarm is generated as user **omm** and repeat :ref:`6 ` to :ref:`7 `. Then, check whether the alarm is automatically cleared one hour later. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`10 `. + +9. .. _alm-24011__li127861713811: + + Check whether this alarm is generated again during periodic system check. + + - If yes, go to :ref:`10 `. + - If no, no further action is required. + +**Collect the fault information.** + +10. .. _alm-24011__li593632253716: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +11. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +12. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +13. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0000001214312492.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24012_flume_certificate_file_has_expired.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24012_flume_certificate_file_has_expired.rst new file mode 100644 index 0000000..1e81cb8 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24012_flume_certificate_file_has_expired.rst @@ -0,0 +1,117 @@ +:original_name: ALM-24012.html + +.. _ALM-24012: + +ALM-24012 Flume Certificate File Has Expired +============================================ + +Description +----------- + +Flume checks whether its certificate file in the system has expired every hour. This alarm is generated when the server certificate has expired. This alarm is automatically cleared when the certificate file becomes valid again. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24012 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The Flume client cannot access the Flume server. + +Possible Causes +--------------- + +The Flume certificate file has expired. + +Procedure +--------- + +**View alarm information.** + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, locate the row containing **ALM-24012 Flume Certificate Has Expired**, and view the **Location** information. View the IP address of the instance for which the alarm is generated. + +**Check whether the certificate file in the system is valid. If it is not, generate a new one.** + +2. Log in to the node for which the alarm is generated as user **root** and run the **su - omm** command to switch to user **omm**. + +3. Run the following command to go to the Flume service certificate directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/conf** + +4. Run the following command to check the effective time and expiration time of the HA user certificate to determine whether the certificate file is still in the validity period: + + **openssl x509 -noout -text -in flume_sChat.crt** + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`5 `. + +5. .. _alm-24012__li161213411093: + + Run the following command to go to the Flume script directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/bin** + +6. .. _alm-24012__li084616591915: + + Run the following command to generate a new certificate file. Then check whether the alarm is automatically cleared one hour later. + + **sh geneJKS.sh -f** *sNetty12@* **-g** *cNetty12@* + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`7 `. + +7. .. 
_alm-24012__li172496117507: + + Log in to the Flume node for which the alarm is generated as user **omm** and repeat :ref:`5 ` to :ref:`6 `. Then, check whether the alarm is automatically cleared one hour later. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +8. .. _alm-24012__li1788491915716: + + Check whether this alarm is generated again during periodic system check. + + - If yes, go to :ref:`9 `. + - If no, no further action is required. + +**Collect the fault information.** + +9. .. _alm-24012__li593632253716: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +10. Expand the **Service** drop-down list, and select **Flume** for the target cluster. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001259272397.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24013_flume_monitorserver_certificate_file_is_invalid_or_damaged.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24013_flume_monitorserver_certificate_file_is_invalid_or_damaged.rst new file mode 100644 index 0000000..7540616 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24013_flume_monitorserver_certificate_file_is_invalid_or_damaged.rst @@ -0,0 +1,113 @@ +:original_name: ALM-24013.html + +.. _ALM-24013: + +ALM-24013 Flume MonitorServer Certificate File Is Invalid or Damaged +==================================================================== + +Description +----------- + +MonitorServer checks whether its certificate file is valid (whether the certificate exists and whether the certificate format is correct) every hour. This alarm is generated when the certificate file is invalid or damaged. This alarm is automatically cleared when the certificate file becomes valid again. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24013 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The Flume client cannot access the Flume server. + +Possible Causes +--------------- + +The MonitorServer certificate file is invalid or damaged. + +Procedure +--------- + +**View alarm information.** + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. 
On the page that is displayed, locate the row containing **ALM-24013 MonitorServer Certificate File Is Invalid or Damaged**, and view the **Location** information. View the IP address of the instance for which the alarm is generated. + +**Check whether the certificate file in the system is valid. If it is not, generate a new one.** + +2. Log in to the node for which the alarm is generated as user **root** and run the **su - omm** command to switch to user **omm**. + +3. Run the following command to go to the MonitorServer certificate file directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/conf** + +4. Run the **ls -l** command to check whether the **ms_sChat.crt** file exists: + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +5. .. _alm-24013__li224142155111: + + Run the **openssl x509 -in ms_sChat.crt -text -noout** command to check whether certificate details are displayed. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`6 `. + +6. .. _alm-24013__li20454978185: + + Run the following command to go to the Flume script directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/bin** + +7. Run the following command to generate a new certificate file. Then check whether the alarm is automatically cleared one hour later. + + **sh geneJKS.sh -m** *sKitty12@* **-n** *cKitty12@* + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +8. .. _alm-24013__li57811511185514: + + Check whether this alarm is generated again during periodic system check. + + - If yes, go to :ref:`9 `. + - If no, no further action is required. + +**Collect the fault information.** + +9. .. _alm-24013__li593632253716: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +10. Select **MonitorServer** in the required cluster for **Service**. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001214315364.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24014_flume_monitorserver_certificate_is_about_to_expire.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24014_flume_monitorserver_certificate_is_about_to_expire.rst new file mode 100644 index 0000000..cc35ae4 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24014_flume_monitorserver_certificate_is_about_to_expire.rst @@ -0,0 +1,116 @@ +:original_name: ALM-24014.html + +.. _ALM-24014: + +ALM-24014 Flume MonitorServer Certificate Is About to Expire +============================================================ + +Description +----------- + +MonitorServer checks whether its certificate file is about to expire every hour. This alarm is generated when the remaining validity period is at most 30 days. This alarm is automatically cleared when the remaining validity period is greater than 30 days. 
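+
+The remaining validity period can also be checked manually. The following is a minimal, non-authoritative sketch; it assumes the default MonitorServer certificate directory and file name (**ms_sChat.crt**) that are used later in this procedure:
+
+.. code-block::
+
+   # Go to the MonitorServer certificate directory (same path as in the procedure below).
+   cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/conf
+
+   # Print the certificate start and end dates.
+   openssl x509 -noout -dates -in ms_sChat.crt
+
+   # Exit status 1 means the certificate expires within the next 30 days (2592000 seconds),
+   # which is the condition under which this alarm is generated.
+   openssl x509 -noout -checkend 2592000 -in ms_sChat.crt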
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24014 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Currently, there is no impact on the system. + +Possible Causes +--------------- + +The MonitorServer certificate file is about to expire. + +Procedure +--------- + +**View alarm information.** + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, locate the row containing **ALM-24014 MonitorServer Certificate Is About to Expire**, and view the **Location** information. View the IP address of the instance for which the alarm is generated. + +**Check whether the certificate file in the system is valid. If it is not, generate a new one.** + +2. Log in to the node for which the alarm is generated as user **root** and run the **su - omm** command to switch to user **omm**. + +3. Run the following command to go to the MonitorServer certificate file directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/conf** + +4. Run the following command to check the effective time and expiration time of the MonitorServer user certificate: + + **openssl x509 -noout -text -in ms_sChat.crt** + +5. Perform :ref:`6 ` to :ref:`7 ` during off-peak hours to update the certificate file as needed. + +6. .. _alm-24014__li368813032118: + + Run the following command to go to the Flume script directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/bin** + +7. .. _alm-24014__li153512414216: + + Run the following command to generate a new certificate file. Then check whether the alarm is automatically cleared one hour later. + + **sh geneJKS.sh -m** *sKitty12@* **-n** *cKitty12@* + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`8 `. + +8. .. _alm-24014__li6673192244411: + + Log in to the Flume node for which the alarm is generated as user **omm** and repeat :ref:`6 ` to :ref:`7 `. Then, check whether the alarm is automatically cleared one hour later. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`10 `. + +9. .. _alm-24014__li127861713811: + + Check whether this alarm is generated again during periodic system check. + + - If yes, go to :ref:`10 `. + - If no, no further action is required. + +**Collect the fault information.** + +10. .. _alm-24014__li593632253716: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +11. Select **MonitorServer** in the required cluster for **Service**. + +12. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +13. Contact O&M personnel and provide the collected logs. 
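+
+If the alarm does not clear after the certificate is regenerated in the steps above, it can be useful to confirm locally that the new validity period was actually applied before the logs are collected. A minimal sketch, assuming the same certificate directory and file name as in this procedure:
+
+.. code-block::
+
+   cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/conf
+
+   # The modification time should match the time at which geneJKS.sh was rerun.
+   ls -l ms_sChat.crt
+
+   # The expiration date should now be more than 30 days away.
+   openssl x509 -noout -enddate -in ms_sChat.crt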
+ +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001258875319.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24015_flume_monitorserver_certificate_file_has_expired.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24015_flume_monitorserver_certificate_file_has_expired.rst new file mode 100644 index 0000000..8f9a21b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-24015_flume_monitorserver_certificate_file_has_expired.rst @@ -0,0 +1,117 @@ +:original_name: ALM-24015.html + +.. _ALM-24015: + +ALM-24015 Flume MonitorServer Certificate File Has Expired +========================================================== + +Description +----------- + +MonitorServer checks whether its certificate file in the system has expired every hour. This alarm is generated when the server certificate has expired. This alarm is automatically cleared when the MonitorServer certificate file becomes valid again. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24015 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The Flume client cannot access the Flume server. + +Possible Causes +--------------- + +The MonitorServer certificate file has expired. + +Procedure +--------- + +**View alarm information.** + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. On the page that is displayed, locate the row containing **ALM-24015 MonitorServer Certificate Has Expired**, and view the **Location** information. View the IP address of the instance for which the alarm is generated. + +**Check whether the certificate file in the system is valid. If it is not, generate a new one.** + +2. Log in to the node for which the alarm is generated as user **root** and run the **su - omm** command to switch to user **omm**. + +3. Run the following command to go to the MonitorServer certificate file directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/conf** + +4. Run the following command to check the effective time and expiration time of the user certificate to determine whether the certificate file is still in the validity period: + + **openssl x509 -noout -text -in ms_sChat.crt** + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`5 `. + +5. .. _alm-24015__li160343112815: + + Run the following command to go to the Flume script directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter_*/install/FusionInsight-Flume-*/flume/bin** + +6. .. 
_alm-24015__li68486146283: + + Run the following command to generate a new certificate file. Then check whether the alarm is automatically cleared one hour later. + + **sh geneJKS.sh -m** *sKitty12@* **-n** *cKitty12@* + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`7 `. + +7. .. _alm-24015__li172496117507: + + Log in to the Flume node for which the alarm is generated as user **omm** and repeat :ref:`5 ` to :ref:`6 `. Then, check whether the alarm is automatically cleared one hour later. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +8. .. _alm-24015__li1788491915716: + + Check whether this alarm is generated again during periodic system check. + + - If yes, go to :ref:`9 `. + - If no, no further action is required. + +**Collect the fault information.** + +9. .. _alm-24015__li593632253716: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +10. Select **MonitorServer** in the required cluster for **Service**. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001259115323.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25000_ldapserver_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25000_ldapserver_service_unavailable.rst new file mode 100644 index 0000000..9caa214 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25000_ldapserver_service_unavailable.rst @@ -0,0 +1,124 @@ +:original_name: ALM-25000.html + +.. _ALM-25000: + +ALM-25000 LdapServer Service Unavailable +======================================== + +Description +----------- + +The system checks the LdapServer service status every 30 seconds. This alarm is generated when the system detects that both the active and standby LdapServer services are abnormal. + +This alarm is cleared when the system detects that one or two LdapServer services are normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +25000 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +When this alarm is generated, no operation can be performed for the KrbServer users and LdapServer users in the cluster. For example, users, user groups, or roles cannot be added, deleted, or modified, and user passwords cannot be changed on the FusionInsight Manager portal. 
The authentication for existing users in the cluster is not affected. + +Possible Causes +--------------- + +- The node where the LdapServer service locates is faulty. +- The LdapServer process is abnormal. + +Procedure +--------- + +**Check whether the nodes where the two SlapdServer instances of the LdapServer service are located are faulty.** + +#. .. _alm-25000__li1018355492146: + + On FusionInsight Manager, choose **Cluster >** *Name of the desired cluster* **> Services** > **LdapServer** > **Instance** to go to the LdapServer instance page to obtain the host name of the node where the two SlapdServer instances locates. + +#. Choose **O&M > Alarm > Alarms**. On the **Alarm** page of the FusionInsight Manager system, check whether any alarm of **Node Fault** exists. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`6 `. + +#. .. _alm-25000__li3636395592146: + + Check whether the host name in the alarm is consistent with the :ref:`1 ` host name. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-25000__li5979927992146: + + Handle the alarm according to "ALM-12006 Node Fault". + +#. Check whether **LdapServer Service Unavailable** is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`10 `. + +**Check whether the LdapServer process is normal.** + +6. .. _alm-25000__li6664070892146: + + Choose **O&M > Alarm > Alarms**. On the **Alarm** page of the FusionInsight Manager system, check whether any alarm of **Process Fault** exists. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`10 `. + +7. .. _alm-25000__li4274962392146: + + Check whether the service and host name in the alarm are consistent with the LdapServer service and host name. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`10 `. + +8. .. _alm-25000__li4016743392146: + + Handle the alarm according to "ALM-12007 Process Fault". + +9. Check whether **LdapServer Service Unavailable** is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`10 `. + +**Collect fault information.** + +10. .. _alm-25000__li4495032292146: + + On the FusionInsight Manager, choose **O&M** > **Log > Download**. + +11. Select **LdapServer** in the required cluster from the **Service**. + +12. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +13. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417455.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25004_abnormal_ldapserver_data_synchronization.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25004_abnormal_ldapserver_data_synchronization.rst new file mode 100644 index 0000000..df4370e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25004_abnormal_ldapserver_data_synchronization.rst @@ -0,0 +1,153 @@ +:original_name: ALM-25004.html + +.. 
_ALM-25004: + +ALM-25004 Abnormal LdapServer Data Synchronization +================================================== + +Description +----------- + +The system checks the LdapServer data every 30 seconds. This alarm is generated when the data on the active and standby LdapServers of Manager is inconsistent for 12 consecutive times. This alarm is cleared when the data on the active and standby LdapServers is consistent. + +The system checks the LdapServer data every 30 seconds. This alarm is generated when the LdapServer data in the cluster is inconsistent with that on Manager for 12 consecutive times. This alarm is cleared when the data is consistent. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +25004 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +LdapServer data inconsistency occurs because the LdapServer data in Manager is damaged or the LdapServer data in the cluster is damaged. The LdapServer process with damaged data cannot provide services externally, and the authentication functions of Manager and the cluster are affected. + +Possible Causes +--------------- + +- The network of the node where the LdapServer process is located is faulty. +- The LdapServer process is abnormal. +- The OS restart damages data on LdapServer. + +Procedure +--------- + +**Check whether the network where the LdapServer nodes reside is faulty.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms**. Record the IP address of HostName in the alarm locating information as IP1 (if multiple alarms exist, record the IP addresses as IP1, IP2, and IP3 respectively). + +#. Contact O&M personnel and log in to the node corresponding to IP1. Run the **ping** command to check whether the IP address of the management plane of the active OMS node can be pinged. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`3 `. + +#. .. _alm-25004__li8942461952: + + Contact the network administrator to recover the network and check whether **Abnormal LdapServer Data Synchronization** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the LdapServer processes are normal.** + +4. .. _alm-25004__li16739739952: + + On the **Alarm** page of FusionInsight Manager, check whether the **OLdap Resource Abnormal** alarm exists. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`7 `. + +5. .. _alm-25004__li13741581952: + + Clear the alarm by following the steps provided in "ALM-12004 OLdap Resource Abnormal". + +6. Check whether **Abnormal LdapServer Data Synchronization** is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +7. .. _alm-25004__li40748590952: + + On the **Alarm** page of FusionInsight Manager, check whether **Process Fault** is generated for the LdapServer service. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`10 `. + +8. ..
_alm-25004__li12301476952: + + Handle the alarm according to "ALM-12007 Process Fault". + +9. Check whether **Abnormal LdapServer Data Synchronization** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`10 `. + +**Check whether the LdapServer processes are normal.** + +10. .. _alm-25004__li36315373952: + + On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. Record the IP address of HostName in the alarm locating information as "IP1" (if multiple alarms exist, record the IP addresses as "IP1", "IP2", and "IP3" respectively). Choose **Cluster** > *Name of the desired cluster* > **Services** > **LdapServer** > **Configurations**. Record the port number of LdapServer as "PORT". (If the IP address in the alarm locating information is the IP address of the standby management node, choose **System** > **OMS** > **oldap** > **Modify Configuration** and record the listening port number of LdapServer.) + +11. Log in to the nodes corresponding to IP1 as user **omm**. + +12. Run the following command to check whether errors are displayed in the queried information. + + **ldapsearch -H ldaps://**\ *IP1*:*PORT* **-LLL -x -D cn=root,dc=hadoop,dc=com -W -b ou=Peoples,dc=hadoop,dc=com** + + After running the command, enter the **LDAP** administrator password. Contact the system administrator to obtain the password. + + - If yes, go to :ref:`13 `. + - If no, go to :ref:`15 `. + +13. .. _alm-25004__li5200119952: + + Recover the LdapServer and OMS nodes using data backed up before the alarm is generated. + + .. note:: + + Use the OMS data and LdapServer data backed up at the same point in time to recover the data. Otherwise, the service and operation may fail. To recover data when services run properly, you are advised to manually back up the latest management data and then recover the data. Otherwise, Manager data produced between the backup point in time and the recovery point in time will be lost. + +14. Check whether alarm **Abnormal LdapServer Data Synchronization** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`15 `. + +**Collect fault information.** + +15. .. _alm-25004__li40582301952: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +16. Select **LdapServer** in the required cluster and **OmsLdapServer** from the **Service**. + +17. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +18. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417456.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25005_nscd_service_exception.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25005_nscd_service_exception.rst new file mode 100644 index 0000000..575650f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25005_nscd_service_exception.rst @@ -0,0 +1,194 @@ +:original_name: ALM-25005.html + +.. 
_ALM-25005: + +ALM-25005 nscd Service Exception +================================ + +Description +----------- + +The system checks the status of the nscd service every 60 seconds. This alarm is generated when the nscd process fails to be queried for four consecutive times (three minutes) or users in LdapServer cannot be obtained. + +This alarm is cleared when the process is restored and users in LdapServer can be obtained. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +25005 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The alarmed node may not be able to synchronize data from LdapServer. The **id** command may fail to obtain the LDAP data, affecting upper-layer services. + +Possible Causes +--------------- + +- The nscd service is not started. +- The network is faulty, and the LDAP server cannot be accessed. +- NameService is abnormal. +- Users cannot be queried because the OS executes commands too slowly. + +Procedure +--------- + +**Check whether the nscd service is started.** + +#. Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Alarms**. Record the IP address of **HostName** in **Location** of the alarm as **IP1** (if multiple alarms exist, record the IP addresses as **IP1**, **IP2**, and **IP3** respectively). + +#. Contact the O&M personnel to access the node using IP1 as user **root**. Run the **ps -ef \| grep nscd** command on the node and check whether the **/usr/sbin/nscd** process is started. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`3 `. + +#. .. _alm-25005__li600689958513: + + Run the **service nscd restart** command as user **root** to restart the nscd service. Then run the **ps -ef \| grep nscd** command to check whether the nscd service is started. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`15 `. + +#. .. _alm-25005__li67767558513: + + Wait for 5 minutes and run the **ps -ef \| grep nscd** command again as user **root**. Check whether the service exists. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`15 `. + +**Check whether the network is faulty, and whether the LDAP server can be accessed.** + +5. .. _alm-25005__li423153448513: + + Log in to the alarmed node as user **root** and run the **ping** command to check whether the network connectivity between this node and the LdapServer node is normal. + + - If yes, go to :ref:`6 `. + - If no, contact network administrators to troubleshoot the fault. + +**Check whether the NameService is normal.** + +6. .. _alm-25005__li297764118513: + + Log in to the alarmed node as user **root**. Run the **cat /etc/nsswitch.conf** command to check whether the **passwd**, **group**, **services**, **netgroup**, and **aliases** of NameService are correctly configured. + + The correct parameter configurations are as follows: + + **passwd**: **compat ldap**; **group**: **compat ldap**; **services**: **files ldap**; **netgroup**: **files ldap**; **aliases**: **files ldap** + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`9 `. + +7. ..
_alm-25005__li11806553308: + + Log in to the alarmed node as user **root**. Run the **cat /etc/nscd.conf** command to check whether the **enable-cache passwd**, **positive-time-to-live passwd**, **enable-cache group**, and **positive-time-to-live group** in the configuration file are correctly configured. + + The correct parameter configurations are as follows: + + **enable-cache passwd**: **yes**; **positive-time-to-live passwd**: **600**; **enable-cache group**: **yes**; **positive-time-to-live group**: **3600** + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`10 `. + +8. .. _alm-25005__li389947948513: + + Run the **/usr/sbin/nscd -i group** and **/usr/sbin/nscd -i passwd** commands as user **root**. Wait for 2 minutes and run the **id admin** and **id backup/manager** commands to check whether results can be queried. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`15 `. + +9. .. _alm-25005__li195824098513: + + Run the **vi /etc/nsswitch.conf** command as user **root**. Correct the configurations in :ref:`6 ` and save the file. Run the **service nscd restart** command to restart the nscd service. Wait for 2 minutes and run the **id admin** and **id backup/manager** commands to check whether results can be queried. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`15 `. + +10. .. _alm-25005__li1648032715218: + + Run the **vi /etc/nscd.conf** command as user **root**. Correct the configurations in :ref:`7 ` and save the file. Run the **service nscd restart** command to restart the nscd service. Wait for 2 minutes and run the **id admin** and **id backup/manager** commands to check whether results can be queried. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`15 `. + +11. .. _alm-25005__li551461168513: + + Log in to the FusionInsight Manager portal. Wait for 5 minutes and check whether the **nscd Service Exception** alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`12 `. + +**Check whether frame freezing occurs when running a command in the operating system.** + +12. .. _alm-25005__li1693832195142: + + Log in to the faulty node as user **root**, run the **id admin** command, and check whether the command execution takes a long time. If the command execution takes more than 3 seconds, the command execution is deemed to be slow. + + - If yes, go to :ref:`13 `. + - If no, go to :ref:`15 `. + +13. .. _alm-25005__li97084049527: + + Run the **cat /var/log/messages** command to check whether the nscd service frequently restarts or the error information "Can't contact LDAP server" exists. + + nscd exception example: + + .. code-block:: + + Feb 11 11:44:42 10-120-205-33 nscd: nss_ldap: failed to bind to LDAP server ldaps://10.120.205.55:21780: Can't contact LDAP server + Feb 11 11:44:43 10-120-205-33 ntpq: nss_ldap: failed to bind to LDAP server ldaps://10.120.205.55:21780: Can't contact LDAP server + Feb 11 11:44:44 10-120-205-33 ntpq: nss_ldap: failed to bind to LDAP server ldaps://10.120.205.92:21780: Can't contact LDAP server + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`15 `. + +14. .. _alm-25005__li3335145595227: + + Run the **vi $BIGDATA_HOME/tmp/random_ldap_ip_order** command to modify the number at the end. If the original number is an odd number, change it to an even number. If the number is an even number, change it to an odd number. + + Run the **vi /etc/ldap.conf** command to enter the editing mode, press **Insert** to start editing, and then reverse the first two IP addresses of the URI configuration item.
+ + After the modification is complete, press **Esc** to exit the editing mode and enter **:wq!** to save the settings and exit. + + Run the **service nscd restart** command to restart the nscd service. Wait 5 minutes and run the **id admin** command again. Check whether the command execution is slow. + + - If yes, go to :ref:`15 `. + - If no, log in to other faulty nodes and repeat :ref:`12 ` to :ref:`14 ` to check whether the first LdapServer node in the URI before modifying **/etc/ldap.conf** is faulty. For example, check whether the service IP address is unreachable, the network delay is too long, or other abnormal software is deployed. + +**Collect the fault information.** + +15. .. _alm-25005__li265529978513: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +16. Expand the drop-down list next to the **Service** field. In the **Services** dialog box that is displayed, select **LdapClient** for the target cluster. + +17. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +18. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895532.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25006_sssd_service_exception.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25006_sssd_service_exception.rst new file mode 100644 index 0000000..ca32471 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25006_sssd_service_exception.rst @@ -0,0 +1,186 @@ +:original_name: ALM-25006.html + +.. _ALM-25006: + +ALM-25006 Sssd Service Exception +================================ + +Description +----------- + +The system checks the status of the sssd service every 60 seconds. This alarm is generated when the sssd process fails to be queried for four consecutive times (three minutes) or users in LdapServer cannot be obtained. + +This alarm is cleared when the process is restored and users in LdapServer can be obtained. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +25006 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+------------------------------------------------------------------+ +| Name | Meaning | ++=============+==================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------+------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The alarmed node may not be able to synchronize data from LdapServer. 
The **id** command may fail to obtain the LDAP data, affecting upper-layer services. + +Possible Causes +--------------- + +- The sssd service is not started or is incorrectly started. +- The network is faulty, and the LDAP server cannot be accessed. +- NameService is abnormal. +- Users cannot be queried because the OS executes commands too slowly. + +Procedure +--------- + +**Check whether the sssd service is correctly started.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms**. Find the IP address of **HostName** in **Location** of the alarm and record it as IP1 (if multiple alarms exist, record the IP addresses as IP1, IP2, and IP3 respectively). + +#. .. _alm-25006__li872546684636: + + Contact the O&M personnel to access the node using IP1 as user **root**. Run the **ps -ef \| grep sssd** command and check whether the **/usr/sbin/sssd** process is started. + + - If the process is started, go to :ref:`3 `. + - If the process is not started, go to :ref:`4 `. + +#. .. _alm-25006__li4389913484636: + + Check whether the sssd process queried in :ref:`2 ` has three subprocesses. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`4 `. + +#. .. _alm-25006__li1838613984636: + + Run the **service sssd restart** command as user **root** to restart the sssd service. Then run the **ps -ef \| grep sssd** command to check whether the sssd process is normal. + + In the normal state, the **/usr/sbin/sssd** process has three subprocesses: **/usr/libexec/sssd/sssd_be**, **/usr/libexec/sssd/sssd_nss**, and **/usr/libexec/sssd/sssd_pam**. + + - If it exists, go to :ref:`9 `. + - If it does not exist, go to :ref:`13 `. + +**Check whether the LDAP server can be accessed.** + +5. .. _alm-25006__li9130384636: + + Log in to the alarmed node as user **root**. Run the **ping** command to check the network connectivity between this node and the LdapServer node. + + - If the network is normal, go to :ref:`6 `. + - If the network is faulty, contact network administrators to troubleshoot the fault. + +**Check whether NameService is normal.** + +6. .. _alm-25006__li4687622084636: + + Log in to the alarmed node as user **root**. Run the **cat /etc/nsswitch.conf** command and check the **passwd** and **group** configurations of NameService. + + The correct parameter configurations are as follows: **passwd: compat ldap** and **group: compat ldap**. + + - If the configurations are correct, go to :ref:`7 `. + - If the configurations are incorrect, go to :ref:`8 `. + +7. .. _alm-25006__li4799532284636: + + Run the **/usr/sbin/sss_cache -G** and **/usr/sbin/sss_cache -U** commands as user **root**. Wait for 2 minutes and run the **id admin** and **id backup/manager** commands to check whether results can be queried. + + - If results are queried, go to :ref:`9 `. + - If no result is queried, go to :ref:`13 `. + +8. .. _alm-25006__li2317384684636: + + Run the **vi /etc/nsswitch.conf** command as user **root**. Correct the configurations in :ref:`6 ` and save the file. Run the **service sssd restart** command to restart the sssd service. Wait for 2 minutes and run the **id admin** and **id backup/manager** commands to check whether results can be queried. + + - If results are queried, go to :ref:`9 `. + - If no result is queried, go to :ref:`13 `. + +9. .. _alm-25006__li4894866984636: + + Log in to the FusionInsight Manager portal. Wait for 5 minutes and check whether the **sssd Service Exception** alarm is cleared. + + - If the alarm is cleared, no further action is required.
+ - If the alarm persists, go to :ref:`10 `. + +**Check whether frame freezing occurs when running a command in the operating system.** + +10. .. _alm-25006__li44241319183338: + + Log in to the faulty node as user **root**, run the **id admin** command, and check whether the command execution takes a long time. If the command execution takes more than 3 seconds, the command execution is deemed to be slow. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`13 `. + +11. .. _alm-25006__li10247506183338: + + Run the **cat /var/log/messages** command to check whether the sssd frequently restarts or the error information **Can't contact LDAP server** exists. + + sssd restart example: + + .. code-block:: + + Feb 7 11:38:16 10-132-190-105 sssd[pam]: Shutting down + Feb 7 11:38:16 10-132-190-105 sssd[nss]: Shutting down + Feb 7 11:38:16 10-132-190-105 sssd[nss]: Shutting down + Feb 7 11:38:16 10-132-190-105 sssd[be[default]]: Shutting down + Feb 7 11:38:16 10-132-190-105 sssd: Starting up + Feb 7 11:38:16 10-132-190-105 sssd[be[default]]: Starting up + Feb 7 11:38:16 10-132-190-105 sssd[nss]: Starting up + Feb 7 11:38:16 10-132-190-105 sssd[pam]: Starting up + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`13 `. + +12. .. _alm-25006__li9709691183338: + + Run the **vi $BIGDATA_HOME/tmp/random_ldap_ip_order** command to modify the number at the end. If the original number is an odd number, change it to an even number. If the number is an even number, change it to an odd number. + + Run the **vi /etc/sssd/sssd.conf** command to reverse the first two IP addresses of the **ldap_uri** configuration item, save the settings, and exit. + + Run the **ps -ef \| grep sssd** command to query the ID of the sssd process, kill it, and run the **/usr/sbin/sssd -D -f** command to restart the sssd service. Wait 5 minutes and run the **id admin** command again. + + Check whether the command execution is slow. + + - If yes, go to :ref:`13 `. + - If no, log in to other faulty nodes and run :ref:`10 ` to :ref:`12 `. Collect logs and check whether the first ldapserver node in the ldap_uri before modifying **/etc/sssd/sssd.conf** is faulty. For example, check whether the service IP address is unreachable, the network latency is too long, or other abnormal software is deployed. + +**Collect fault information.** + +13. .. _alm-25006__li4877321484636: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +14. Select **LdapClient** in the required cluster from the **Service**. + +15. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +16. Contact the O&M personnel and send the collected fault logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269417458.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25500_krbserver_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25500_krbserver_service_unavailable.rst new file mode 100644 index 0000000..2ef4c3e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-25500_krbserver_service_unavailable.rst @@ -0,0 +1,117 @@ +:original_name: ALM-25500.html + +.. _ALM-25500: + +ALM-25500 KrbServer Service Unavailable +======================================= + +Description +----------- + +The system checks the KrbServer service status every 30 seconds. This alarm is generated when the system detects that the KrbServer service is abnormal. + +This alarm is cleared when the system detects that the KrbServer service is normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +25500 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +When this alarm is generated, no operation can be performed for the KrbServer component in the cluster. The authentication of KrbServer in other components will be affected. The running status of components that depend on KrbServer in the cluster is Bad. + +Possible Causes +--------------- + +- The node where the KrbServer service locates is faulty. +- The OLdap service is abnormal. + +Procedure +--------- + +**Check whether the node where the KrbServer service locates is faulty.** + +#. .. _alm-25500__li19872481202241: + + On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **KrbServer** > **Instance** to go to the KrbServer instance page to obtain the host name of the node where the KrbServer service locates. + +#. On the **Alarm** page of the FusionInsight Manager system, check whether any alarm of **Node Fault** exists. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`6 `. + +#. .. _alm-25500__li6642034202241: + + Check whether the host name in the alarm is consistent with the :ref:`1 ` host name. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-25500__li1133847202241: + + Handle the alarm according to "ALM-12006 Node Fault". + +#. Check whether **KrbServer Service Unavailable** is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Check whether the OLdap service is normal.** + +6. .. _alm-25500__li25144643202241: + + On the **Alarm** page of the FusionInsight Manager system, check whether any alarm of **OLdap Resource Abnormal** exists. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`9 `. + +7. .. _alm-25500__li23450230202241: + + Handle the alarm according to "ALM-12004 OLdap Resource Abnormal". + +8. 
Check whether **KrbServer Service Unavailable** is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-25500__li11814745202241: + + On the FusionInsight Manager, choose **O&M** > **Log > Download**. + +10. Select **KrbServer** in the required cluster from the **Service**. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417459.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26051_storm_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26051_storm_service_unavailable.rst new file mode 100644 index 0000000..e69c02b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26051_storm_service_unavailable.rst @@ -0,0 +1,150 @@ +:original_name: ALM-26051.html + +.. _ALM-26051: + +ALM-26051 Storm Service Unavailable +=================================== + +Description +----------- + +The system checks the Storm service status every 30 seconds. This alarm is generated when all Nimbus nodes in the cluster are abnormal and the Storm service is unavailable. + +This alarm is cleared when the Storm service recovers. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +26051 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The cluster cannot provide the Storm service, and users cannot perform new Storm tasks. + +Possible Causes +--------------- + +- The Kerberos cluster is faulty. +- The ZooKeeper cluster is faulty or suspended. +- The active and standby Nimbus nodes in the Storm cluster are abnormal + +Procedure +--------- + +**Check the status of the Kerberos cluster. (Skip this step if the normal mode is used.)** + +#. On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services**. + +#. Check whether the running status of the Kerberos service is **Normal**. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`3 `. + +#. .. _alm-26051__li46537612201537: + + See the related maintenance information of **ALM-25500 KrbServer Service Unavailable**. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check the status of the ZooKeeper cluster.** + +5. .. 
_alm-26051__li48640467201537: + + Check whether the running status of the ZooKeeper service is **Normal**. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`6 `. + +6. .. _alm-26051__li47563765201537: + + If the ZooKeeper service is stopped, start it. Otherwise, see the related maintenance information of **ALM-13000 ZooKeeper Service Unavailable**. + +7. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Check the status of the active and standby Nimbus nodes.** + +8. .. _alm-26051__li55312065201537: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Storm** > **Nimbus** to go to the Nimbus Instances page. + +9. Check whether there is only one Nimbus node in the **Active** state in **Roles**. + + - If yes, go to :ref:`13 `. + - If no, go to :ref:`10 `. + +10. .. _alm-26051__li27079930201537: + + Select two Nimbus role instances, choose **More** > **Restart Instance**, and check whether the instances restart successfully. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`13 `. + +11. .. _alm-26051__li54175710201537: + + Log in to the FusionInsight Manager portal again, choose **Cluster** > *Name of the desired cluster* > **Services** > **Storm** > **Nimbus** to check whether the running status is **Normal**. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`13 `. + +12. .. _alm-26051__li14738771201537: + + Wait for 30 seconds and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`13 `. + +**Collect fault information.** + +13. .. _alm-26051__li7146378201537: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +14. Select the following nodes in the required cluster from the **Service** drop-down list: + + - KrbServer + + .. note:: + + KrbServer logs do not need to be downloaded in normal mode. + + - ZooKeeper + - Storm + +15. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +16. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417460.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26052_number_of_available_supervisors_of_the_storm_service_is_less_than_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26052_number_of_available_supervisors_of_the_storm_service_is_less_than_the_threshold.rst new file mode 100644 index 0000000..5c55ed5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26052_number_of_available_supervisors_of_the_storm_service_is_less_than_the_threshold.rst @@ -0,0 +1,105 @@ +:original_name: ALM-26052.html + +.. _ALM-26052: + +ALM-26052 Number of Available Supervisors of the Storm Service Is Less Than the Threshold +========================================================================================= + +Description +----------- + +The system periodically checks the number of available Supervisors every 60 seconds and compares the number of available Supervisors with the threshold.
This alarm is generated when the number of available Supervisors is less than the threshold. + +You can change the threshold in **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster*. + +This alarm is cleared when the number of available Supervisors is greater than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +26052 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Existing tasks in the cluster cannot be performed. The cluster can receive new Storm tasks, but cannot perform these tasks. + +Possible Causes +--------------- + +The status of some Supervisors in the cluster is abnormal. + +Procedure +--------- + +**Check the Supervisor status.** + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Storm** > **Supervisor** to go to the Storm service management page. + +#. In **Roles**, check whether any instance whose status is **Faulty** or **Restoring** exists. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-26052__li14723901201046: + + Select Supervisor role instances whose status is **Faulty** or **Restoring**, choose **More** > **Restart Instance**, and check whether the instances restart successfully. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +#. .. _alm-26052__li58537778201046: + + Wait for 30 seconds, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + + .. note:: + + Services are interrupted when the Supervisor is being restarted. Then, services are restored after the restarting. + +**Collect fault information.** + +5. .. _alm-26052__li59911910201046: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +6. Select **Storm** and **ZooKeeper** in the required cluster from the **Service** drop-down list box. + +7. 
Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +8. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417461.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26053_storm_slot_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26053_storm_slot_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..9d6be47 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26053_storm_slot_usage_exceeds_the_threshold.rst @@ -0,0 +1,128 @@ +:original_name: ALM-26053.html + +.. _ALM-26053: + +ALM-26053 Storm Slot Usage Exceeds the Threshold +================================================ + +Description +----------- + +The system checks the slot usage every 60 seconds and compares the actual slot usage with the threshold. This alarm is generated when the slot usage is greater than the threshold. + +You can change the threshold in **O&M** > **Alarm** > **Thresholds**. + +This alarm is cleared when the slot usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +26053 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +New Storm tasks cannot be performed. + +Possible Causes +--------------- + +- The status of some Supervisors in the cluster is abnormal. 
+- The status of all Supervisors is normal, but the processing capability is insufficient. + +Procedure +--------- + +**Check the Supervisor status.** + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Storm** > **Instance** to go to the Storm instance management page. + +#. Check whether any instance whose status is **Faulty** or **Restoring** exists. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-26053__li3410841620655: + + Select Supervisor role instances whose status is **Faulty** or **Restoring**, choose **More** > **Restart Instance**, and check whether the instances restart successfully. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`10 `. + +#. .. _alm-26053__li6572378120655: + + Wait several minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Increase the number of slots in each Supervisor.** + +5. .. _alm-26053__li4446687120655: + + Log in to the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Storm** > **Configurations** > **All** **Configurations**. + +6. Increase the number of ports in the **supervisor.slots.ports** parameter of each Supervisor role and restart the instance. + +7. Wait several minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +8. .. _alm-26053__li517745320655: + + Perform capacity expansion for Supervisor. + +9. Wait several minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`10 `. + + .. note:: + + Services are interrupted when the Supervisor is being restarted. Then, services are restored after the restarting. + +**Collect fault information.** + +10. .. _alm-26053__li1692048320655: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +11. Select **Storm** and **ZooKeeper** in the required cluster from the **Service** drop-down list box. + +12. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +13. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417462.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26054_nimbus_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26054_nimbus_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..e9f549c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-26054_nimbus_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,106 @@ +:original_name: ALM-26054.html + +.. _ALM-26054: + +ALM-26054 Nimbus Heap Memory Usage Exceeds the Threshold +======================================================== + +Description +----------- + +The system checks the heap memory usage of Storm Nimbus every 30 seconds and compares the actual usage with the threshold. 
The alarm is generated when the heap memory usage of Storm Nimbus exceeds the threshold (80% of the maximum memory by default) for 5 consecutive times. + +Users can choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Storm** > **Nimbus** to change the threshold. + +The alarm is cleared when the heap memory usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +26054 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When the heap memory usage of Storm Nimbus is overhigh, frequent GCs occur. In addition, a memory overflow may occur so that the Yarn service is unavailable. + +Possible Causes +--------------- + +The heap memory of the Storm Nimbus instance on the node is overused or the heap memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +**Check the heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms** > **Heap Memory Usage of Storm Nimbus Exceeds the Threshold** > **Location**. Check the host name of the instance for which the alarm is generated. + +#. On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Storm** > **Instance**. Click the instance for which the alarm is generated to go to the page for the instance. Click the drop-down menu in the chart area and choose **Customize** > **Nimbus** > **Heap Memory Usage of Nimbus**. Click **OK**. + +#. Check whether the used heap memory of Nimbus reaches the threshold (The default value is 80% of the maximum heap memory) specified for Nimbus. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. 
_alm-26054__li1368554320217: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Storm** > **Configurations** > **All** **Configurations** > **Nimbus** > **System**. Change the value of **-Xmx** in **NIMBUS_GC_OPTS** based on site requirements, and click **Save**. Click **OK**. + + .. note:: + + - You are advised to set **-Xms** and **-Xmx** to the same value to prevent adverse impact on performance when JVM dynamically adjusts the heap memory size. + - The number of Workers grows as the Storm cluster scale increases. You can increase the value of **GC_OPTS** for Nimbus. The recommended value is as follows: If the number of Workers is 20, set **-Xmx** to a value greater than or equal to 1 GB. If the number of Workers exceeds 100, set **-Xmx** to a value greater than or equal to 5 GB. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-26054__li1113443220217: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +7. Select the following node in the required cluster from the **Service** drop-down list. + + - NodeAgent + - Storm + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417463.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27001_dbservice_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27001_dbservice_service_unavailable.rst new file mode 100644 index 0000000..f7cf97d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27001_dbservice_service_unavailable.rst @@ -0,0 +1,185 @@ +:original_name: ALM-27001.html + +.. _ALM-27001: + +ALM-27001 DBService Service Unavailable +======================================= + +Description +----------- + +The alarm module checks the DBService service status every 30 seconds. This alarm is generated when the system detects that DBService service is unavailable. + +This alarm is cleared when DBService service recovers. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +27001 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. 
+=========== ======================================================= + +Impact on the System +-------------------- + +The database service is unavailable and cannot provide data import and query functions for upper-layer services, which results in some services exceptions. + +Possible Causes +--------------- + +- The floating IP address does not exist. +- There is no active DBServer instance. +- The active and standby DBServer processes are abnormal. + +Procedure +--------- + +**Check whether the floating IP address exists in the cluster environment.** + +#. On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **DBService** > **Instance**. + +#. Check whether the active instance exists. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`9 `. + +#. .. _alm-27001__li35546035195610: + + Select the active DBServer instance and record the IP address. + +#. Log in to the host that corresponds to the preceding IP address as user **root**, and run the **ifconfig** command to check whether the DBService floating IP address exists on the node. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`9 `. + +#. .. _alm-27001__li43040801195610: + + Run the **ping** *floatip* command to check whether the DBService floating IP address can be pinged successfully. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`9 `. + +#. .. _alm-27001__li121041629131211: + + Log in to the host that corresponds to the DBService floating IP address as user **root**, and run the command to delete the floating IP address. + + **ifconfig** *interface* **down** + +#. On the FusionInsight Manager home page, choose **Cluster >** *Name of the desired cluster* > **Services** > **DBService** > **More** > **Restart Service** to restart DBService, and check whether DBService is restarted successfully. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`9 `. + +#. .. _alm-27001__li17142731195610: + + Wait for about 2 minutes and check whether the alarm is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`14 `. + +**Check the status of the active DBServer instance.** + +9. .. _alm-27001__li20066855195610: + + Select the DBServer instance whose role status is abnormal and record the IP address. + +10. On the **Alarm** page, check whether **Process Fault** occurs in the DBServer instance on the host that corresponds to the IP address. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`14 `. + +11. .. _alm-27001__li26594651195610: + + Handle the alarm according to "ALM-12007 Process Fault". + +12. Wait for about 5 minutes and check whether the alarm is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`19 `. + +**Check the status of the active and standby DBServers.** + +13. Log in to the host that corresponds to the preceding IP address as user **root**, and run the **su - omm** command to switch to user **omm**. + +14. .. _alm-27001__li6055948195610: + + Run the **cd ${DBSERVER_HOME}** command to go to the installation directory of the DBService. + +15. Run the **sh sbin/status-dbserver.sh** command to view the status of the active and standby HA processes of DBService. Determine whether the status can be viewed successfully. + + .. 
code-block:: + + HAMode + double + + NodeName HostName HAVersion StartTime HAActive HAAllResOK HARunPhase + 10_5_89_12 host01 V100R001C01 2019-06-13 21:33:09 active normal Actived + 10_5_89_66 host03 V100R001C01 2019-06-13 21:33:09 standby normal Deactived + + NodeName ResName ResStatus ResHAStatus ResType + 10_5_89_12 floatip Normal Normal Single_active + 10_5_89_12 gaussDB Active_normal Normal Active_standby + 10_5_89_66 floatip Stopped Normal Single_active + 10_5_89_66 gaussDB Standby_normal Normal Active_standby + + - If yes, go to :ref:`16 `. + - If no, go to :ref:`19 `. + +16. .. _alm-27001__li56882203195610: + + Check whether the active and standby HA processes are in the abnormal state. + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`19 `. + +17. .. _alm-27001__li30245369195610: + + On FusionInsight Manager, choose **Cluster >** *Name of the desired cluster* > **Services** > **DBService** > **More** > **Restart Service** to restart DBService, and check whether the system displays a message indicating that the restart is successful. + + - If yes, go to :ref:`18 `. + - If no, go to :ref:`19 `. + +18. .. _alm-27001__li50093336195610: + + Wait for about 2 minutes and check whether the alarm is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`19 `. + +**Collect fault information.** + +19. .. _alm-27001__li10820419195610: + + On FusionInsight Manager, choose **O&M** > **Log > Download**. + +20. Select **DBService** in the required cluster and **NodeAgent** from the **Service**. + +21. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +22. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417464.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27003_dbservice_heartbeat_interruption_between_the_active_and_standby_nodes.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27003_dbservice_heartbeat_interruption_between_the_active_and_standby_nodes.rst new file mode 100644 index 0000000..74cdc71 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27003_dbservice_heartbeat_interruption_between_the_active_and_standby_nodes.rst @@ -0,0 +1,107 @@ +:original_name: ALM-27003.html + +.. _ALM-27003: + +ALM-27003 DBService Heartbeat Interruption Between the Active and Standby Nodes +=============================================================================== + +Description +----------- + +This alarm is generated when the active or standby DBService node does not receive heartbeat messages from the peer node for 7 seconds. + +This alarm is cleared when the heartbeat recovers. 
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +27003 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------------+---------------------------------------------------------+ +| Name | Meaning | ++=========================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| Local DBService HA Name | Specifies a local DBService HA. | ++-------------------------+---------------------------------------------------------+ +| Peer DBService HA Name | Specifies a peer DBService HA. | ++-------------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +During the DBService heartbeat interruption, only one node can provide the service. If this node is faulty, no standby node is available for failover and the service is unavailable. + +Possible Causes +--------------- + +The link between the active and standby DBService nodes is abnormal. + +Procedure +--------- + +**Check whether the network between the active DBService server and the standby DBService server is normal.** + +#. In the alarm list on FusionInsight Manager, click |image1| in the row where the alarm is located in the real-time alarm list and view the standby DBService server address. +#. Log in to the active DBService server as user **root**. + +3. Run the **ping** *standby DBService heartbeat IP address* command to check whether the standby DBService server is reachable. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`4 `. + +4. .. _alm-27003__li25387710195327: + + Contact the network administrator to check whether the network is faulty. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +5. .. _alm-27003__li34675550195327: + + Rectify the network fault and check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-27003__li45543702195327: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +7. Select the following nodes in the required cluster from the **Service**: + + - DBService + - Controller + - NodeAgent + +8. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417465.png +.. 
|image2| image:: /_static/images/en-us_image_0269417466.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27004_data_inconsistency_between_active_and_standby_dbservices.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27004_data_inconsistency_between_active_and_standby_dbservices.rst new file mode 100644 index 0000000..ca35987 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27004_data_inconsistency_between_active_and_standby_dbservices.rst @@ -0,0 +1,159 @@ +:original_name: ALM-27004.html + +.. _ALM-27004: + +ALM-27004 Data Inconsistency Between Active and Standby DBServices +================================================================== + +Description +----------- + +The system checks the data synchronization status between the active and standby DBService every 10 seconds. This alarm is generated when the synchronization status cannot be queried for six consecutive times or when the synchronization status is abnormal. + +This alarm is cleared when the synchronization status becomes normal. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +27004 Critical Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------------+---------------------------------------------------------+ +| Name | Meaning | ++=========================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| Local DBService HA Name | Specifies the HA name of the local DBService. | ++-------------------------+---------------------------------------------------------+ +| Peer DBService HA Name | Specifies the HA name of the peer DBService. | ++-------------------------+---------------------------------------------------------+ +| SYNC_PERCENT | Specifies the synchronization percentage. | ++-------------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +When data is not synchronized between the active and standby DBServices, data may be lost or abnormal if the active instance becomes abnormal. + +Possible Causes +--------------- + +- The network between the active and standby nodes is unstable. +- The standby DBService is abnormal. +- The standby node disk space is full. +- The CPU usage of the GaussDB process on the active DBService node is high. You need to locate the failure cause based on logs. + +Procedure +--------- + +**Check whether the network between the active and standby nodes is normal.** + +#. On FusionInsight Manager, choose **Cluster > Services > DBService > Instance**, check the service IP address of the standby DBServer instance. 
+ +#. Log in to the active DBService node as user **root**. + +#. Run the **ping** *Standby DBService heartbeat IP address* command to check whether the standby DBService node is reachable. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`4 `. + +#. .. _alm-27004__li40047077195034: + + Contact the network administrator to check whether the network is faulty. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm-27004__li22196300195034: + + Rectify the network fault and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Check whether the standby DBService is normal.** + +6. .. _alm-27004__li53069850195034: + + Log in to the standby DBService node as user **root**. + +7. Run the **su - omm** command to switch to user **omm**. + +8. Go to the **${DBSERVER_HOME}/sbin** directory and run the **./status-dbserver.sh** command to check whether the GaussDB resource status of the standby DBService is normal. In the command output, check whether the following information is displayed in the row where **ResName** is **gaussDB**: + + For example: + + .. code-block:: + + 10_10_10_231 gaussDB Standby_normal Normal Active_standby + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`16 `. + +**Check whether the standby node disk space is full.** + +9. .. _alm-27004__li13084069195034: + + Log in to the standby DBService node as user **root**. + +10. Run the **su - omm** command to switch to user **omm**. + +11. Go to the **${DBSERVER_HOME}** directory, and run the following commands to obtain the DBService data directory: + + **cd ${DBSERVER_HOME}** + + **source .dbservice_profile** + + **echo ${DBSERVICE_DATA_DIR}** + +12. Run the **df -h** command to view the system disk partition usage information. + +13. Check whether the DBService data directory space is full. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`16 `. + +14. .. _alm-27004__li255547195034: + + Expand the disk capacity. + +15. After the disk capacity is expanded, wait 2 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`16 `. + +**Collect fault information.** + +16. .. _alm-27004__li45928829195034: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +17. In the **Service** area, select **DBService** of the target cluster and **OS**, **OS Statistics**, and **OS Performance** under **OMS**, and click **OK**. + +18. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +19. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269417468.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27005_database_connections_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27005_database_connections_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..b39d9e7 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27005_database_connections_usage_exceeds_the_threshold.rst @@ -0,0 +1,156 @@ +:original_name: ALM-27005.html + +.. _ALM-27005: + +ALM-27005 Database Connections Usage Exceeds the Threshold +========================================================== + +Description +----------- + +The system checks the usage of the number of database connections of the nodes where DBServer instances are located every 30 seconds and compares the usage with the threshold. If the usage exceeds the threshold for five consecutive times (this number is configurable, and 5 is the default value), the system generates this alarm. The default usage threshold is 90%, and you can configure it based on site requirements. + +The trigger count is configurable. This alarm is cleared in the following scenarios: + +- The trigger count is 1, and the usage of the number of database connections is less than or equal to the threshold. +- The trigger count is greater than 1, and the usage of the number of database connections is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +27005 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Upper-layer services may fail to connect to the DBService database, affecting services. 
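+
+.. note::
+
+   The commands below are a reference sketch only for manually checking how close the current connection count is to the configured maximum. They assume that the DBService database is accessed with **gsql** on port **20051** as database user **omm** (the same access method used for other DBService alarms in this guide) and that the standard PostgreSQL system view **pg_stat_activity** is available; adjust the port, user, and password to the actual environment.
+
+   .. code-block:: text
+
+      # Run on the active DBServer node as user omm (shell).
+      source $DBSERVER_HOME/.dbservice_profile
+      gsql -U omm -W <password of database user omm> -d postgres -p 20051
+
+      -- Run inside the gsql session (SQL).
+      SELECT count(*) FROM pg_stat_activity;
+      SHOW max_connections;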
+ +Possible Causes +--------------- + +- Too many database connections are used. +- The maximum number of database connections is improperly configured. +- The alarm threshold or alarm trigger count is improperly configured. + +Procedure +--------- + +**Checking whether too many database connections are used** + +#. On FusionInsight Manager, click DBService in the service list on the left navigation pane. The DBService monitoring page is displayed. + +#. Observe the number of connections used by the database user, as shown in :ref:`Figure 1 `. Based on the service scenario, reduce the number of database user connections. + + .. _alm-27005__fig821715142011: + + .. figure:: /_static/images/en-us_image_0269417469.png + :alt: **Figure 1** Number of connections used by database users + + **Figure 1** Number of connections used by database users + +#. Wait for 2 minutes and check whether the alarm is automatically cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`4 `. + +**Checking whether the maximum number of database connections is properly configured** + +4. .. _alm-27005__li96747179515: + + Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **DBService** > **Configurations**. On the displayed page, select the **All** **Configurations** tab, and increase the maximum number of database connections based on service requirements, as shown in :ref:`Figure 2 `. Click **Save**. In the displayed **Save configuration** dialog box, click **OK**. + + .. _alm-27005__fig1567417179514: + + .. figure:: /_static/images/en-us_image_0000001086795516.png + :alt: **Figure 2** Setting the maximum number of database connections + + **Figure 2** Setting the maximum number of database connections + +5. After the maximum number of database connections is changed, restart DBService (do not restart the upper-layer services). + + Procedure: Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services** > **DBService**. On the displayed page, choose **More** > **Restart** **Service**. Enter the password of the current login user and click **OK**. Do not select **Restart upper-layer services.**, and click **OK**. + +6. After the service is restarted, wait for 2 minutes and check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`7 `. + +**Checking whether the alarm threshold or trigger count is properly configured** + +7. .. _alm-27005__li66961681117: + + Log in to FusionInsight Manager and change the alarm threshold and alarm trigger count based on the actual database connection usage. + + Choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster >* **DBService > Database > Database Connections Usage (DBServer)**. In the **Database Connections Usage (DBServer)** area, click the pencil icon next to **Trigger Count**. In the displayed dialog box, change the trigger count, as shown in :ref:`Figure 3 `. + + .. note:: + + **Trigger Count**: If the usage of the number of database connections exceeds the threshold consecutively for more than the value of this parameter, an alarm is generated. + + .. _alm-27005__fig14885145323516: + + ..
figure:: /_static/images/en-us_image_0000001133372255.png + :alt: **Figure 3** Setting alarm trigger count + + **Figure 3** Setting alarm trigger count + + Based on the actual database connection usage, choose **O&M** >\ **Alarm** > **Thresholds** > *Name of the desired cluster* > **DBService > Database > Database Connections Usage (DBServer)**. In the **Database Connections Usage (DBServer)** area, click **Modify** in the **Operation** column. In the **Modify Rule** dialog box, modify the required parameters and click **OK** as shown in :ref:`Figure 4 `. + + .. _alm-27005__fig19690175212407: + + .. figure:: /_static/images/en-us_image_0000001390936180.png + :alt: **Figure 4** Set alarm threshold + + **Figure 4** Set alarm threshold + +8. Wait for 2 minutes and check whether the alarm is automatically cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`9 `. + +**Collect fault information** + +9. .. _alm-27005__li195612031415: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +10. Select **DBService** in the required cluster from the **Service**. + +11. Specify the host for collecting logs by setting the **Host** parameter that is optional. By default, all hosts are selected. + +12. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +13. Contact the O&M personnel and send the collected fault logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417473.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27006_disk_space_usage_of_the_data_directory_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27006_disk_space_usage_of_the_data_directory_exceeds_the_threshold.rst new file mode 100644 index 0000000..864879b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27006_disk_space_usage_of_the_data_directory_exceeds_the_threshold.rst @@ -0,0 +1,129 @@ +:original_name: ALM-27006.html + +.. _ALM-27006: + +ALM-27006 Disk Space Usage of the Data Directory Exceeds the Threshold +====================================================================== + +Description +----------- + +The system checks the disk space usage of the data directory on the active DBServer node every 30 seconds and compares the disk usage with the threshold. The alarm is generated when the disk space usage exceeds the threshold for five consecutive times (the default value). The number of consecutive times is configurable. The disk space usage threshold of the data directory is set to 80% by default, which is configurable as well. + +The value of **hit number** is configurable. When the value is set to **1** and the disk space usage is lower than or equal to the threshold, the alarm is cleared. When the value is greater than 1 and the disk space usage is lower than 90% of the threshold, the alarm is cleared. 
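+
+.. note::
+
+   The following commands are a reference sketch for manually checking the disk space usage of the data directory. They reuse the commands that appear in the DBService alarm procedures in this guide and assume that you are logged in to the active DBServer node as user **omm**.
+
+   .. code-block:: bash
+
+      # Locate the DBService data directory.
+      source $DBSERVER_HOME/.dbservice_profile
+      echo ${DBSERVICE_DATA_DIR}
+
+      # Check the usage of the partition that holds the data directory.
+      df -h ${DBSERVICE_DATA_DIR}
+
+      # List files larger than 500 MB that may have been written to the directory by mistake.
+      find "$DBSERVICE_DATA_DIR"/../ -type f -size +500M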
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +27006 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+=============================================================================================================================+ +| ClusterName | Specifies the cluster for which the alarm is generated. | ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ +| PartitionName | Specifies the disk partition where the alarm is generated. | ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the actual indicator value exceeds this threshold, the alarm is generated. | ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- Service processes become unavailable. +- When the disk space usage of the data directory exceeds 90%, the database reports the "Database Enters the Read-Only Mode" alarm and enters the read-only mode, which may cause service data loss. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The data volume of the database is too large or the disk configuration cannot meet service requirements, causing excessive disk usage. + +Procedure +--------- + +**Check whether the threshold is set properly.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **DBService** > **Database** > **Disk Space Usage of the Data Directory** to check whether the alarm threshold is proper (the default value 80% is a proper value). + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`2 `. + +#. .. _alm-27006__li165311142601: + + Change the alarm threshold based on the actual service situation. + +#. .. _alm-27006__li165316427014: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **DBService**. On the **Dashboard** page, view the **Disk Space Usage of the Data Directory** chart and check whether the disk space usage of the data directory is lower than the threshold. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +#. .. _alm-27006__li1553118426012: + + Wait 2 minutes and check whether the alarm is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. 
+ + **Check whether large files are incorrectly written into the disk.** + +#. .. _alm-27006__li1453211421204: + + Log in to the active DBService node as user **omm**. + +#. Run the following commands to view the files whose size exceeds 500 MB in the data directory and check whether there are large files incorrectly written into the directory: + + **source $DBSERVER_HOME/.dbservice_profile** + + **find "$DBSERVICE_DATA_DIR"/../ -type f -size +500M** + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`8 `. + +#. .. _alm-27006__li1453214421204: + + Handle the large files based on the actual scenario and check whether the alarm is cleared 2 minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + + **Collect fault information.** + +#. .. _alm-27006__li1853216425019: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +#. Expand the **Service** drop-down list, and select **DBService** for the target cluster. + +#. Specify the host for collecting logs by setting the **Host** parameter which is optional. By default, all hosts are selected. + +#. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +#. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269623978.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27007_database_enters_the_read-only_mode.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27007_database_enters_the_read-only_mode.rst new file mode 100644 index 0000000..40d8313 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-27007_database_enters_the_read-only_mode.rst @@ -0,0 +1,152 @@ +:original_name: ALM-27007.html + +.. _ALM-27007: + +ALM-27007 Database Enters the Read-Only Mode +============================================ + +Description +----------- + +The system checks the disk space usage of the data directory on the active DBServer node every 30 seconds. The alarm is generated when the disk space usage exceeds 90%. + +The alarm is cleared when the disk space usage is lower than 80%. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +27007 Critical Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+=============================================================================================================================+ +| ClusterName | Specifies the cluster for which the alarm is generated. | ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the actual indicator value exceeds this threshold, the alarm is generated. | ++-------------------+-----------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The database enters the read-only mode, causing service data loss. + +Possible Causes +--------------- + +The disk configuration cannot meet service requirements. The disk usage reaches the upper limit. + +Procedure +--------- + +**Check whether the disk space usage reaches the upper limit.** + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **DBService**. + +#. On the **Dashboard** page, view the **Disk Space Usage of the Data Directory** chart and check whether the disk space usage of the data directory exceeds 90%. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`13 `. + +#. .. _alm-27007__li203461531006: + + Log in to the active management node of the DBServer as user **omm** and run the following commands to check whether the database enters the read-only mode: + + **source $DBSERVER_HOME/.dbservice_profile** + + **gsql -U omm -W** *password* **-d postgres -p 20051** + + **show default_transaction_read_only;** + + .. note:: + + In the preceding commands, *password* indicates the password of user **omm** of the DBService database. You can run the **\\q** command to exit the database. + + Check whether the value of **default_transaction_read_only** is **on**. + + .. code-block:: text + + POSTGRES=# show default_transaction_read_only; + default_transaction_read_only + ------------------------------- + on + (1 row) + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`13 `. + +#. .. _alm-27007__li1234615311708: + + Run the following commands to open the **dbservice.properties** file: + + **source $DBSERVER_HOME/.dbservice_profile** + + **vi ${DBSERVICE_SOFTWARE_DIR}/tools/dbservice.properties** + +#. Change the value of **gaussdb_readonly_auto** to **OFF**. + +#. Run the following command to open the **postgresql.conf** file: + + **vi ${DBSERVICE_DATA_DIR**}\ **/postgresql.conf** + +#. Delete **default_transaction_read_only = on**. + +#. Run the following command for the configuration to take effect: + + **gs_ctl reload -D ${DBSERVICE_DATA_DIR**} + +#. Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Alarms**. On the right of the alarm "Database Enters the Read-Only Mode", click **Clear** in the **Operation** column. In the dialog box that is displayed, click **OK** to manually clear the alarm. + +#. Log in to the active management node of the DBServer as user **omm** and run the following commands to view the files whose size exceeds 500 MB in the data directory and check whether there are large files incorrectly written into the directory: + + **source $DBSERVER_HOME/.dbservice_profile** + + **find "$DBSERVICE_DATA_DIR"/../ -type f -size +500M** + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`13 `. + +#. .. 
_alm-27007__li534815311101: + + Handle the files that are incorrectly written into the directory based on the actual scenario. + +#. Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services** > **DBService**. On the **Dashboard** page, view the **Disk Space Usage of the Data Directory** chart and check whether the disk space usage is lower than 80%. + + - If yes, no further action is required. + - If no, go to :ref:`13 `. + +**Collect fault information.** + +13. .. _alm-27007__li133383310015: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +14. Expand the **Service** drop-down list, and select **DBService** for the target cluster. + +15. Specify the host for collecting logs by setting the **Host** parameter, which is optional. By default, all hosts are selected. + +16. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +17. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269624001.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38000_kafka_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38000_kafka_service_unavailable.rst new file mode 100644 index 0000000..03da0d6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38000_kafka_service_unavailable.rst @@ -0,0 +1,137 @@ +:original_name: ALM-38000.html + +.. _ALM-38000: + +ALM-38000 Kafka Service Unavailable +=================================== + +Description +----------- + +The system checks the Kafka service status every 30 seconds. This alarm is generated when the Kafka service is unavailable. + +This alarm is cleared when the Kafka service recovers. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +38000 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The cluster cannot provide the Kafka service, and users cannot perform new Kafka tasks. + +Possible Causes +--------------- + +- The KrbServer service is abnormal. (Skip this step if the normal mode is used.) +- The ZooKeeper service is abnormal or does not respond. +- The Broker instances in the Kafka cluster are abnormal. + +Procedure +--------- + +**Check the status of the KrbServer service. (Skip this step if the normal mode is used.)** + +#. 
On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services > KrbServer**. + +#. .. _alm-38000__li9636121161118: + + Check whether the running status of the KrbServer service is **Normal**. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`3 `. + +#. .. _alm-38000__li42328297161118: + + Rectify the fault by following the steps provided in **ALM-25500 KrbServer Service Unavailable**. + +#. Perform :ref:`2 ` again. + +**Check the status of the ZooKeeper cluster.** + +5. .. _alm-38000__li23201109161118: + + Check whether the running status of the ZooKeeper service is **Normal**. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`6 `. + +6. .. _alm-38000__li241672161118: + + If the ZooKeeper service is stopped, start it. Otherwise, rectify the fault by following the steps provided in **ALM-13000 ZooKeeper Service Unavailable**. + +7. Perform :ref:`5 ` again. + +**Check the Broker status.** + +8. .. _alm-38000__li29459861161118: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Instance** to go to the Kafka instances page. + +9. Check whether all instances in **Roles** are running properly. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`10 `. + +10. .. _alm-38000__li1856436161118: + + Select all Broker instances, choose **More** > **Restart Instance**, and check whether the instances restart successfully. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`13 `. + +11. .. _alm-38000__li65780611161118: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** to check whether the running status is **Normal**. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`13 `. + +12. .. _alm-38000__li30279690161118: + + Wait for 30 seconds and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`13 `. + +**Collect fault information.** + +13. .. _alm-38000__li62186744161118: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +14. Select **Kafka** in the required cluster from the **Service** drop-down list. + +15. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +16. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417499.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38001_insufficient_kafka_disk_capacity.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38001_insufficient_kafka_disk_capacity.rst new file mode 100644 index 0000000..bdc688e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38001_insufficient_kafka_disk_capacity.rst @@ -0,0 +1,258 @@ +:original_name: ALM-38001.html + +.. _ALM-38001: + +ALM-38001 Insufficient Kafka Disk Capacity +========================================== + +Description +----------- + +The system checks the Kafka disk usage every 60 seconds and compares the actual disk usage with the threshold. The disk usage has a default threshold.
This alarm is generated when the disk usage is greater than the threshold. + +You can change the threshold in **O&M** > **Alarm** > **Thresholds**. Under the service list, choose **Kafka > Disk > Broker Disk Usage (Broker)** and change the threshold. + +When the **Trigger Count** is 1, this alarm is cleared when the Kafka disk usage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the Kafka disk usage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +38001 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| PartitionName | Specifies the disk partition where the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Kafka data write operations are affected. + +Possible Causes +--------------- + +- The configuration (such as number and size) of the disks for storing Kafka data cannot meet the requirement of the current service traffic, due to which the disk usage reaches the upper limit. +- Data retention time is too long, due to which the data disk usage reaches the upper limit. +- The service plan does not distribute data evenly, due to which the usage of some disks reaches the upper limit. + +Procedure +--------- + +**Check the disk configuration of Kafka data.** + +#. On the FusionInsight Manager portal and click **O&M** > **Alarm** > **Alarms**. + +#. .. _alm-38001__lad42a611d3f7476ebbdcc2c5b11228e3: + + In the alarm list, locate the alarm and obtain **HostName** from **Location**. + +#. Click **Cluster** > *Name of the desired cluster* > **Hosts**. + +#. In the host list, click the host name obtained in :ref:`2 `. + +#. 
Check whether the **Disk** area contains the partition name in the alarm. + + - If yes, go to :ref:`6 `. + - If no, manually clear the alarm and no further operation is required. + +#. .. _alm-38001__en-us_topic_0070543586_step6a: + + Check whether the disk partition usage contained in the alarm reaches 100% in the **Disk** area. + + - If yes, handle the alarm by following the instructions in :ref:`Related Information `. + - If no, go to :ref:`7 `. + +**Check the Kafka data storage duration.** + +7. .. _alm-38001__li460742117451: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Configurations**. + +8. Check whether the value of the **disk.adapter.enable** parameter is set to **true**. + + - If yes, go to :ref:`10 `. + - If no, go to :ref:`9 `. + +9. .. _alm-38001__li96071921164519: + + Set the value of **disk.adapter.enable** to **true**. Check whether the value of **adapter.topic.min.retention.hours** is properly set. + + - If yes, go to :ref:`10 `. + - If no, adjust the data retention period based on service requirements. + + .. important:: + + If the disk auto-adaptation function is enabled, some historical data of specified topics is deleted. If the retention period of some topics cannot be adjusted, click **All Configurations** and add the topics to the value of the **disk.adapter.topic.blacklist** parameter. + +10. .. _alm-38001__li0608162194512: + + Wait for 10 minutes and check whether the usage of the faulty disks decreases. + + - If yes, wait until the alarm is cleared. + - If no, go to :ref:`11 `. + +**Check the Kafka data plan.** + +11. .. _alm-38001__li146841724812: + + In the **Instance** area, click **Broker**. In the **Real Time** area of Broker, click the drop-down menu in the chart area and choose **Customize** to customize monitoring items. + +12. .. _alm-38001__li1681217164815: + + In the dialog box, select **Disk** > **Broker Disk Usage** and click **OK**. + + The Kafka disk usage information is displayed. + +13. View the information in :ref:`12 ` to check whether the disk partition for which the alarm is generated in :ref:`2 ` is the only disk partition. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`15 `. + +14. .. _alm-38001__li76811719488: + + Perform disk planning and mount a new disk. Go to the **Instance Configurations** page of the node for which the alarm is generated, modify **log.dirs**, add other disk directories, and restart the Kafka instance. + +15. .. _alm-38001__li4681517154817: + + Determine whether to shorten the data retention time configured on Kafka based on service requirements and service traffic. + + - If yes, go to :ref:`16 `. + - If no, go to :ref:`17 `. + +16. .. _alm-38001__li3691217164814: + + Log in to FusionInsight Manager, select **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Configurations**, and click **All Configurations**. In the search box on the right, enter **log.retention.hours**. The value of this parameter indicates the default data retention time of topics. You can change the value to a smaller one. + + .. note:: + + - For a topic whose data retention time is configured separately, changing the data retention time on the Kafka service configuration page does not take effect. + + - To modify the data retention time for a topic, use the Kafka client command-line interface (CLI) to configure the topic.
+ + Example: **kafka-topics.sh --zookeeper "**\ *ZooKeeper IP address*\ **:2181/kafka" --alter --topic "**\ *Topic name*\ **" --config retention.ms= "**\ *retention time*\ **"** + +17. .. _alm-38001__li86921715482: + + Check whether the usage of some disks reaches the upper limit because the partitions of some topics are improperly configured. For example, the number of partitions configured for a topic with a large data volume is smaller than the number of disks. In this case, the data is not evenly allocated to the disks. + + .. note:: + + If you do not know which topic has a large data volume, you can log in to an instance node based on the host information obtained in :ref:`2 `, and go to the data directory (the directory specified by **log.dirs** before the modification in :ref:`14 `) to check whether there are topics whose partitions use a large amount of disk space. + + - If yes, go to :ref:`18 `. + - If no, go to :ref:`19 `. + +18. .. _alm-38001__li106991718484: + + In the Kafka client CLI, run the following command to perform partition capacity expansion for the topic: + + **kafka-topics.sh --zookeeper "**\ *ZooKeeper IP address*\ **:2181/kafka" --alter --topic "**\ *Topic name*\ **" --partitions="**\ *New number of partitions*\ **"** + + .. note:: + + - You are advised to set the new number of partitions to a multiple of the number of Kafka data disks. + - This step may not clear the alarm quickly. You also need to modify the data retention time in :ref:`16 ` to gradually balance data allocation. + +19. .. _alm-38001__li6701817194816: + + Determine whether to perform capacity expansion. + + .. note:: + + You are advised to perform capacity expansion for Kafka when the current disk usage exceeds 80%. + + - If yes, go to :ref:`20 `. + - If no, go to :ref:`21 `. + +20. .. _alm-38001__li1670517124811: + + Expand the disk capacity and check whether the alarm is cleared after capacity expansion. + + - If yes, no further action is required. + - If no, go to :ref:`22 `. + +21. .. _alm-38001__li1170111717488: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`22 `. + +**Collect fault information.** + +22. .. _alm-38001__li1311215881510: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +23. Select **Kafka** in the required cluster from the **Service** drop-down list. + +24. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +25. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +.. _alm-38001__sc5883464b7cf4074a814e9859261a5c6: + +Related Information +------------------- + +#. Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Instance**, stop the Broker instance whose status is **Restoring**, record the management IP address of the node where the Broker instance is located, and record **broker.id**. The value can be obtained by using the following method: Click the role name. On the **Configurations** page, select **All Configurations**, and search for the **broker.id** parameter. + +#. Log in to the recorded management IP address as user **root**, and run the **df -lh** command to view the mounted directory whose disk usage is 100%, for example, **${BIGDATA_DATA_HOME}/kafka/data1**.
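+
+   The following is a minimal, illustrative sketch of this check. The device name, sizes, and mount point are made-up examples rather than output from a real cluster; the actual data directory depends on how **${BIGDATA_DATA_HOME}** and **log.dirs** are configured in your environment.
+
+   .. code-block:: console
+
+      # Run as user root on the recorded management IP address.
+      df -lh
+      # Filesystem      Size  Used  Avail  Use%  Mounted on
+      # /dev/vdb1       500G  500G     0   100%  /srv/BigData/kafka/data1
+
+   The partition whose **Use%** value is 100% is the data directory to be handled in the subsequent steps.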
+ +#. Go to the directory, run the **du -sh \*** command to view the size of each file in the directory,check whether files other than **kafka-logs** exist, and determine whether these files can be deleted or migrated. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`4 `. + +#. .. _alm-38001__l6b4a3aa101714691aebfd7f69ccfc8d4: + + Go to the **kafka-logs** directory, run the **du -sh \*** command, select a partition folder to be moved. The naming rule is **Topic name-Partition ID**. Record the topic and partition. + +#. .. _alm-38001__l847204e787034666b0ffc45eaaaf2cd4: + + Modify the **recovery-point-offset-checkpoint** and **replication-offset-checkpoint** files in the **kafka-logs** directory in the same way. + + a. Decrease the number in the second line in the file. (To remove multiple directories, the number deducted is equal to the number of files to be removed.) + b. Delete the line of the to-be-removed partition. (The line structure is "Topic name Partition ID Offset". Save the data before deletion. Subsequently, the content must be added to the file of the same name in the destination directory.) + +#. Modify the **recovery-point-offset-checkpoint** and **replication-offset-checkpoint** files in the destination data directory. For example, **${BIGDATA_DATA_HOME}/kafka/data2/kafka-logs** in the same way. + + - Increase the number in the second line in the file. (To move multiple directories, the number added is equal to the number of files to be moved.) + - Add the to-be moved partition to the end of the file. (The line structure is "Topic name Partition ID Offset". You can copy the line data saved in :ref:`5 `.) + +#. Move the partition to the destination directory. After the partition is moved, run the **chown omm:wheel -R** *Partition directory* command to modify the directory owner group for the partition. + +#. .. _alm-38001__le5f408260b7c4eaea839d9f216e3039b: + + Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Instance** to start the Broker instance. + +#. Wait for 5 to 10 minutes and check whether the health status of the Broker instance is **Normal**. + + - If yes, resolve the disk capacity insufficiency problem according to the handling method of "ALM-38001 Insufficient Kafka Disk Space" after the alarm is cleared. + - If no, contact the O&M personnel. + +.. |image1| image:: /_static/images/en-us_image_0269417500.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38002_kafka_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38002_kafka_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..fc597b5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38002_kafka_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,105 @@ +:original_name: ALM-38002.html + +.. _ALM-38002: + +ALM-38002 Kafka Heap Memory Usage Exceeds the Threshold +======================================================= + +Description +----------- + +The system checks the Kafka service status every 30 seconds. The alarm is generated when the heap memory usage of a Kafka instance exceeds the threshold (95% of the maximum memory) for 10 consecutive times. 
+ +When the **Trigger Count** is 1, this alarm is cleared when the heap memory usage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the heap memory usage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +38002 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available Kafka heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The heap memory of the Kafka instance is overused or the heap memory is inappropriately allocated. + +Procedure +--------- + +**Check heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms** > **Kafka Heap Memory Usage Exceeds the Threshold** > **Location**. Check the host name of the instance involved in this alarm. + +#. .. _alm-38002__li118928315563: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Instance**. Click the instance for which the alarm is generated to go to the page for the instance. Click the drop-down list in the upper right corner of the chart area, choose **Customize** > **Process** > **Heap Memory Usage of Kafka**, and click **OK**. + +#. Check whether the used heap memory of Kafka reaches 95% of the maximum heap memory specified for Kafka. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +**Check the heap memory size configured for Kafka.** + +4. .. _alm-38002__li1593445465720: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Configurations** > **All** **Configurations**> **Broker(Role)** > **Environment**. 
Increase the value of **KAFKA_HEAP_OPTS** by referring to the Note. + + .. note:: + + - It is recommended that **-Xmx** and **-Xms** be set to the same value. + - You are advised to view **Heap Memory Usage of Kafka** by referring to :ref:`2 `, and set the value of **KAFKA_HEAP_OPTS** to twice the value of **Heap Memory Used by Kafka.** + +5. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-38002__li3623590715563: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +7. Select **Kafka** in the required cluster from the **Service** drop-down list. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417501.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38004_kafka_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38004_kafka_direct_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..387685e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38004_kafka_direct_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,107 @@ +:original_name: ALM-38004.html + +.. _ALM-38004: + +ALM-38004 Kafka Direct Memory Usage Exceeds the Threshold +========================================================= + +Description +----------- + +The system checks the direct memory usage of the Kafka service every 30 seconds. This alarm is generated when the direct memory usage of a Kafka instance exceeds the threshold (80% of the maximum memory) for 10 consecutive times. + +When the **Trigger Count** is 1, this alarm is cleared when the direct memory usage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the direct memory usage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +38004 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available direct memory of the Kafka service is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct memory of the Kafka instance is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +**Check the direct memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms** > **Kafka Direct Memory Usage Exceeds the Threshold** > **Location** to check the host name of the instance for which the alarm is generated. + +#. .. _alm-38004__li28837961155343: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Instance**. Click the instance for which the alarm is generated to go to the page for the instance. Click the drop-down menu in the Chart area and choose **Customize** > **Process** > **Kafka** **Direct Memory Usage**, and click **OK**. + +#. Check whether the used direct memory of Kafka reaches 80% of the maximum direct memory specified for Kafka. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +**Check the direct memory size configured for the Kafka.** + +4. .. _alm-38004__li11113491818: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Configurations** > **All** **Configurations** > **Broker(Role)** > **Environment** to increase the value of **-Xmx** configured in the **KAFKA_HEAP_OPTS** parameter by referring to the Note. + + .. note:: + + - It is recommended that **-Xmx** and **-Xms** be set to the same value. + - You are advised to view **Kafka** **Direct Memory Usage** by referring to :ref:`2 `, and set the value of **KAFKA_HEAP_OPTS** to twice the value of **Direct Memory Used by Kafka.** + +5. Save the configuration and restart the Kafka service. + +6. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-38004__li52892950155343: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +8. Select **Kafka** in the required cluster from the **Service** drop-down list. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. 
+ +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417502.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38005_gc_duration_of_the_broker_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38005_gc_duration_of_the_broker_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..d78567b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38005_gc_duration_of_the_broker_process_exceeds_the_threshold.rst @@ -0,0 +1,106 @@ +:original_name: ALM-38005.html + +.. _ALM-38005: + +ALM-38005 GC Duration of the Broker Process Exceeds the Threshold +================================================================= + +Description +----------- + +The system checks the garbage collection (GC) duration of the Broker process every 60 seconds. This alarm is generated when the GC duration exceeds the threshold (12 seconds by default) for 3 consecutive times. + +When the **Trigger Count** is 1, this alarm is cleared when the GC duration is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the GC duration is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +38005 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +A long GC duration of the Broker process may interrupt the services. + +Possible Causes +--------------- + +The Kafka GC duration of the node is too long or the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC duration.** + +#. 
On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms** > **GC Duration of the Broker Process Exceeds the Threshold** > **Location**. Check the host name of the instance involved in this alarm. +#. On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Instance**. Click the instance for which the alarm is generated to go to the page for the instance. Click the drop-down list in the upper right corner of the chart area, choose **Customize** > **Process** > **Broker GC Duration per Minute**, and click **OK**. +#. Check whether the GC duration of the Broker process collected every minute exceeds the threshold (12 seconds by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`7 `. + +**Check the direct memory size configured for the Kafka.** + +4. .. _alm-38005__li759117561678: + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Configurations** > **All** **Configurations** > **Broker(Role)** > **Environment** to increase the value of **-Xmx** configured in the **KAFKA_HEAP_OPTS** parameter by referring to the Note. + + .. note:: + + - It is recommended that **-Xmx** and **-Xms** be set to the same value. + + - You are advised to set the value of **KAFKA_HEAP_OPTS** to twice the value of **Direct Memory Used by Kafka.** + + On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Instance**. Click the instance for which the alarm is generated to go to the page for the instance. Click the drop-down list in the upper right corner of the chart area and choose **Customize** > **Process** > **Kafka Direct Memory Resource Status** to check the value of **Direct Memory Used by Kafka**. + +5. Save the configuration and restart the Kafka service. + +6. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-38005__li3395370155050: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +8. Select **Kafka** in the required cluster from the **Service** drop-down list. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417503.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38006_percentage_of_kafka_partitions_that_are_not_completely_synchronized_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38006_percentage_of_kafka_partitions_that_are_not_completely_synchronized_exceeds_the_threshold.rst new file mode 100644 index 0000000..207c91d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38006_percentage_of_kafka_partitions_that_are_not_completely_synchronized_exceeds_the_threshold.rst @@ -0,0 +1,109 @@ +:original_name: ALM-38006.html + +.. 
_ALM-38006: + +ALM-38006 Percentage of Kafka Partitions That Are Not Completely Synchronized Exceeds the Threshold +=================================================================================================== + +Description +----------- + +The system checks the percentage of Kafka partitions that are not completely synchronized to the total number of partitions every 60 seconds. This alarm is generated when the percentage exceeds the threshold (50% by default) for 3 consecutive times. + +When the **Trigger Count** is 1, this alarm is cleared when the percentage is less than or equal to the threshold. When the **Trigger Count** is greater than 1, this alarm is cleared when the percentage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +38006 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Too many Kafka partitions that are not completely synchronized affect service reliability. In addition, data may be lost when leaders are switched. + +Possible Causes +--------------- + +Some nodes where the Broker instance resides are abnormal or stop running. As a result, replicas of some partitions in Kafka are out of the in-sync replicas (ISR) set. + +Procedure +--------- + +**Check Broker instances.** + +#. On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Instance**. The Kafka instances page is displayed. + +#. .. _alm-38006__li1743646315478: + + Check whether faulty nodes exist among all Broker nodes. + + - If yes, record the host name of the node and go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-38006__li2760667615478: + + On the FusionInsight Manager portal, click **O&M** > **Alarm** > **Alarms** to check whether the fault described in :ref:`2 ` exists in the alarm information and handle the alarm based on corresponding methods. + +#. On the FusionInsight Manager portal, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Instance**. The Kafka instances page is displayed. 
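+
+   .. note::
+
+      As an optional cross-check (a sketch only; the client installation path and ZooKeeper addresses are examples and depend on your deployment), you can list the partitions that are not completely synchronized from the Kafka client CLI:
+
+      **/opt/client/Kafka/kafka/bin/kafka-topics.sh --describe --under-replicated-partitions --zookeeper 192.168.0.90:2181,192.168.0.91:2181,192.168.0.92:2181/kafka**
+
+      Empty output indicates that all replicas have rejoined the ISR set, and the alarm should be cleared shortly afterwards.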
+ +#. .. _alm-38006__li6648467215478: + + Check whether stopped nodes exist among all Broker instance. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`7 `. + +#. .. _alm-38006__li1472641115478: + + Select all stopped Broker instances and click **Start Instance**. + +#. .. _alm-38006__li5037705315478: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Collect fault information.** + +8. .. _alm-38006__li1632355415478: + + On the FusionInsight Manager portal, choose **O&M** > **Log** > **Download**. + +9. Select **Kafka** in the required cluster from the **Service** drop-down list. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417504.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38007_status_of_kafka_default_user_is_abnormal.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38007_status_of_kafka_default_user_is_abnormal.rst new file mode 100644 index 0000000..8a92c65 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38007_status_of_kafka_default_user_is_abnormal.rst @@ -0,0 +1,114 @@ +:original_name: ALM-38007.html + +.. _ALM-38007: + +ALM-38007 Status of Kafka Default User Is Abnormal +================================================== + +Description +----------- + +The system checks the default user of Kafka every 60 seconds. This alarm is generated when the system detects that the user status is abnormal. + +**Trigger Count** is set to **1**. This alarm is cleared when the user status becomes normal. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +38007 Critical Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------+ +| HostName | Specifies the host name for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------+ +| Trigger Condition | Specifies the condition that the Kafka default user status is abnormal. 
| ++-------------------+-------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the Kafka default user status is abnormal, metadata synchronization between Brokers and interaction between Kafka and ZooKeeper will be affected, affecting service production, consumption, and topic creation and deletion. + +Possible Causes +--------------- + +- The Sssd service is abnormal. +- Some Broker instances stop running. + +Procedure +--------- + +**Check whether the Sssd service is abnormal.** + +#. On the FusionInsight Manager portal, choose **O&M** > **Alarm** > **Alarms** > **Status of Kafka Default User Is Abnormal** > **Location** to check the host name of the instance for which the alarm is generated. + +#. Find the host information in the alarm information and log in to the host. + +#. Run the **id -Gn kafka** command and check whether "No such user" is displayed in the command output. + + - If yes, record the host name of the node and go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-38007__li1465202115440: + + On the FusionInsight Manager home page, choose **O&M** > **Alarm** > **Alarms**. Check whether there is **Sssd Service Exception** in the alarm information. If there is, handle the alarm based on alarm information. + +**Check the running status of the Broker instance.** + +5. On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Instance**. The Kafka instance page is displayed. + +6. .. _alm-38007__li14809510142018: + + Check whether there are stopped nodes on all Broker instances. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`8 `. + +7. .. _alm-38007__li9809111022018: + + Select all stopped Broker instances and click **Start Instance**. + +8. .. _alm-38007__li4809151011206: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-38007__li783366415440: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +10. In the **Service** area, select **Kafka** in the required cluster. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417505.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38008_abnormal_kafka_data_directory_status.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38008_abnormal_kafka_data_directory_status.rst new file mode 100644 index 0000000..248cf8e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38008_abnormal_kafka_data_directory_status.rst @@ -0,0 +1,116 @@ +:original_name: ALM-38008.html + +.. _ALM-38008: + +ALM-38008 Abnormal Kafka Data Directory Status +============================================== + +Description +----------- + +The system checks the Kafka data directory status every 60 seconds. 
This alarm is generated when the system detects that the status of a data directory is abnormal. + +**Trigger Count** is set to **1**. This alarm is cleared when the data directory status becomes normal. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +38008 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+---------------------------------------------------------------------------+ +| Name | Meaning | ++===================+===========================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------------------------+ +| HostName | Specifies the host name for which the alarm is generated. | ++-------------------+---------------------------------------------------------------------------+ +| DirName | Specifies the directory name for which the alarm is generated. | ++-------------------+---------------------------------------------------------------------------+ +| Trigger Condition | Specifies the condition that the Kafka data directory status is abnormal. | ++-------------------+---------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the Kafka data directory status is abnormal, the current replicas of all partitions in the data directory are brought offline, and the data directory status of multiple nodes is abnormal at the same time. As a result, some partitions may become unavailable. + +Possible Causes +--------------- + +- The data directory permission is tampered with. +- The disk where the data directory is located is faulty. + +Procedure +--------- + +**Check the permission on the faulty data directory.** + +#. Find the host information in the alarm information and log in to the host. + +#. .. _alm-38008__li1654108315440: + + In the alarm information, check whether the data directory and its subdirectories belong to the omm:wheel group. + + - If yes, record the host name of the node and go to :ref:`4 `. + - If no, go to :ref:`3 `. + +#. .. _alm-38008__li1465202115440: + + Restore the owner group of the data directory and its subdirectories to omm:wheel. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`5 `. + +**Check whether the disk where the data directory is located is faulty.** + +4. .. _alm-38008__li7931254192720: + + In the upper-level directory of the data directory, create and delete files as user **omm**. Check whether data read/write on the disk is normal. + +5. .. _alm-38008__li4931105420275: + + Replace or repair the disk where the data directory is located to ensure that data read/write on the disk is normal. + +6. .. _alm-38008__li1893217544278: + + On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka** > **Instance**. On the Kafka instance page that is displayed, restart the Broker instance on the host recorded in :ref:`2 `. + +7. 
After Broker is started, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Collect fault information.** + +8. .. _alm-38008__li783366415440: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +9. In the **Service** area, select **Kafka** in the required cluster. + +10. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417506.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38009_busy_broker_disk_i_os_applicable_to_versions_later_than_mrs_3.1.0.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38009_busy_broker_disk_i_os_applicable_to_versions_later_than_mrs_3.1.0.rst new file mode 100644 index 0000000..c90deee --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38009_busy_broker_disk_i_os_applicable_to_versions_later_than_mrs_3.1.0.rst @@ -0,0 +1,146 @@ +:original_name: ALM-38009.html + +.. _ALM-38009: + +ALM-38009 Busy Broker Disk I/Os (Applicable to Versions Later Than MRS 3.1.0) +============================================================================= + +.. note:: + + This section applies to versions later than MRS 3.1.0. + +Description +----------- + +The system checks the I/O status of each Kafka disk every 60 seconds. This alarm is generated when the disk I/O of a Kafka data directory on a broker exceeds the threshold (80% by default). + +Its **Trigger Count** is **3**. This alarm is cleared when the disk I/O is lower than the threshold (80% by default). + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +38009 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------+ +| DataDirectoryName | Specifies the name of the Kafka data directory with frequent disk I/Os. | ++-------------------+-------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The disk partition has frequent I/Os. 
Data may fail to be written to the Kafka topic for which the alarm is generated. + +Possible Causes +--------------- + +- There are many replicas configured for the topic. +- The parameter for batch writing producer's messages is inappropriately configured. The service traffic of this topic is too heavy, and the current partition configuration is inappropriate. + +Procedure +--------- + +**Check the number of topic replicas.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. Locate the row that contains this alarm, click |image1|, and view the host name in **Location**. + +#. On FusionInsight Manager, choose **Cluster**, click the name of the desired cluster, choose **Services** > **Kafka** > **KafkaTopic Monitor**, search for the topic for which the alarm is generated, and check the number of replicas. + +#. .. _alm-38009__li8398191175118: + + Reduce the replication factors of the topic (for example, reduce to **3**) if the number of replicas is greater than 3. + + Run the following command on the FusionInsight client to replan the replicas of Kafka topics: + + **kafka-reassign-partitions.sh** **--zookeeper** *{zk_host}:{port}*\ **/kafka** **--reassignment-json-file** *{manual assignment json file path}* **--execute** + + For example: + + **/opt/client/Kafka/kafka/bin/kafka-reassign-partitions.sh** **--zookeeper 10.149.0.90:2181,10.149.0.91:2181,10.149.0.92:2181/kafka** **--reassignment-json-file expand-cluster-reassignment.json** **--execute** + + .. note:: + + In the **expand-cluster-reassignment.json** file, describe the brokers to which the partitions of the topic are migrated in the following format: {"partitions":[{"topic": "*topicName*","partition": 1,"replicas": [1,2,3] }],"version":1} + +#. Observe for a period of time and check whether the alarm is cleared. If the alarm persists, go to :ref:`5 `. + +**Check the partition planning of the topic.** + +5. .. _alm-38009__li15319131241119: + + On the **KafkaTopic Monitor** page, view **Topic Input Traffic** in the **Topic Traffic** area of each topic, obtain the topic with the largest value, and check the partitions of this topic as well as information about the host of these partitions. + +6. .. _alm-38009__li7320112121118: + + Log in to the host queried in :ref:`5 ` and run the **iostat -d -x** command to check the **%util** value of each disk. + + |image2| + + - If the **%util** value of each disk exceeds the threshold (**80%** by default), expand the Kafka disk capacity. After the capacity expansion, replan the topic partitions by referring to :ref:`3 `. + + - If the **%util** values of the disks vary greatly, check the disk partition configuration of Kafka. For example, check the value of **log.dirs** in the **${BIGDATA_HOME}/FusionInsight_HD\_8.1.0.1/1_14_Broker/etc/server.properties** file. + + Run the following command to view the **Filesystem** information: + + **df -h** *log.dirs value* + + The command output is as follows. + + |image3| + + - If the partition where Filesystem is located matches the partition with a high **%util** value, plan Kafka partitions on idle disks, configure **log.dirs** as an idle disk directory, and replan topic partitions by referring to :ref:`3 `. Ensure that the partitions of the topic are evenly distributed to each disk. + +7. Observe for a period of time and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, repeat :ref:`5 ` to :ref:`6 ` three times. Then, go to :ref:`8 `. + +8. .. 
_alm-38009__li1032011218115: + + Observe for a period of time and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect fault information.** + +9. .. _alm-38009__li1473912318017: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +10. Expand the **Service** drop-down list, and select **Kafka** for the target cluster. + +11. Click |image4| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0263895771.png +.. |image2| image:: /_static/images/en-us_image_0000001441218685.png +.. |image3| image:: /_static/images/en-us_image_0000001441098753.png +.. |image4| image:: /_static/images/en-us_image_0263895859.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38010_topics_with_single_replica.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38010_topics_with_single_replica.rst new file mode 100644 index 0000000..61cf56f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-38010_topics_with_single_replica.rst @@ -0,0 +1,109 @@ +:original_name: ALM-38010.html + +.. _ALM-38010: + +ALM-38010 Topics with Single Replica +==================================== + +Description +----------- + +The system checks the number of replicas of each topic every 60 seconds on the node where the Kafka Controller resides. This alarm is generated when there is one replica for a topic. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +38010 Warning No +======== ============== ===================== + +Parameters +---------- + ++-------------+----------------------------------------------------------------+ +| Name | Meaning | ++=============+================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------+----------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+----------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+----------------------------------------------------------------+ +| TopicName | Specifies the list of topics for which the alarm is generated. | ++-------------+----------------------------------------------------------------+ + +Impact on the System +-------------------- + +There is the single point of failure (SPOF) risk for topics with only one replica. When the node where the replica resides becomes abnormal, the partition does not have a leader, and services on the topic are affected. + +Possible Causes +--------------- + +- The number of replicas for the topic is incorrectly configured. + +Procedure +--------- + +**Check the number of replicas for the topic.** + +#. 
On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**, click |image1| of this alarm, and view the **TopicName** list in **Location**. + +#. Check whether replicas need to be added for the topic for which the alarm is generated. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-38010__li135931325311: + + On the FusionInsight client, re-plan topic replicas and describe the partition distribution of the topic in the **add-replicas-reassignment.json** file in the following format: {"partitions":[{"topic": "*topic name*","partition": 1,"replicas": [1,2] }],"version":1}. Then, run the following command to add replicas: + + **kafka-reassign-partitions.sh** **--zookeeper** *{zk_host}:{port}*\ **/kafka** **--reassignment-json-file** *{manual assignment json file path}* **--execute** + + For example: + + **/opt/client/Kafka/kafka/bin/kafka-reassign-partitions.sh --zookeeper 192.168.0.90:2181,192.168.0.91:2181,192.168.0.92:2181/kafka --reassignment-json-file add-replicas-reassignment.json --execute** + +#. Run the following command to check the task execution progress: + + **kafka-reassign-partitions.sh** **--zookeeper** *{zk_host}:{port}*\ **/kafka** **--reassignment-json-file** *{manual assignment json file path}* **--verify** + + For example: + + **/opt/client/Kafka/kafka/bin/kafka-reassign-partitions.sh --zookeeper 192.168.0.90:2181,192.168.0.91:2181,192.168.0.92:2181/kafka --reassignment-json-file add-replicas-reassignment.json --verify** + +#. .. _alm-38010__li477744715546: + + After completing the handling operations or confirming that the alarm has no impact, manually clear the alarm on FusionInsight Manager. + +#. After a period of time, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-38010__li266761095919: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +8. In the **Service** area, select **Kafka** in the required cluster. + +9. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +If the alarm has no impact, manually clear the alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417510.png +.. |image2| image:: /_static/images/en-us_image_0000001125163135.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43001_spark2x_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43001_spark2x_service_unavailable.rst new file mode 100644 index 0000000..61cc1a3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43001_spark2x_service_unavailable.rst @@ -0,0 +1,115 @@ +:original_name: ALM-43001.html + +.. _ALM-43001: + +ALM-43001 Spark2x Service Unavailable +===================================== + +Description +----------- + +The system checks the Spark2x service status every 300 seconds. This alarm is generated when the Spark2x service is unavailable. + +This alarm is cleared when the Spark2x service recovers. 
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +43001 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The Spark tasks submitted by users fail to be executed. + +Possible Causes +--------------- + +- The KrbServer service is abnormal. +- The LdapServer service is abnormal. +- ZooKeeper is abnormal. +- HDFS is abnormal. +- Yarn is abnormal. +- The corresponding Hive service is abnormal. +- The Spark2x assembly package is abnormal. + +Procedure +--------- + +If the alarm indicates that the Spark2x assembly package is abnormal, wait for about 10 minutes. The alarm is automatically cleared. + +**Check whether service unavailability alarms exist in services that Spark2x depends on.** + +#. On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Alarm** > **Alarms**. + +#. Check whether the following alarms exist in the alarm list: + + - ALM-25500 KrbServer Service Unavailable + - ALM-25000 LdapServer Service Unavailable + - ALM-13000 ZooKeeper Service Unavailable + - ALM-14000 HDFS Service Unavailable + - ALM-18000 Yarn Service Unavailable + - ALM-16004 Hive Service Unavailable + + .. note:: + + If the multi-instance function is enabled for the cluster and multiple Spark2x services are installed, check the Spark2x service for which the alarm is generated based on the value of **ServiceName** in the location information and check whether the Hive service is faulty. Spark2x corresponds to Hive, spark2x1 corresponds to Hive1, and other services follow the same rule. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +#. .. _alm-43001__l0ca5fe03ce10420c9ad6c90f8583a4bd: + + Handle the alarms based on the troubleshooting methods provided in the alarm help. + + After the alarms are cleared, wait a few minutes and check whether the Spark2x Service Unavailable alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Collect fault information.** + +4. .. _alm-43001__en-us_topic_0085589435_li3748337517: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +5. In the **Service** area, select the following services of the desired cluster. (Hive is the specific Hive service determined based on **ServiceName** in the alarm location information). + + - KrbServer + - LdapServer + - ZooKeeper + - HDFS + - Yarn + - Hive + +6. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +7. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0263895574.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43006_heap_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43006_heap_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..641e4ed --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43006_heap_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-43006.html + +.. _ALM-43006: + +ALM-43006 Heap Memory Usage of the JobHistory2x Process Exceeds the Threshold +============================================================================= + +Description +----------- + +The system checks the JobHistory2x Process status every 30 seconds. The alarm is generated when the heap memory usage of a JobHistory2x Process exceeds the threshold (95% of the maximum memory). + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +43006 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available JobHistory2x Process heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The heap memory of the JobHistory2x Process is overused or the heap memory is inappropriately allocated. + +Procedure +--------- + +**Check heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm** **> Alarms** and select the alarm whose **ID** is **43006**. Check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. 
On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Instance** and click the JobHistory2x for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the Chart area and choose **Customize** > **Memory > JobHistory2x Memory Usage Statistics** from the drop-down list box in the upper right corner and click **OK**. Check whether the used heap memory of the JobHistory2x Process reaches the threshold(default value is 95%) of the maximum heap memory specified for JobHistory2x. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`7 `. + +#. .. _alm-43006__li11885161253: + + On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click **JobHistory2x** by which the alarm is reported to go to the **Dashboard** page, click the drop-down list in the upper right corner of the chart area, choose **Customize** > **Memory > Statistics for the heap memory of the JobHistory2x Process**, and click **OK**. Based on the alarm generation time, check the values of the used heap memory of the JobHistory2x process in the corresponding period and obtain the maximum value. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Configurations**, and click **All Configurations**. Choose **JobHistory2x** > **Default**. The default value of **SPARK_DAEMON_MEMORY** is 4GB. You can change the value according to the following rules: Ratio of the maximum heap memory usage of the JobHistory2x to the **Threshold** of the **JobHistory2x Heap Memory Usage Statistics (JobHistory2x)** in the alarm period. If this alarm is generated occasionally after the parameter value is adjusted, increase the value by 0.5 times. If the alarm is frequently reported after the parameter value is adjusted, increase the value by 1 time. + + .. note:: + + On the FusionInsight Manager home page, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* **>** **Spark2x** > **Memory** >\ **JobHistory2x Heap Memory Usage Statistics (JobHistory2x)** to view **Threshold**. + +#. Restart all JobHistory2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-43006__li113872433153: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **Spark2x** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269417534.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43007_non-heap_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43007_non-heap_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..c30d162 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43007_non-heap_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-43007.html + +.. _ALM-43007: + +ALM-43007 Non-Heap Memory Usage of the JobHistory2x Process Exceeds the Threshold +================================================================================= + +Description +----------- + +The system checks the JobHistory2x Process status every 30 seconds. The alarm is generated when the non-heap memory usage of a JobHistory2x Process exceeds the threshold (95% of the maximum memory). + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +43007 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available JobHistory2x Process non-heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The non-heap memory of the JobHistory2x Process is overused or the non-heap memory is inappropriately allocated. + +Procedure +--------- + +**Check non-heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** and select the alarm whose **ID** is **43007**. Check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. 
On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Instance** and click the JobHistory2x for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the Chart area and choose **Customize** > **Memory** > **JobHistory2x Memory Usage Statistics** from the drop-down list box in the upper right corner and click **OK**, Check whether the used non-heap memory of the JobHistory2x Process reaches the threshold(default value is 95%) of the maximum non-heap memory specified for JobHistory2x. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`7 `. + +#. .. _alm-43007__li1580615311553: + + On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click **JobHistory2x** by which the alarm is reported to go to the **Dashboard** page, click the drop-down list in the upper right corner of the chart area, choose **Customize** > **Memory > Statistics for the** **non-heap** **memory of the JobHistory2x Process**, and click **OK**. Based on the alarm generation time, check the values of the used non-heap memory of the JobHistory2x process in the corresponding period and obtain the maximum value. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Configurations**, and click **All Configurations**. Choose **JobHistory2x** > **Default**. You can change the value of **-XX:MaxMetaspaceSize** in **SPARK_DAEMON_JAVA_OPTS** according to the following rules: Ratio of the JobHistory2x non-heap memory usage to the **Threshold** of **JobHistory2x** **Non-Heap** **Memory Usage Statistics (JobHistory2x)** in the alarm period. + + .. note:: + + On the FusionInsight Manager home page, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* **>** **Spark2x** > **Memory** >\ **JobHistory2x** **Non-Heap** **Memory Usage Statistics (JobHistory2x)** to view **Threshold**. + +#. Restart all JobHistory2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-43007__li18556111517300: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **Spark2x** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269417535.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43008_the_direct_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43008_the_direct_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..4467eb2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43008_the_direct_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-43008.html + +.. _ALM-43008: + +ALM-43008 The Direct Memory Usage of the JobHistory2x Process Exceeds the Threshold +=================================================================================== + +Description +----------- + +The system checks the JobHistory2x Process status every 30 seconds. The alarm is generated when the direct memory usage of a JobHistory2x Process exceeds the threshold (95% of the maximum memory). + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +43008 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available JobHistory2x Process direct memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct memory of the JobHistory2x Process is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +**Check direct memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm** **> Alarms** and select the alarm whose **ID** is **43008**. Check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. 
On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Instance** and click the JobHistory2x for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the Chart area and choose **Customize** > **Memory** > **JobHistory2x Memory Usage Statistics** from the drop-down list box in the upper right corner and click **OK**. Check whether the used direct memory of the JobHistory2x Process reaches the threshold (default value is 95%) of the maximum direct memory specified for JobHistory2x. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`7 `. + +#. .. _alm-43008__li1385194210583: + + On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click **JobHistory2x** by which the alarm is reported to go to the **Dashboard** page, click the drop-down list in the upper right corner of the chart area, choose **Customize** > **Memory >** **Direct Memory of JobHistory2x**, and click **OK**. Based on the alarm generation time, check the values of the used direct memory of the JobHistory2x process in the corresponding period and obtain the maximum value. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Configurations**, and click **All Configurations**. Choose **JobHistory2x** > **Default**. The default value of **-XX:MaxDirectMemorySize** in **SPARK_DAEMON_JAVA_OPTS** is 512 MB. You can change the value according to the following rules: Ratio of the maximum direct memory usage of the JobHistory2x to the **Threshold** of the **JobHistory2x** **Direct** **Memory Usage Statistics (JobHistory2x)** in the alarm period. If this alarm is generated occasionally after the parameter value is adjusted, increase the value by 0.5 times. If the alarm is frequently reported after the parameter value is adjusted, increase the value by 1 time. It is recommended that the value be less than or equal to the value of **SPARK_DAEMON_MEMORY**. + + .. note:: + + On the FusionInsight Manager home page, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* **>** **Spark2x** > **Memory** >\ **JobHistory2x** **Direct** **Memory Usage Statistics (JobHistory2x)** to view **Threshold**. + +#. Restart all JobHistory2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-43008__li1088894514319: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **Spark2x** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0269417537.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43009_jobhistory2x_process_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43009_jobhistory2x_process_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..43118f2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43009_jobhistory2x_process_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: ALM-43009.html + +.. _ALM-43009: + +ALM-43009 JobHistory2x Process GC Time Exceeds the Threshold +============================================================ + +Description +----------- + +The system checks the garbage collection (GC) time of the JobHistory2x Process every 60 seconds. This alarm is generated when the detected GC time exceeds the threshold (exceeds 5 seconds for three consecutive checks.) To change the threshold, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Spark2x** > **GC Time** > **Total GC time in milliseconds (JobHistory2x)**. This alarm is cleared when the JobHistory2x GC time is shorter than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +43009 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the GC time exceeds the threshold, JobHistory2x maybe run in low performance. + +Possible Causes +--------------- + +The memory of JobHistory2x is overused, the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC time.** + +#. 
On the FusionInsight Manager portal, choose **O&M > Alarm > Alarm\ s** and select the alarm whose **ID** is **43009**. Check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Instance** and click the JobHistory2x for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the Chart area and choose **Customize** > **GC Time** > **Garbage Collection (GC) Time of JobHistory2x** from the drop-down list box in the upper right corner and click **OK** to check whether the GC time is longer than the threshold(default value: 12 seconds). + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`6 `. + +#. .. _alm-43009__li16285182113329: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Configurations**, and click **All Configurations**. Choose **JobHistory2x** > **Default**. The default value of **SPARK_DAEMON_MEMORY** is 4GB. You can change the value according to the following rules: If this alarm is generated occasionally, increase the value by 0.5 times. If the alarm is frequently reported, increase the value by 1 time. + +#. Restart all JobHistory2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-43009__li81551125133212: + + On the FusionInsight Manager interface of active and standby clusters, choose **O&M** > **Log > Download**. + +7. Select **Spark2x** in the required cluster from the **Service**. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417538.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43010_heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43010_heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..be41c7d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43010_heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-43010.html + +.. _ALM-43010: + +ALM-43010 Heap Memory Usage of the JDBCServer2x Process Exceeds the Threshold +============================================================================= + +Description +----------- + +The system checks the JDBCServer2x Process status every 30 seconds. The alarm is generated when the heap memory usage of a JDBCServer2x Process exceeds the threshold (95% of the maximum memory). 
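+
+As a rough illustration only (not the actual Manager implementation), the trigger condition described above can be thought of as a simple percentage comparison; the sample values below are hypothetical.
+
+.. code-block:: python
+
+   # Illustrative only: the heap-usage condition described above, with made-up
+   # sample values rather than metrics read from a real cluster.
+   max_heap_mb = 4096        # maximum heap configured for the JDBCServer2x process
+   used_heap_mb = 3950       # heap currently in use
+   threshold_percent = 95    # default alarm threshold
+
+   usage_percent = used_heap_mb / max_heap_mb * 100
+   if usage_percent > threshold_percent:
+       print(f"ALM-43010 condition met: {usage_percent:.1f}% > {threshold_percent}%")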
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +43010 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available JDBCServer2x Process heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The heap memory of the JDBCServer2x Process is overused or the heap memory is inappropriately allocated. + +Procedure +--------- + +**Check heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarm\ s** and select the alarm whose **ID** is **43010**. Check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Instance** and click the JDBCServer2x for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the Chart area and choose **Customize** > **Memory** > **JDBCServer2x Memory Usage Statistics** from the drop-down list box in the upper right corner and click **OK**. Check whether the used heap memory of the JDBCServer2x Process reaches the threshold(default value is 95%) of the maximum heap memory specified for JDBCServer2x. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`7 `. + +#. .. _alm-43010__li11885161253: + + On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click **JDBCServer2x** by which the alarm is reported to go to the **Dashboard** page, click the drop-down list in the upper right corner of the chart area, choose **Customize** > **Memory > Statistics for the heap memory of the JDBCServer2x Process**, and click **OK**. 
Based on the alarm generation time, check the values of the used heap memory of the JDBCServer2x process in the corresponding period and obtain the maximum value. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Configurations**, and click **All Configurations**. Choose **JDBCServer2x** > **Tuning**. The default value of **SPARK_DRIVER_MEMORY** is 4 GB. You can change the value according to the following rules: Ratio of the maximum heap memory usage of the JDBCServer2x to the **Threshold** of the **JDBCServer2x Heap Memory Usage Statistics (JDBCServer2x)** in the alarm period. If this alarm is generated occasionally after the parameter value is adjusted, increase the value by 0.5 times. If the alarm is frequently reported after the parameter value is adjusted, increase the value by 1 time. In the case of large service volume and high concurrency, add instances. + + .. note:: + + On the FusionInsight Manager home page, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* **>** **Spark2x** > **Memory** >\ **JDBCServer2x Heap Memory Usage Statistics (JDBCServer2x)** to view **Threshold**. + +#. Restart all JDBCServer2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-43010__li11492144593210: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **Spark2x** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417539.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43011_non-heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43011_non-heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..fd13a13 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43011_non-heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-43011.html + +.. _ALM-43011: + +ALM-43011 Non-Heap Memory Usage of the JDBCServer2x Process Exceeds the Threshold +================================================================================= + +Description +----------- + +The system checks the JDBCServer2x Process status every 30 seconds. The alarm is generated when the non-heap memory usage of a JDBCServer2x Process exceeds the threshold (95% of the maximum memory). 
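+
+The following back-of-the-envelope sketch illustrates the **SPARK_DRIVER_MEMORY** sizing rule from the ALM-43010 procedure above; all numbers are invented, and reading "increase the value by 0.5 times/1 time" as multiplying by 1.5/2 is an assumption.
+
+.. code-block:: python
+
+   # Hypothetical worked example of the SPARK_DRIVER_MEMORY sizing rule described in
+   # the ALM-43010 procedure above; none of these numbers come from a real cluster.
+   current_memory_gb = 4      # current SPARK_DRIVER_MEMORY
+   peak_usage_percent = 98    # maximum heap usage observed during the alarm period
+   threshold_percent = 95     # JDBCServer2x Heap Memory Usage Statistics threshold
+
+   # Scale the configured memory by the ratio of the observed peak to the threshold.
+   resized_gb = current_memory_gb * peak_usage_percent / threshold_percent
+
+   # If the alarm still appears occasionally, add roughly 50%; if it appears
+   # frequently, double the value (reading "increase by 0.5/1 times" as x1.5/x2).
+   occasional_gb = resized_gb * 1.5
+   frequent_gb = resized_gb * 2.0
+
+   print(round(resized_gb, 1), round(occasional_gb, 1), round(frequent_gb, 1))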
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +43011 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available JDBCServer2x Process non-heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The non-heap memory of the JDBCServer2x Process is overused or the non-heap memory is inappropriately allocated. + +Procedure +--------- + +**Check non-heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarm\ s** and select the alarm whose **ID** is **43011**. Check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Instance** and click the JDBCServer2x for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the Chart area and choose **Customize** > **Memory** > **JDBCServer2x Memory Usage Statistics** from the drop-down list box in the upper right corner and click **OK**, Check whether the used non-heap memory of the JDBCServer2x Process reaches the threshold(default value is 95%) of the maximum non-heap memory specified for JDBCServer2x. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`7 `. + +#. .. _alm-43011__li1580615311553: + + On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click **JDBCServer2x** by which the alarm is reported to go to the **Dashboard** page, click the drop-down list in the upper right corner of the chart area, choose **Customize** > **Memory > Statistics for the** **non-heap** **memory of the JDBCServer2x Process**, and click **OK**. 
Based on the alarm generation time, check the values of the used non-heap memory of the JDBCServer2x process in the corresponding period and obtain the maximum value. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Configurations**, and click **All** **Configurations**. Choose **JDBCServer2x** > **Tuning**. You can change the value of **-XX: MaxMetaspaceSize** in **spark.driver.extraJavaOptions** according to the following rules: Ratio of the JDBCServer2x non-heap memory usage to the **Threshold** of **JDBCServer2x** **Non-Heap** **Memory Usage Statistics ( JDBCServer2x)** in the alarm period. + + .. note:: + + On the FusionInsight Manager home page, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* **>** **Spark2x** > **Memory** > **JDBCServer2x** **Non-Heap** **Memory Usage Statistics (JDBCServer2x)** to view **Threshold**. + +#. Restart all JDBCServer2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-43011__li733812157332: + + On the FusionInsight Manager portal, choose **O&M** > **Log >Download**. + +8. Select **Spark2x** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417540.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43012_direct_heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43012_direct_heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..03f305e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43012_direct_heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-43012.html + +.. _ALM-43012: + +ALM-43012 Direct Heap Memory Usage of the JDBCServer2x Process Exceeds the Threshold +==================================================================================== + +Description +----------- + +The system checks the JDBCServer2x Process status every 30 seconds. The alarm is generated when the direct heap memory usage of a JDBCServer2x Process exceeds the threshold (95% of the maximum memory). 
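+
+For orientation only, the sketch below shows how a single **-XX** flag inside a JVM options string such as **spark.driver.extraJavaOptions** might be adjusted when following the tuning steps above; the helper function and the flag values are hypothetical, not part of the product.
+
+.. code-block:: python
+
+   # Hypothetical sketch of adjusting one -XX flag inside a JVM options string such
+   # as spark.driver.extraJavaOptions; the flag values shown are examples only.
+   import re
+
+   def set_xx_flag(options: str, flag: str, value: str) -> str:
+       """Replace (or append) a -XX:<flag>=<value> entry in a JVM options string."""
+       pattern = rf"-XX:{flag}=\S+"
+       replacement = f"-XX:{flag}={value}"
+       if re.search(pattern, options):
+           return re.sub(pattern, replacement, options)
+       return f"{options} {replacement}".strip()
+
+   opts = "-XX:MaxMetaspaceSize=512M -XX:MaxDirectMemorySize=512M"
+   print(set_xx_flag(opts, "MaxMetaspaceSize", "768M"))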
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +43012 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+==============================================================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated. | ++-------------------+------------------------------------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available JDBCServer2x Process direct heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct heap memory of the JDBCServer2x Process is overused or the direct heap memory is inappropriately allocated. + +Procedure +--------- + +**Check direct heap memory usage.** + +#. On the FusionInsight Manager portal, choose **O&M > Alarm > Alarms** and select the alarm whose **ID** is **43012**. Check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Instance** and click the JDBCServer2x for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the Chart area and choose **Customize** > Memory > **JDBCServer2x Memory Usage Statistics** from the drop-down list box in the upper right corner and click **OK**. Check whether the used direct heap memory of the JDBCServer2x Process reaches the threshold(default value is 95%) of the maximum direct heap memory specified for JDBCServer2x. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`7 `. + +#. .. _alm-43012__li392720233195: + + On the FusionInsight Manager home page, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click **JDBCServer2x** by which the alarm is reported to go to the **Dashboard** page, click the drop-down list in the upper right corner of the chart area, choose **Customize** > **Memory >** **Direct Memory of JDBCServer2x**, and click **OK**. 
Based on the alarm generation time, check the values of the used direct memory of the JDBCServer2x process in the corresponding period and obtain the maximum value. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Configurations**, and click **All** **Configurations**. Choose **JDBCServer2x** > **Tuning**. The default value of **-XX:MaxDirectMemorySize** in **spark.driver.extraJavaOptions** is 512 MB. You can change the value according to the following rules: Ratio of the maximum direct memory usage of the JDBCServer2x to the **Threshold** of the **JDBCServer2x** **Direct** **Memory Usage Statistics (JDBCServer2x)** in the alarm period. If this alarm is generated occasionally after the parameter value is adjusted, increase the value by 0.5 times. If the alarm is frequently reported after the parameter value is adjusted, increase the value by 1 time. In the case of large service volume and high service concurrency, you are advised to add instances. + + .. note:: + + On the FusionInsight Manager home page, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* **>** **Spark2x** > **Memory** >\ **JDBCServer2x** **Direct** **Memory Usage Statistics (JDBCServer2x)** to view **Threshold**. + +#. Restart all JDBCServer2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-43012__li11472516103510: + + On the FusionInsight Manager portal, choose **O&M** > **Log > Download**. + +8. Select **Spark2x** in the required cluster from the **Service**. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +10. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417541.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43013_jdbcserver2x_process_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43013_jdbcserver2x_process_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..af715fa --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43013_jdbcserver2x_process_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: ALM-43013.html + +.. _ALM-43013: + +ALM-43013 JDBCServer2x Process GC Time Exceeds the Threshold +============================================================ + +Description +----------- + +The system checks the garbage collection (GC) time of the JDBCServer2x Process every 60 seconds. This alarm is generated when the detected GC time exceeds the threshold (exceeds 5 seconds for three consecutive checks.) To change the threshold, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Spark2x** > **GC Time** > **Total GC time in milliseconds (JDBCServer2x)**. This alarm is cleared when the JDBCServer2x GC time is shorter than or equal to the threshold. 
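+
+To make the trigger condition above concrete, the following sketch (with invented sample values) mirrors the "three consecutive checks above the threshold" logic; it is illustrative only, not the Manager implementation.
+
+.. code-block:: python
+
+   # Illustrative only: the "three consecutive checks above the threshold" logic
+   # referred to above, using invented GC time samples (one per 60-second check).
+   gc_time_samples_ms = [4200, 5600, 6100, 5300]
+   threshold_ms = 5000   # default: 5 seconds
+
+   consecutive = 0
+   alarm_raised = False
+   for sample in gc_time_samples_ms:
+       consecutive = consecutive + 1 if sample > threshold_ms else 0
+       if consecutive >= 3:
+           alarm_raised = True
+   print(alarm_raised)   # True: the last three samples all exceed 5 seconds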
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +43013 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Name | Meaning | ++===================+=====================================================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| ServiceName | Specifies the service name for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role name for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the object (host ID) for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the GC time exceeds the threshold, JDBCServer2x maybe run in low performance. + +Possible Causes +--------------- + +The memory of JDBCServer2x is overused, the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC time.** + +#. On the FusionInsight Manager portal, choose **O&M** **> Alarm** **> Alarms** and select the alarm whose **ID** is **43013**. Check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Instance** and click the JDBCServer2x for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the Chart area and choose **Customize** > **GC Time** > **Garbage Collection (GC) Time of JDBCServer2x** from the drop-down list box in the upper right corner and click **OK** to check whether the GC time is longer than the threshold(default value: 12 seconds). + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`6 `. + +#. .. _alm-43013__li187231820103612: + + On the FusionInsight Manager portal, choose **Cluster >** *Name of the desired cluster* **> Services** > **Spark2x** > **Configurations**, and click **All** **Configurations**. Choose **JDBCServer2x** > **Default**. The default value of **SPARK_DRIVER_MEMORY** is 4 GB. You can change the value according to the following rules: If this alarm is generated occasionally, increase the value by 0.5 times. If the alarm is frequently reported, increase the value by 1 time. In the case of large service volume and high service concurrency, you are advised to add instances. + +#. Restart all JDBCServer2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-43013__li1560184215369: + + On the FusionInsight Manager interface of active and standby clusters, choose **O&M** > **Log > Download**. + +7. 
Select **Spark2x** in the required cluster from the **Service**. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417542.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43017_jdbcserver2x_process_full_gc_number_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43017_jdbcserver2x_process_full_gc_number_exceeds_the_threshold.rst new file mode 100644 index 0000000..514ac1d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43017_jdbcserver2x_process_full_gc_number_exceeds_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: ALM-43017.html + +.. _ALM-43017: + +ALM-43017 JDBCServer2x Process Full GC Number Exceeds the Threshold +=================================================================== + +Description +----------- + +The system checks the number of Full garbage collection (GC) times of the JDBCServer2x process every 60 seconds. This alarm is generated when the detected Full GC number exceeds the threshold (exceeds 12 for three consecutive checks.) You can change the threshold by choosing **O&M > Alarm** > **Thresholds** > *Name of the desired cluster* > **Spark2x** > **GC number** > **Full GC Number of JDBCServer2x**. This alarm is cleared when the Full GC number of the JDBCServer2x process is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +43017 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Description | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +The performance of the JDBCServer2x process is affected, or even the JDBCServer2x process is unavailable. + +Possible Causes +--------------- + +The heap memory usage of the JDBCServer2x process is excessively large, or the heap memory is inappropriately allocated. As a result, Full GC occurs frequently. + +Procedure +--------- + +**Check the number of Full GCs.** + +#. 
Log in to FusionInsight Manager, choose **O&M** > **Alarm** **> Alarms**, select this alarm, and check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. On the displayed page, click the JDBCServer2x for which the alarm is reported. On the **Dashboard** page that is displayed, click the drop-down menu in the Chart area and choose **Customize** > **GC Number** > **Full GC Number of JDBCServer2x** in the upper right corner and click **OK**. Check whether the number of Full GCs of the JDBCServer2x process is greater than the threshold(default value: 12). + + - If it is, go to :ref:`3 `. + - If it is not, go to :ref:`6 `. + +#. .. _alm-43017__li67301951054: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Configurations** > **All Configurations**. On the displayed page, choose **JDBCServer2x** > **Tuning**. The default value of **SPARK_DRIVER_MEMORY** is 4GB. You can change the value according to the following rules: If this alarm is generated occasionally, increase the value by 0.5 times. If the alarm is frequently reported, increase the value by 1 time. In the case of large service volume and high concurrency, add instances. + +#. Restart all JDBCServer2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-43017__li8341010966: + + Log in to FusionInsight Manager, and choose **O&M** > **Log** > **Download**. + +7. Select **Spark2x** in the required cluster from the **Service** drop-down list. + +8. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. Then, click **Download**. + +9. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417543.gif diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43018_jobhistory2x_process_full_gc_number_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43018_jobhistory2x_process_full_gc_number_exceeds_the_threshold.rst new file mode 100644 index 0000000..61f4043 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43018_jobhistory2x_process_full_gc_number_exceeds_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: ALM-43018.html + +.. _ALM-43018: + +ALM-43018 JobHistory2x Process Full GC Number Exceeds the Threshold +=================================================================== + +Description +----------- + +The system checks the number of Full garbage collection (GC) times of the JobHistory2x process every 60 seconds. This alarm is generated when the detected Full GC number exceeds the threshold (exceeds 12 for three consecutive checks.) You can change the threshold by choosing **O&M** > **Alarm > Thresholds** > *Name of the desired cluster* > **Spark2x** > **GC number** > **Full GC Number of JobHistory2x**. 
This alarm is cleared when the Full GC number of the JobHistory2x process is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +43018 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Description | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +The performance of the JobHistory2x process is affected, or even the JobHistory2x process is unavailable. + +Possible Causes +--------------- + +The heap memory usage of the JobHistory2x process is excessively large, or the heap memory is inappropriately allocated. As a result, Full GC occurs frequently. + +Procedure +--------- + +**Check the number of Full GCs.** + +#. Log in to FusionInsight Manager, choose **O&M** > **Alarm** **> Alarms**, select this alarm, and check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. On the displayed page, click the JobHistory2x for which the alarm is reported. On the **Dashboard** page that is displayed, click the drop-down menu in the Chart area and choose **Customize** > **GC Number** > **Full GC Number of JobHistory2x** in the upper right corner and click **OK**. Check whether the number of Full GCs of the JobHistory2x process is greater than the threshold(default value: 12). + + - If it is, go to :ref:`3 `. + - If it is not, go to :ref:`6 `. + +#. .. _alm-43018__li15899121475: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Configurations** > **All Configurations**. On the displayed page, choose **JobHistory2x** > **Default.** The default value of **SPARK_DAEMON_MEMORY** is 4GB. You can change the value according to the following rules: If this alarm is generated occasionally, increase the value by 0.5 times. If the alarm is frequently reported, increase the value by 1 time. + +#. Restart all JobHistory2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + + - If it is, no further action is required. + - If it is not, go to :ref:`6 `. + + **Collect fault information.** + +#. .. _alm-43018__li159003211673: + + Log in to FusionInsight Manager, and choose **O&M** > **Log** > **Download**. + +#. Select **Spark2x** in the required cluster from the **Service**. + +#. Click |image1| in the upper right corner. In the displayed dialog box, set **Start Date** and **End Date** to 10 minutes before and after the alarm generation time respectively and click **OK**. 
Then, click **Download**. + +#. Contact the O&M personnel and send the collected logs. + +Alarm Clearing +-------------- + +This alarm will be automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417544.gif diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43019_heap_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43019_heap_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..703dea1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43019_heap_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-43019.html + +.. _ALM-43019: + +ALM-43019 Heap Memory Usage of the IndexServer2x Process Exceeds the Threshold +============================================================================== + +Description +----------- + +The system checks the IndexServer2x process status every 30 seconds. The alarm is generated when the heap memory usage of a IndexServer2x process exceeds the threshold (95% of the maximum memory). + +Attribute +--------- + +======== ======== ========== +Alarm ID Severity Auto Clear +======== ======== ========== +43019 Major Yes +======== ======== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available IndexServer2x process heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The heap memory of the IndexServer2x process is overused or the heap memory is inappropriately allocated. + +Procedure +--------- + +**Check the heap memory usage.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** **> Alarms**. In the displayed alarm list, choose the alarm for which the ID is **43019**, and check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click the IndexServer2x that reported the alarm to go to the **Dashboard** page. Click the drop-down list in the upper right corner of the chart area, and choose **Customize** > **Memory** > **IndexServer2x Memory Usage Statistics** > **OK**. 
Check whether the heap memory used by the IndexServer2x process reaches the maximum heap memory threshold (95% by default). + +   - If the threshold is reached, go to :ref:`3 `. +   - If the threshold is not reached, go to :ref:`7 `. + +#. .. _alm-43019__li1769491023514: + +   On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click the IndexServer2x that reported the alarm to go to the **Dashboard** page. Click the drop-down list in the upper right corner of the chart area, and choose **Customize** > **Memory** > **Statistics for the heap memory of the IndexServer2x Process** > **OK**. Based on the alarm generation time, check the values of the used heap memory of the IndexServer2x process in the corresponding period and obtain the maximum value. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Configurations** > **All Configurations** > **IndexServer2x** > **Tuning**. The default value of the **SPARK_DRIVER_MEMORY** parameter is 4 GB. You can change the value based on the ratio of the maximum heap memory used by the IndexServer2x process to the threshold specified by **IndexServer2x Heap Memory Usage Statistics (IndexServer2x)** in the alarm period. If the alarm persists after the parameter value is changed, increase the value by 0.5 times. If the alarm is generated frequently, double the value. + +   .. note:: + +      On FusionInsight Manager, you can choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Spark2x** > **Memory** > **IndexServer2x Heap Memory Usage Statistics (IndexServer2x)** to view the threshold. + +#. Restart all IndexServer2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + +   - If the alarm is cleared, no further action is required. +   - If the alarm is not cleared, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-43019__li1543255925610: + +   On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +8. Expand the **Service** drop-down list, and select **Spark2x** for the target cluster. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +10. Contact the O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Reference +--------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417545.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43020_non-heap_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43020_non-heap_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..c12371a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43020_non-heap_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-43020.html + +..
_ALM-43020: + +ALM-43020 Non-Heap Memory Usage of the IndexServer2x Process Exceeds the Threshold +================================================================================== + +Description +----------- + +The system checks the IndexServer2x process status every 30 seconds. The alarm is generated when the non-heap memory usage of the IndexServer2x process exceeds the threshold (95% of the maximum memory). + +Attribute +--------- + +======== ======== ========== +Alarm ID Severity Auto Clear +======== ======== ========== +43020 Major Yes +======== ======== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available IndexServer2x process non-heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The non-heap memory of the IndexServer2x process is overused or the non-heap memory is inappropriately allocated. + +Procedure +--------- + +**Check non-heap memory usage.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** **> Alarms**. In the displayed alarm list, choose the alarm for which the ID is **43020**, and check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click the IndexServer2x that reported the alarm to go to the **Dashboard** page. Click the drop-down list in the upper right corner of the chart area, and choose **Customize** > **Memory > IndexServer2x Memory Usage Statistics** > **OK**. Check whether the non-heap memory used by the IndexServer2x process reaches the maximum non-heap memory threshold (95% by default). + + - If the threshold is reached, go to :ref:`3 `. + - If the threshold is not reached, go to :ref:`7 `. + +#. .. _alm-43020__li311482053120: + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click the IndexServer2x that reported the alarm to go to the **Dashboard** page. Click the drop-down list in the upper right corner of the chart area, and choose **Customize** > **Memory** > **Statistics for the non-heap memory of the IndexServer2x Process** > **OK**. Based on the alarm generation time, check the values of the used non-heap memory of the IndexServer2x process in the corresponding period and obtain the maximum value. + +#. 
On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Configurations** > **All Configurations** > **IndexServer2x** > **Tuning**. You can change the value of **-XX:MaxMetaspaceSize** in the **spark.driver.extraJavaOptions** parameter based on the ratio of the maximum non-heap memory used by the IndexServer2x process to the threshold specified by **IndexServer2x Non-Heap Memory Usage Statistics (IndexServer2x)** in the alarm period. + +   .. note:: + +      On FusionInsight Manager, you can choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Spark2x** > **Memory** > **IndexServer2x Non-Heap Memory Usage Statistics (IndexServer2x)** to view the threshold. + +#. Restart all IndexServer2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + +   - If the alarm is cleared, no further action is required. +   - If the alarm is not cleared, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-43020__li141131720123116: + +   On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +8. Expand the **Service** drop-down list, and select **Spark2x** for the target cluster. + +9. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +10. Contact the O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Reference +--------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417546.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43021_direct_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43021_direct_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..acb3a29 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43021_direct_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-43021.html + +.. _ALM-43021: + +ALM-43021 Direct Memory Usage of the IndexServer2x Process Exceeds the Threshold +================================================================================ + +Description +----------- + +The system checks the IndexServer2x process status every 30 seconds. The alarm is generated when the direct heap memory usage of an IndexServer2x process exceeds the threshold (95% of the maximum memory). + +Attribute +--------- + +======== ======== ========== +Alarm ID Severity Auto Clear +======== ======== ========== +43021 Major Yes +======== ======== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated.
| ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available IndexServer2x process direct memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct heap memory of the IndexServer2x process is overused or the direct heap memory is inappropriately allocated. + +Procedure +--------- + +**Check direct heap memory usage.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** **> Alarms**. In the displayed alarm list, choose the alarm for which the ID is **43021**, and check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click the IndexServer2x that reported the alarm to go to the **Dashboard** page. Click the drop-down list in the upper right corner of the chart area, and choose **Customize** > **Memory** > **IndexServer2x Memory Usage Statistics** > **OK**. Check whether the direct memory used by the IndexServer2x process reaches the maximum direct memory threshold. + +   - If the threshold is reached, go to :ref:`3 `. +   - If the threshold is not reached, go to :ref:`7 `. + +#. .. _alm-43021__li141321031113812: + +   On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance**. Click the IndexServer2x that reported the alarm to go to the **Dashboard** page. Click the drop-down list in the upper right corner of the chart area, and choose **Customize** > **Memory** > **Direct Memory of IndexServer2x** > **OK**. Based on the alarm generation time, check the values of the used direct memory of the IndexServer2x process in the corresponding period and obtain the maximum value. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Configurations** > **All Configurations** > **IndexServer2x** > **Tuning**. You can change the value of **-XX:MaxDirectMemorySize** (the default value is 512 MB) in the **spark.driver.extraJavaOptions** parameter based on the ratio of the maximum direct memory used by the IndexServer2x process to the threshold specified by **IndexServer2x Direct Memory Usage Statistics (IndexServer2x)** in the alarm period. If the alarm persists after the parameter value is changed, increase the value by 0.5 times. If the alarm is generated frequently, double the value. + +   .. note:: + +      On FusionInsight Manager, you can choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Spark2x** > **Memory** > **IndexServer2x Direct Memory Usage Statistics (IndexServer2x)** to view the threshold. + +#. Restart all IndexServer2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + +   - If the alarm is cleared, no further action is required. +   - If the alarm is not cleared, go to :ref:`7 `. + +**Collect fault information.** + +7. .. _alm-43021__li181301231123812: + +   On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +8. Expand the **Service** drop-down list, and select **Spark2x** for the target cluster. + +9.
Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +10. Contact the O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Reference +--------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417547.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43022_indexserver2x_process_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43022_indexserver2x_process_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..ac62b4e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43022_indexserver2x_process_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: ALM-43022.html + +.. _ALM-43022: + +ALM-43022 IndexServer2x Process GC Time Exceeds the Threshold +============================================================= + +Description +----------- + +The system checks the GC time of the IndexServer2x process every 60 seconds. This alarm is generated when the detected GC time exceeds the threshold (12 seconds) for three consecutive times. To change the threshold, choose **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Spark2x** > **GC Time** > **Total GC time in milliseconds (IndexServer2x)**. This alarm is cleared when the IndexServer2x GC time is shorter than or equal to the threshold. + +Attribute +--------- + +======== ======== ========== +Alarm ID Severity Auto Clear +======== ======== ========== +43022 Major Yes +======== ======== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the GC time exceeds the threshold, IndexServer2x may run with low performance or even become unavailable. + +Possible Causes +--------------- + +The heap memory of the IndexServer2x process is overused or the heap memory is inappropriately allocated. As a result, GC occurs frequently. + +Procedure +--------- + +**Check the GC time.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** **> Alarms**. In the displayed alarm list, choose the alarm with ID **43022**, and check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#.
On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance** and click the IndexServer2x for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the Chart area and choose **Customize** > **GC Time** > **Garbage Collection (GC) Time of IndexServer2x** from the drop-down list box in the upper right corner and click **OK** to check whether the GC time is longer than the threshold (default value: 12 seconds). + +   - If the threshold is reached, go to :ref:`3 `. +   - If the threshold is not reached, go to :ref:`6 `. + +#. .. _alm-43022__li928810235414: + +   On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Configurations** > **All Configurations** > **IndexServer2x** > **Default**. The default value of **SPARK_DRIVER_MEMORY** is 4 GB. You can change the value according to the following rules: increase the value of **SPARK_DRIVER_MEMORY** to 1.5 times its default value. If this alarm is still generated occasionally after the adjustment, increase the value by another 0.5 times. Double the value if the alarm is reported frequently. + +#. Restart all IndexServer2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + +   - If the alarm is cleared, no further action is required. +   - If the alarm is not cleared, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-43022__li728614235411: + +   On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Spark2x** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +9. Contact the O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Reference +--------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417548.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43023_indexserver2x_process_full_gc_number_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43023_indexserver2x_process_full_gc_number_exceeds_the_threshold.rst new file mode 100644 index 0000000..5464335 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-43023_indexserver2x_process_full_gc_number_exceeds_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: ALM-43023.html + +.. _ALM-43023: + +ALM-43023 IndexServer2x Process Full GC Number Exceeds the Threshold +==================================================================== + +Description +----------- + +The system checks the Full GC number of the IndexServer2x process every 60 seconds. This alarm is generated when the detected Full GC number exceeds the threshold (12) for three consecutive times. You can change the threshold by choosing **O&M** > **Alarm** > **Thresholds** > *Name of the desired cluster* > **Spark2x** > **GC Number** > **Full GC Number of IndexServer2x**. This alarm is cleared when the Full GC number of the IndexServer2x process is less than or equal to the threshold.
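+
+As a rough illustration of the trigger and clear rule described above (a sketch only, not the product's monitoring code; the function name is invented here), the check amounts to:
+
+.. code-block:: python
+
+   def full_gc_alarm_state(raised, recent_counts, threshold=12, consecutive=3):
+       """Evaluate one 60-second check of the Full GC count.
+
+       raised: whether the alarm is currently active.
+       recent_counts: Full GC counts from the latest checks, newest last.
+       """
+       if not raised:
+           # Raised only after `consecutive` checks in a row exceed the threshold.
+           window = recent_counts[-consecutive:]
+           return len(window) == consecutive and all(c > threshold for c in window)
+       # Cleared as soon as the count falls back to the threshold or below.
+       return recent_counts[-1] > threshold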
+ +Attribute +--------- + +======== ======== ========== +Alarm ID Severity Auto Clear +======== ======== ========== +43023 Major Yes +======== ======== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the GC number exceeds the threshold, IndexServer2x may run with low performance or even become unavailable. + +Possible Causes +--------------- + +The heap memory of the IndexServer2x process is overused or the heap memory is inappropriately allocated. As a result, Full GC occurs frequently. + +Procedure +--------- + +**Check the number of Full GCs.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** **> Alarms**. In the displayed alarm list, choose the alarm with the ID **43023**, and check the **RoleName** in **Location** and confirm the IP address of **HostName**. + +#. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Instance** and click the IndexServer2x for which the alarm is generated to go to the **Dashboard** page. Click the drop-down menu in the chart area and choose **Customize** > **GC Number > Full GC Number of IndexServer2x** from the drop-down list box in the upper right corner and click **OK** to check whether the GC number is larger than the threshold (default value: 12). + +   - If the threshold is reached, go to :ref:`3 `. +   - If the threshold is not reached, go to :ref:`6 `. + +#. .. _alm-43023__li150175319432: + +   On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Spark2x** > **Configurations** > **All Configurations** > **IndexServer2x** > **Tuning**. The default value of **SPARK_DRIVER_MEMORY** is 4 GB. You can change the value according to the following rules: If this alarm is generated occasionally, increase the value by 0.5 times. Double the value if the alarm is reported frequently. In the case of large service volume and high service concurrency, you are advised to add instances. + +#. Restart all IndexServer2x instances. + +#. After 10 minutes, check whether the alarm is cleared. + +   - If the alarm is cleared, no further action is required. +   - If the alarm is not cleared, go to :ref:`6 `. + +**Collect fault information.** + +6. .. _alm-43023__li79972052174314: + +   On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Spark2x** for the target cluster. + +8.
Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +9. Contact the O&M personnel and send the collected fault logs. + +Alarm Clearing +-------------- + +After the fault is rectified, the system automatically clears this alarm. + +Reference +--------- + +None + +.. |image1| image:: /_static/images/en-us_image_0269417549.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-44004_presto_coordinator_resource_group_queuing_tasks_exceed_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-44004_presto_coordinator_resource_group_queuing_tasks_exceed_the_threshold.rst new file mode 100644 index 0000000..248ae95 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-44004_presto_coordinator_resource_group_queuing_tasks_exceed_the_threshold.rst @@ -0,0 +1,63 @@ +:original_name: ALM-44004.html + +.. _ALM-44004: + +ALM-44004 Presto Coordinator Resource Group Queuing Tasks Exceed the Threshold +============================================================================== + +Description +----------- + +This alarm is generated when the system detects that the number of queuing tasks in a resource group exceeds the threshold. The system queries the number of queuing tasks in a resource group through the JMX interface. You can choose **Components** > **Presto** > **Service Configuration** (switch **Basic** to **All**) > **Presto** > **resource-groups** to configure a resource group. You can choose **Components** > **Presto** > **Service Configuration** (switch **Basic** to **All**) > **Coordinator** > **Customize** > **resourceGroupAlarm** to configure the threshold of each resource group. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +44004 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the number of queuing tasks in a resource group exceeds the threshold, a large number of tasks may be in the queuing state. The Presto task time exceeds the expected value. When the number of queuing tasks in a resource group exceeds the maximum number (**maxQueued**) of queuing tasks in the resource group, new tasks cannot be executed. + +Possible Causes +--------------- + +The resource group configuration is improper or too many tasks in the resource group are submitted. + +Procedure +--------- + +#. Choose **Components** > **Presto** > **Service Configuration** (switch **Basic** to **All**) > **Presto** > **resource-groups** to adjust the resource group configuration. +#. You can choose **Components** > **Presto** > **Service Configuration** (switch **Basic** to **All**) > **Coordinator** > **Customize** > **resourceGroupAlarm** to modify the threshold of each resource group. +#. 
Collect the fault information. + +   a. Log in to the cluster node based on the host name in the fault information and query the number of queuing tasks based on **Resource Group** in the additional information on the Presto client. + +   b. Log in to the cluster node based on the host name in the fault information, view the **/var/log/Bigdata/nodeagent/monitorlog/monitor.log** file, and search for resource group information to view the monitoring collection information of the resource group. + +   c. Call the OTC Customer Hotline for support. + +      Germany: 0800 330 44 44 + +      International: +800 44556600 + +Related Information +------------------- + +None diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-44005_presto_coordinator_process_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-44005_presto_coordinator_process_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..6c7f6a1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-44005_presto_coordinator_process_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,87 @@ +:original_name: ALM-44005.html + +.. _ALM-44005: + +ALM-44005 Presto Coordinator Process GC Time Exceeds the Threshold +================================================================== + +Description +----------- + +The system collects GC time of the Presto Coordinator process every 30 seconds. This alarm is generated when the GC time exceeds the threshold (exceeds 5 seconds for three consecutive times). You can change the threshold by choosing **System** > **Configure Alarm Threshold** > **Service** > **Presto** > **Coordinator** > **Presto Process Garbage Collection Time** > **Garbage Collection Time of the Coordinator Process** on MRS Manager. This alarm is cleared when the Coordinator process GC time is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +44005 Major Yes +======== ============== ========== + +Parameter +--------- + +=========== ========================================= +Parameter Description +=========== ========================================= +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +=========== ========================================= + +Impact on the System +-------------------- + +If the GC time of the Coordinator process is too long, the Coordinator process running performance will be affected and the Coordinator process will even be unavailable. + +Possible Causes +--------------- + +The heap memory of the Coordinator process is overused or inappropriately allocated, causing frequent occurrence of the GC process. + +Procedure +--------- + +#. Check the GC time. + +   a. Go to the cluster details page and choose **Alarms**. + +      .. note:: + +         For MRS 1.8.10 or earlier, log in to MRS Manager and choose **Alarms**. + +   b. Select the alarm whose **Alarm ID** is **44005** and then check the role name in **Location** and confirm the IP address of the instance. + +   c. Choose **Components** > **Presto** > **Instances** > **Coordinator** (business IP address of the instance for which the alarm is generated) > **Customize** > **Presto Garbage Collection Time**.
Click **OK** to view the GC time. + +   d. Check whether the GC time of the Coordinator process is longer than 5 seconds. + +      - If yes, go to :ref:`1.e `. +      - If no, go to :ref:`2 `. + +   e. .. _alm-44005__en-us_topic_0225312712_li1011493181634: + +      Choose **Components** > **Presto** > **Service Configuration**, and switch **Basic** to **All**. Choose **Presto** > **Coordinator**. Increase the value of **-Xmx** (maximum heap memory) in the **JAVA_OPTS** parameter based on the site requirements. + +   f. Check whether the alarm is cleared. + +      - If yes, no further action is required. +      - If no, go to :ref:`2 `. + +#. .. _alm-44005__en-us_topic_0225312712_li572522141314: + +   Collect fault information. + +   a. On MRS Manager, choose **System** > **Export Log**. + +   b. Call the OTC Customer Hotline for support. + +      Germany: 0800 330 44 44 + +      International: +800 44556600 + +Reference +--------- + +None diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-44006_presto_worker_process_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-44006_presto_worker_process_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..b65fd59 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-44006_presto_worker_process_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,87 @@ +:original_name: ALM-44006.html + +.. _ALM-44006: + +ALM-44006 Presto Worker Process GC Time Exceeds the Threshold +============================================================= + +Description +----------- + +The system collects GC time of the Presto Worker process every 30 seconds. This alarm is generated when the GC time exceeds the threshold (exceeds 5 seconds for three consecutive times). You can change the threshold by choosing **System** > **Configure Alarm Threshold** > **Service** > **Presto** > **Worker** > **Presto Garbage Collection Time** > **Garbage Collection Time of the Worker Process** on MRS Manager. This alarm is cleared when the Worker process GC time is shorter than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +44006 Major Yes +======== ============== ========== + +Parameter +--------- + +=========== ========================================= +Parameter Description +=========== ========================================= +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +=========== ========================================= + +Impact on the System +-------------------- + +If the GC time of the Worker process is too long, the Worker process running performance will be affected and the Worker process will even be unavailable. + +Possible Causes +--------------- + +The heap memory of the Worker process is overused or inappropriately allocated, causing frequent occurrence of the GC process. + +Procedure +--------- + +#. Check the GC time. + +   a. Go to the cluster details page and choose **Alarms**. + +      .. note:: + +         For MRS 1.8.10 or earlier, log in to MRS Manager and choose **Alarms**. + +   b. Select the alarm whose **Alarm ID** is **44006**. Then check the role name in **Location** and confirm the IP address of the instance. + +   c.
Choose **Components** > **Presto** > **Instances** > **Worker** (business IP address of the instance for which the alarm is generated) > **Customize** > **Presto Garbage Collection Time**. Click **OK** to view the GC time. + +   d. Check whether the GC time of the Worker process is longer than 5 seconds. + +      - If yes, go to :ref:`1.e `. +      - If no, go to :ref:`2 `. + +   e. .. _alm-44006__en-us_topic_0225312713_li3841416113916: + +      Choose **Components** > **Presto** > **Service Configuration**, switch **Basic** to **All**, and choose **Presto** > **Worker**. Increase the value of **-Xmx** (maximum heap memory) in the **JAVA_OPTS** parameter based on the site requirements. + +   f. Check whether the alarm is cleared. + +      - If yes, no further action is required. +      - If no, go to :ref:`2 `. + +#. .. _alm-44006__en-us_topic_0225312713_li572522141314: + +   Collect fault information. + +   a. On MRS Manager, choose **System** > **Export Log**. + +   b. Call the OTC Customer Hotline for support. + +      Germany: 0800 330 44 44 + +      International: +800 44556600 + +Reference +--------- + +None diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45175_average_time_for_calling_obs_metadata_apis_is_greater_than_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45175_average_time_for_calling_obs_metadata_apis_is_greater_than_the_threshold.rst new file mode 100644 index 0000000..f95f56b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45175_average_time_for_calling_obs_metadata_apis_is_greater_than_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: ALM-45175.html + +.. _ALM-45175: + +ALM-45175 Average Time for Calling OBS Metadata APIs Is Greater than the Threshold +================================================================================== + +Description +----------- + +The system checks whether the average duration for calling OBS metadata APIs is greater than the threshold every 30 seconds. This alarm is generated when the number of consecutive times that the average time exceeds the specified threshold is greater than the number of smoothing times. + +This alarm is automatically cleared when the average duration for calling the OBS metadata APIs is lower than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45175 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm.
| ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the average time for calling the OBS metadata APIs exceeds the threshold, the upper-layer big data computing services may be affected. To be more specific, the execution time of some computing tasks will exceed the threshold. + +Possible Causes +--------------- + +Frame freezing occurs on the OBS server, or the network between the OBS client and the OBS server is unstable. + +Procedure +--------- + +**Check the heap memory usage.** + +#. On the **FusionInsight Manager** homepage, choose **O&M** > **Alarm** > **Alarms** > **Average Time for Calling the OBS Metadata API Exceeds the Threshold**, view the role name in **Location**, and check the instance IP address. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **meta** > **Instance** > **meta** (IP address of the instance for which the alarm is generated). Click the drop-down list in the upper right corner of the chart area and choose **Customize**. In the dialog box that is displayed, select **Average time of OBS interface calls** from **OBS Meta data Operations**, and click **OK**. Check whether the average time of OBS metadata API calls exceeds the threshold. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-45175__li5868113155910: + + Choose **Cluster** > *Name of the desired cluster* > **O&M** > **Alarm** > **Thresholds** > **meta** > **Average Time for Calling the OBS Metadata API**. Increase the threshold or smoothing times as required. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Collect the fault information.** + +5. .. _alm-45175__li4749473185459: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +6. In the **Services** area, select **NodeAgent**, **NodeMetricAgent**, **OmmServer**, and **OmmAgent** under OMS. + +7. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +8. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0276137859.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45176_success_rate_of_calling_obs_metadata_apis_is_lower_than_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45176_success_rate_of_calling_obs_metadata_apis_is_lower_than_the_threshold.rst new file mode 100644 index 0000000..ca973c1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45176_success_rate_of_calling_obs_metadata_apis_is_lower_than_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: ALM-45176.html + +.. _ALM-45176: + +ALM-45176 Success Rate of Calling OBS Metadata APIs Is Lower than the Threshold +=============================================================================== + +Description +----------- + +The system checks whether the success rate of calling OBS metadata APIs is lower than the threshold every 30 seconds. 
This alarm is generated when the success rate is lower than the threshold. + +This alarm is automatically cleared when the success rate of calling OBS metadata APIs is greater than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45176 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the success rate of calling the OBS metadata APIs is less than the threshold, the upper-layer big data computing services may be affected. To be more specific, some computing tasks may fail to be executed. + +Possible Causes +--------------- + +An execution exception or severe timeout occurs on the OBS server. + +Procedure +--------- + +**Check the success rate of calling OBS metadata APIs.** + +#. On the **FusionInsight Manager** homepage, choose **O&M** > **Alarm** > **Alarms** > **Success Rate for Calling the OBS Metadata API Is Lower Than the Threshold**, view the role name in **Location**, and check the instance IP address. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **meta** > **Instance** > **meta** (IP address of the instance for which the alarm is generated). Click the drop-down list in the upper right corner of the chart area and choose **Customize**. In the dialog box that is displayed, select **Success percent of OBS interface calls** from **OBS Meta data Operations**, and click **OK**. Check whether the success rate of calling OBS metadata APIs is lower than the threshold. + +   - If yes, go to :ref:`3 `. +   - If no, go to :ref:`5 `. + +#. .. _alm-45176__li5868113155910: + +   Choose **Cluster** > *Name of the desired cluster* > **O&M** > **Alarm** > **Thresholds** > **meta** > **Success Rate for Calling the OBS Metadata API**. Increase the threshold or smoothing times as required. + +#. Check whether the alarm is cleared. + +   - If yes, no further action is required. +   - If no, go to :ref:`5 `. + +**Collect the fault information.** + +5. .. _alm-45176__li4749473185459: + +   On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +6. In the **Services** area, select **NodeAgent**, **NodeMetricAgent**, **OmmServer**, and **OmmAgent** under OMS. + +7. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +8. Contact O&M personnel and provide the collected logs.
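+
+For reference, the check that the threshold and smoothing times above apply to is a simple rate comparison made every 30 seconds. The sketch below is illustrative only; the function name is invented here and this is not the meta service's actual implementation:
+
+.. code-block:: python
+
+   def obs_metadata_success_alarm(succeeded_calls, total_calls, threshold_percent):
+       """Alarm condition for one 30-second check window."""
+       if total_calls == 0:
+           return False  # nothing to evaluate in this window
+       success_rate = 100.0 * succeeded_calls / total_calls
+       # The alarm condition holds while the success rate stays below the threshold.
+       return success_rate < threshold_percent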
+ +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0276137858.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45177_success_rate_of_calling_obs_data_read_apis_is_lower_than_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45177_success_rate_of_calling_obs_data_read_apis_is_lower_than_the_threshold.rst new file mode 100644 index 0000000..08a4cda --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45177_success_rate_of_calling_obs_data_read_apis_is_lower_than_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: ALM-45177.html + +.. _ALM-45177: + +ALM-45177 Success Rate of Calling OBS Data Read APIs Is Lower than the Threshold +================================================================================ + +Description +----------- + +The system checks whether the success rate of calling APIs for reading OBS data is lower than the threshold every 30 seconds. This alarm is generated when the success rate is lower than the threshold. + +This alarm is automatically cleared when the success rate of calling APIs for reading OBS data is greater than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45177 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the success rate of calling the OBS APIs for reading data is less than the threshold, the upper-layer big data computing services may be affected. To be more specific, some computing tasks may fail to be executed. + +Possible Causes +--------------- + +An execution exception or severe timeout occurs on the OBS server. + +Procedure +--------- + +**Check the heap memory usage.** + +#. On the **FusionInsight Manager** homepage, choose **O&M** > **Alarm** > **Alarms** > **Success Rate for Calling the OBS Data Read API Is Lower Than the Threshold**, view the role name in **Location**, and check the instance IP address. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **meta** > **Instance** > **meta** (IP address of the instance for which the alarm is generated). 
Click the drop-down list in the upper right corner of the chart area and choose **Customize**. In the dialog box that is displayed, select **Success percent of OBS data read operation interface calls** from **OBS data read operation**, and click **OK**. Check whether the success rate of calling the OBS data read APIs is lower than the threshold. + +   - If yes, go to :ref:`3 `. +   - If no, go to :ref:`5 `. + +#. .. _alm-45177__li5868113155910: + +   Choose **Cluster** > *Name of the desired cluster* > **O&M** > **Alarm** > **Thresholds** > **meta** > **Success Rate for Calling the OBS Data Read API**. Increase the threshold or smoothing times as required. + +#. Check whether the alarm is cleared. + +   - If yes, no further action is required. +   - If no, go to :ref:`5 `. + +**Collect the fault information.** + +5. .. _alm-45177__li4749473185459: + +   On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +6. In the **Services** area, select **NodeAgent**, **NodeMetricAgent**, **OmmServer**, and **OmmAgent** under OMS. + +7. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +8. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0276137857.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45178_success_rate_of_calling_obs_data_write_apis_is_lower_than_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45178_success_rate_of_calling_obs_data_write_apis_is_lower_than_the_threshold.rst new file mode 100644 index 0000000..0344d38 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45178_success_rate_of_calling_obs_data_write_apis_is_lower_than_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: ALM-45178.html + +.. _ALM-45178: + +ALM-45178 Success Rate of Calling OBS Data Write APIs Is Lower Than the Threshold +================================================================================= + +Description +----------- + +The system checks whether the success rate of calling APIs for writing OBS data is lower than the threshold every 30 seconds. This alarm is generated when the success rate is lower than the threshold. + +This alarm is automatically cleared when the success rate of calling APIs for writing OBS data is greater than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45178 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated.
| ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +If the success rate of calling the OBS APIs for writing data is lower than the threshold, the upper-layer big data computing services may be affected. Specifically, some computing tasks may fail to be executed. + +Possible Causes +--------------- + +An execution exception or severe timeout occurs on the OBS server. + +Procedure +--------- + +**Check the success rate of calling OBS data write APIs.** + +#. On the **FusionInsight Manager** homepage, choose **O&M** > **Alarm** > **Alarms** > **Success Rate for Calling the OBS Data Write API Is Lower Than the Threshold**, view the role name in **Location**, and check the instance IP address. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **meta** > **Instance** > **meta** (IP address of the instance for which the alarm is generated). Click the drop-down list in the upper right corner of the chart area and choose **Customize**. In the dialog box that is displayed, select **Success percent of OBS data write operation interface calls** from **OBS data write operation**, and click **OK**. Check whether the success rate of calling OBS data write APIs is lower than the threshold. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm-45178__li5868113155910: + + Choose **Cluster** > *Name of the desired cluster* > **O&M** > **Alarm** > **Thresholds** > **meta** > **Success Rate for Calling the OBS Data Write API**. Increase the threshold or smoothing times as required. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Collect the fault information.** + +5. .. _alm-45178__li4749473185459: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +6. In the **Services** area, select **NodeAgent**, **NodeMetricAgent**, **OmmServer**, and **OmmAgent** under OMS. + +7. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +8. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +..
|image1| image:: /_static/images/en-us_image_0276137853.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45179_number_of_failed_obs_readfully_api_calls_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45179_number_of_failed_obs_readfully_api_calls_exceeds_the_threshold.rst new file mode 100644 index 0000000..21f3318 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45179_number_of_failed_obs_readfully_api_calls_exceeds_the_threshold.rst @@ -0,0 +1,95 @@ +:original_name: ALM-45179.html + +.. _ALM-45179: + +ALM-45179 Number of Failed OBS readFully API Calls Exceeds the Threshold +======================================================================== + +Description +----------- + +The system checks whether the number of failed OBS readFully API calls exceeds the threshold every 30 seconds. This alarm is generated when the number of failed API calls exceeds the threshold. + +This alarm is automatically cleared when the number of failed OBS readFully API calls is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45179 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Certain upper-layer big data computing tasks will fail to execute. + +Possible Causes +--------------- + +An execution exception or severe timeout occurs on the OBS server. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Thresholds**. On the **Thresholds** page, choose **meta** > **Number of failed calls to the OBS readFully interface**. In the right pane, set **Threshold** or **Trigger Count** to a larger value as required. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm-45179__li1689719391331: + + Contact OBS O&M personnel to check whether the OBS service is normal. + + - If yes, go to :ref:`4 `. + - If no, contact OBS O&M personnel to restore the OBS service. + +**Collect fault information.** + +4. .. _alm-45179__li1591285591014: + + On FusionInsight Manager, choose **Cluster** > **Services** > **meta**. On the page that is displayed, click the **Chart** tab. On this tab page, select **OBS data read operation** in the **Chart Category** area. 
In the **Number of failed calls to the OBS readFully interface-All Instances** chart, view the host name of the instance that has the maximum number of failed OBS readFully API calls. For example, the host name is **node-ana-corevpeO003**. + + |image1| + +5. Choose **O&M** > **Log** > **Download** and select **meta** and **meta** under it for **Service**. + +6. Select the host obtained in :ref:`4 ` for **Hosts**. + +7. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +8. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001298155472.png +.. |image2| image:: /_static/images/en-us_image_0000001296525586.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45180_number_of_failed_obs_read_api_calls_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45180_number_of_failed_obs_read_api_calls_exceeds_the_threshold.rst new file mode 100644 index 0000000..a2c4ca4 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45180_number_of_failed_obs_read_api_calls_exceeds_the_threshold.rst @@ -0,0 +1,95 @@ +:original_name: ALM-45180.html + +.. _ALM-45180: + +ALM-45180 Number of Failed OBS read API Calls Exceeds the Threshold +=================================================================== + +Description +----------- + +The system checks whether the number of failed OBS read API calls exceeds the threshold every 30 seconds. This alarm is generated when the number of failed API calls exceeds the threshold. + +This alarm is automatically cleared when the number of failed OBS read API calls is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45180 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Certain upper-layer big data computing tasks will fail to execute. + +Possible Causes +--------------- + +An execution exception or severe timeout occurs on the OBS server. + +Procedure +--------- + +#. 
Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Thresholds**. On the **Thresholds** page, choose **meta** > **Number of failed calls to the OBS read interface**. In the right pane, set **Threshold** or **Trigger Count** to a larger value as required. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm-45180__li6758181214398: + + Contact OBS O&M personnel to check whether the OBS service is normal. + + - If yes, go to :ref:`4 `. + - If no, contact OBS O&M personnel to restore the OBS service. + +**Collect fault information.** + +4. .. _alm-45180__li1591285591014: + + On FusionInsight Manager, choose **Cluster** > **Services** > **meta**. On the page that is displayed, click the **Chart** tab. On this tab page, select **OBS data read operation** in the **Chart Category** area. In the **Number of failed calls to the OBS read interface-All Instances** chart, view the host name of the instance that has the maximum number of failed OBS read API calls. For example, the host name is **node-ana-corevpeO003**. + + |image1| + +5. Choose **O&M** > **Log** > **Download** and select **meta** and **meta** under it for **Service**. + +6. Select the host obtained in :ref:`4 ` for **Hosts**. + +7. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +8. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001297838112.png +.. |image2| image:: /_static/images/en-us_image_0000001296365606.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45181_number_of_failed_obs_write_api_calls_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45181_number_of_failed_obs_write_api_calls_exceeds_the_threshold.rst new file mode 100644 index 0000000..7421cd9 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45181_number_of_failed_obs_write_api_calls_exceeds_the_threshold.rst @@ -0,0 +1,95 @@ +:original_name: ALM-45181.html + +.. _ALM-45181: + +ALM-45181 Number of Failed OBS write API Calls Exceeds the Threshold +==================================================================== + +Description +----------- + +The system checks whether the number of failed OBS write API calls exceeds the threshold every 30 seconds. This alarm is generated when the number of failed API calls exceeds the threshold. + +This alarm is automatically cleared when the number of failed OBS write API calls is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45181 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. 
| ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Certain upper-layer big data computing tasks will fail to execute. + +Possible Causes +--------------- + +An execution exception or severe timeout occurs on the OBS server. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Thresholds**. On the **Thresholds** page, choose **meta** > **Number of failed calls to the OBS write interface**. In the right pane, set **Threshold** or **Trigger Count** to a larger value as required. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm-45181__li1689719391331: + + Contact OBS O&M personnel to check whether the OBS service is normal. + + - If yes, go to :ref:`4 `. + - If no, contact OBS O&M personnel to restore the OBS service. + +**Collect fault information.** + +4. .. _alm-45181__li1591285591014: + + On FusionInsight Manager, choose **Cluster** > **Services** > **meta**. On the page that is displayed, click the **Chart** tab. On this tab page, select **OBS data write operation** in the **Chart Category** area. In the **Number of failed calls to the OBS write interface-All Instances** chart, view the host name of the instance that has the maximum number of failed OBS write API calls. For example, the host name is **node-ana-corevpeO003**. + + |image1| + +5. Choose **O&M** > **Log** > **Download** and select **meta** and **meta** under it for **Service**. + +6. Select the host obtained in :ref:`4 ` for **Hosts**. + +7. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. + +8. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001351040425.png +.. |image2| image:: /_static/images/en-us_image_0000001349585581.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45182_number_of_throttled_obs_api_calls_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45182_number_of_throttled_obs_api_calls_exceeds_the_threshold.rst new file mode 100644 index 0000000..69dc54d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45182_number_of_throttled_obs_api_calls_exceeds_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: ALM-45182.html + +.. 
_ALM-45182: + +ALM-45182 Number of Throttled OBS API Calls Exceeds the Threshold +================================================================= + +Description +----------- + +The system checks whether the number of throttled OBS API calls exceeds the threshold every 30 seconds. This alarm is generated when the number of throttled API calls exceeds the threshold. + +This alarm is automatically cleared when the number of throttled OBS API calls is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45182 Minor Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Certain upper-layer big data computing tasks will fail to execute. + +Possible Causes +--------------- + +The frequency of requesting OBS APIs is too high. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Thresholds**. On the **Thresholds** page, choose **meta** > **Number of Throttled OBS API Calls**. In the right pane, set **Threshold** or **Trigger Count** to a larger value as required. + +#. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm-45182__li1689719391331: + + Contact OBS O&M personnel to check whether the OBS service is normal. + + - If yes, go to :ref:`4 `. + - If no, contact OBS O&M personnel to restore the OBS service. + +**Collect fault information.** + +4. .. _alm-45182__li19825189163317: + + On FusionInsight Manager, choose **Cluster** > **Services** > **meta**. On the page that is displayed, click the **Chart** tab. On this tab page, select **OBS Throttled** in the **Chart Category** area. In the **Number of Throttled OBS API Calls-All Instances** chart, view the host name of the instance that has the maximum number of throttled OBS API calls. For example, the host name is **node-ana-corevpeO003**. + + |image1| + +5. Choose **O&M** > **Log** > **Download** and select **meta** and **meta** under it for **Service**. + +6. Select the host obtained in :ref:`4 ` for **Hosts**. + +7. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 30 minutes ahead of and after the alarm generation time respectively. Then, click **Download**. +8. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0000001297841644.png +.. |image2| image:: /_static/images/en-us_image_0000001349825057.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45275_ranger_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45275_ranger_service_unavailable.rst new file mode 100644 index 0000000..0f06c58 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45275_ranger_service_unavailable.rst @@ -0,0 +1,102 @@ +:original_name: ALM-45275.html + +.. _ALM-45275: + +ALM-45275 Ranger Service Unavailable +==================================== + +Description +----------- + +The alarm module checks the Ranger service status every 180 seconds. This alarm is generated if the Ranger service is abnormal. + +This alarm is cleared after the Ranger service recovers. + +Attributes +---------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +45275 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ========================================= +Name Meaning +=========== ========================================= +Source Cluster for which the alarm is generated. +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +=========== ========================================= + +Impact on the System +-------------------- + +When the Ranger service is unavailable, Ranger cannot work properly and the native Ranger UI cannot be accessed. + +Possible Causes +--------------- + +- The DBService service on which Ranger depends is abnormal. +- The RangerAdmin role instance is abnormal. + +Procedure +--------- + +**Check the DBService process status.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms**. On the displayed page, check whether the ALM-27001 DBService Service Unavailable alarm is reported. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`3 `. + +#. .. _alm-45275__li24833719161349: + + Rectify the DBService service fault by following the handling procedure of ALM-27001 DBService Service Unavailable. After the DBService alarm is cleared, check whether Ranger Service Unavailable alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +**Check all RangerAdmin instances.** + +3. .. _alm-45275__li43869374161349: + + Log in to the node where the RangerAdmin instance is located as user **omm** and run the **ps -ef|grep "proc_rangeradmin"** command to check whether the RangerAdmin process exists on the current node. + + - If yes, go to :ref:`5 `. + - If no, restart the faulty RangerAdmin instance or Ranger service and go to :ref:`4 `. + +4. .. _alm-45275__li60791811161349: + + In the alarm list, check whether the alarm "Ranger Service Unavailable" is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Collect the fault information.** + +5. .. _alm-45275__li16749195915615: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +6. Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +7. 
Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +8. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +After the fault that triggers the alarm is rectified, the alarm is automatically cleared. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0293180907.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45276_abnormal_rangeradmin_status.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45276_abnormal_rangeradmin_status.rst new file mode 100644 index 0000000..738c1cf --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45276_abnormal_rangeradmin_status.rst @@ -0,0 +1,88 @@ +:original_name: ALM-45276.html + +.. _ALM-45276: + +ALM-45276 Abnormal RangerAdmin Status +===================================== + +Description +----------- + +The alarm module checks the RangerAdmin service status every 60 seconds. This alarm is generated if RangerAdmin is unavailable. + +This alarm is automatically cleared after the RangerAdmin service recovers. + +Attributes +---------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +45276 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ========================================= +Name Meaning +=========== ========================================= +Source Cluster for which the alarm is generated. +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +=========== ========================================= + +Impact on the System +-------------------- + +If the status of a RangerAdmin is abnormal, access to the Ranger native UI is not affected. If there are two abnormal RangerAdmin instances, the Ranger native UI cannot be accessed and operations such as creating, modifying, and deleting policies are unavailable. + +Possible Causes +--------------- + +The RangerAdmin port is not started. + +Procedure +--------- + +**Check the port process.** + +#. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, and click |image1| to view the name of the host for which the alarm is generated. + +#. Log in to the node where the RangerAdmin instance is located as user **omm**. Run the **ps -ef|grep "proc_rangeradmin" \| grep -v grep \| awk -F ' ' '{print $2}'** command to obtain **pid** of the RangerAdmin process, and run the **netstat -anp|grep pid \| grep LISTEN** command to check whether the RangerAdmin process listens to port 21401 in the security mode and port 21400 in standard mode. + + - If yes, go to :ref:`4 `. + - If no, restart the faulty RangerAdmin instance or Ranger service and go to :ref:`3 `. + +#. .. _alm-45276__li24833719161349: + + In the alarm list, check whether the "Abnormal RangerAdmin Status" alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Collect the fault information.** + +4. .. _alm-45276__li16749195915615: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +5. 
Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +6. Click |image2| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +7. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +After the fault that triggers the alarm is rectified, the alarm is automatically cleared. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001072559365.png +.. |image2| image:: /_static/images/en-us_image_0293234930.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45277_rangeradmin_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45277_rangeradmin_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..1ef0fc0 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45277_rangeradmin_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-45277.html + +.. _ALM-45277: + +ALM-45277 RangerAdmin Heap Memory Usage Exceeds the Threshold +============================================================= + +Description +----------- + +The system checks the heap memory usage of the RangerAdmin service every 60 seconds. This alarm is generated when the system detects that the heap memory usage of the RangerAdmin instance exceeds the threshold (95% of the maximum memory) for 10 consecutive times. This alarm is cleared when the heap memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45277 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Heap memory overflow may cause service breakdown. + +Possible Causes +--------------- + +The heap memory usage of the RangerAdmin instance is high or the heap memory is improperly allocated. + +Procedure +--------- + +**Check the heap memory usage.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45277 RangerAdmin Heap Memory Usage Exceeds the Threshold**. Check the location information of the alarm and view the host name of the instance for which the alarm is generated. + +#. .. 
_alm-45277__li58624704: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. Select the role corresponding to the host name of the instance for which the alarm is generated. Click the drop-down list in the upper right corner of the chart area and choose **Customize** > **CPU and Memory** > **RangerAdmin Heap Memory Usage**. Click **OK**. + +#. Check whether the heap memory used by RangerAdmin reaches the threshold (95% of the maximum heap memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-45277__li11521246145513: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **RangerAdmin** > **Instance Configuration**. Click **All Configurations**, and choose **RangerAdmin** > **System**. Increase the value of **-Xmx** in the **GC_OPTS** parameter based on the site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the heap memory configured for RangerAdmin cannot meet the heap memory required by the RangerAdmin process. You are advised to check the heap memory usage of RangerAdmin and change the value of **-Xmx** in **GC_OPTS** to the twice of the heap memory used by RangerAdmin. The value can be changed based on the actual service scenario. For details, see :ref:`2 `. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45277__li42224042151734: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0293235730.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45278_rangeradmin_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45278_rangeradmin_direct_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..6406314 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45278_rangeradmin_direct_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-45278.html + +.. _ALM-45278: + +ALM-45278 RangerAdmin Direct Memory Usage Exceeds the Threshold +=============================================================== + +Description +----------- + +The system checks the direct memory usage of the RangerAdmin service every 60 seconds. This alarm is generated when the direct memory usage of the RangerAdmin instance exceeds the threshold (80% of the maximum memory) for five consecutive times. This alarm is cleared when the direct memory usage of RangerAdmin is less than or equal to the threshold. 
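+
+The 80% in this check is measured against the maximum direct memory configured for the RangerAdmin process through **-XX:MaxDirectMemorySize** in the **GC_OPTS** parameter, as described in the procedure below. If you want to confirm on the alarmed node which limit the running process was actually started with, the following minimal sketch is one possible way. It is an illustration only, not part of the product; it assumes a Linux node and reuses the **proc_rangeradmin** process marker referenced elsewhere in this guide.
+
+.. code-block:: bash
+
+   # Illustration only: locate the RangerAdmin process on this node and print
+   # the direct memory limit from its command line, if one is set explicitly.
+   pid=$(ps -ef | grep "proc_rangeradmin" | grep -v grep | awk '{print $2}' | head -n 1)
+   if [ -n "$pid" ]; then
+       tr '\0' '\n' < "/proc/$pid/cmdline" | grep "MaxDirectMemorySize" \
+           || echo "No explicit -XX:MaxDirectMemorySize found for PID $pid"
+   else
+       echo "RangerAdmin process not found on this node"
+   fi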
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45278 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Direct memory overflow may cause service breakdown. + +Possible Causes +--------------- + +The direct memory of the RangerAdmin instance is overused or the direct memory is inappropriately allocated. As a result, the memory usage exceeds the threshold. + +Procedure +--------- + +**Check the direct memory usage.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45278 RangerAdmin Direct Memory Usage Exceeds the Threshold**. Check the location information of the alarm and view the host name of the instance for which the alarm is generated. + +#. .. _alm-45278__li7677390: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. Select the role corresponding to the host name of the instance for which the alarm is generated. Click the drop-down list in the upper right corner of the chart area and choose **Customize** > **CPU and Memory** > **RangerAdmin Direct Memory Usage**. Click **OK**. + +#. Check whether the direct memory used by RangerAdmin reaches the threshold (80% of the maximum direct memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-45278__li10450762161055: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **RangerAdmin** > **Instance Configuration**. Click **All Configurations**, and choose **RangerAdmin** > **System**. Increase the value of **-XX:MaxDirectMemorySize** in the **GC_OPTS** parameter based on the site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the direct memory configured for RangerAdmin cannot meet the direct memory required by the RangerAdmin process. You are advised to check the direct memory usage of RangerAdmin and change the value of **-XX:MaxDirectMemorySize** in **GC_OPTS** to the twice of the direct memory used by RangerAdmin. You can change the value based on the actual service scenario. For details, see :ref:`2 `. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45278__d0e43963: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. 
Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0293242788.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45279_rangeradmin_non_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45279_rangeradmin_non_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..8cfe9a3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45279_rangeradmin_non_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,98 @@ +:original_name: ALM-45279.html + +.. _ALM-45279: + +ALM-45279 RangerAdmin Non Heap Memory Usage Exceeds the Threshold +================================================================= + +Description +----------- + +The system checks the non-heap memory usage of the RangerAdmin service every 60 seconds. This alarm is generated when the non-heap memory usage of the RangerAdmin instance exceeds the threshold (80% of the maximum memory) for five consecutive times. This alarm is cleared when the non-heap memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45279 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Non-heap memory overflow may cause service breakdown. + +Possible Causes +--------------- + +The non-heap memory usage of the RangerAdmin instance is high or the non-heap memory is improperly allocated. + +Procedure +--------- + +**Check non-heap memory usage.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45279 RangerAdmin Non Heap Memory Usage Exceeds the Threshold**. Check the location information of the alarm and view the host name of the instance for which the alarm is generated. + +#. 
On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. Select the role corresponding to the host name of the instance for which the alarm is generated. Click the drop-down list in the upper right corner of the chart area and choose **Customize** > **CPU and Memory** > **RangerAdmin Non Heap Memory Usage**. Click **OK**. + +#. Check whether the non-heap memory used by RangerAdmin reaches the threshold (80% of the maximum non-heap memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-45279__li29985659161559: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **RangerAdmin** > **Instance Configuration**. Click **All Configurations**, and choose **RangerAdmin** > **System**. Set **-XX:MaxPermSize** in the **GC_OPTS** parameter to a larger value based on site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the non-heap memory size configured for the RangerAdmin instance cannot meet the non-heap memory required by the RangerAdmin process. You are advised to change the value of **-XX:MaxPermSize** in **GC_OPTS** to twice the current non-heap memory usage or change the value based on the site requirements. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45279__d0e44186: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0293245149.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45280_rangeradmin_gc_duration_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45280_rangeradmin_gc_duration_exceeds_the_threshold.rst new file mode 100644 index 0000000..11a61e0 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45280_rangeradmin_gc_duration_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-45280.html + +.. _ALM-45280: + +ALM-45280 RangerAdmin GC Duration Exceeds the Threshold +======================================================= + +Description +----------- + +The system checks the GC duration of the RangerAdmin process every 60 seconds. This alarm is generated when the GC duration of the RangerAdmin process exceeds the threshold (12 seconds by default) for five consecutive times. This alarm is cleared when the GC duration is less than the threshold.
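+
+The value compared with the threshold is the GC time accumulated by the RangerAdmin JVM within each 60-second sampling interval. If you need to cross-check this figure on the alarmed node, the sketch below shows one possible way. It is an illustration only, not part of the product; it assumes that the **jstat** tool from the cluster JDK is available to user **omm** and reuses the **proc_rangeradmin** process marker referenced elsewhere in this guide.
+
+.. code-block:: bash
+
+   # Illustration only: sample the JVM's cumulative GC time (GCT, in seconds)
+   # twice, 60 seconds apart, and print the GC time spent in that interval.
+   pid=$(ps -ef | grep "proc_rangeradmin" | grep -v grep | awk '{print $2}' | head -n 1)
+   gct() { jstat -gcutil "$pid" | awk 'NR == 2 {print $NF}'; }  # GCT is the last column
+   t1=$(gct)
+   sleep 60
+   t2=$(gct)
+   awk -v a="$t1" -v b="$t2" 'BEGIN {printf "GC time in the last 60 seconds: %.2f s\n", b - a}'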
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45280 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +The RangerAdmin responds slowly. + +Possible Causes +--------------- + +The heap memory of the RangerAdmin instance is overused or the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC duration.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45280 RangerAdmin GC Duration Exceeds the Threshold**. Check the location information of the alarm and view the host name of the instance for which the alarm is generated. + +#. .. _alm-45280__li43047473: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. Select the role corresponding to the host name of the instance for which the alarm is generated and click the drop-down list in the upper right corner of the chart area. Choose **Customize** > **GC** > **RangerAdmin GC Duration**. Click **OK**. + +#. Check whether the GC duration of the RangerAdmin process collected every minute exceeds the threshold (12 seconds by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-45280__d0e44388: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **RangerAdmin** > **Instance Configuration**. Click **All Configurations**, and choose **RangerAdmin** > **System**. Increase the value of **-Xmx** in the **GC_OPTS** parameter based on the site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the heap memory configured for RangerAdmin cannot meet the heap memory required by the RangerAdmin process. You are advised to check the heap memory usage of RangerAdmin and change the value of **-Xmx** in **GC_OPTS** to the twice of the heap memory used by RangerAdmin. The value can be changed based on the actual service scenario. For details, see :ref:`2 `. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45280__d0e44409: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. 
Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0293246465.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45281_usersync_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45281_usersync_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..69eca03 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45281_usersync_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,98 @@ +:original_name: ALM-45281.html + +.. _ALM-45281: + +ALM-45281 UserSync Heap Memory Usage Exceeds the Threshold +========================================================== + +Description +----------- + +The system checks the heap memory usage of the UserSync service every 60 seconds. This alarm is generated when the system detects that the heap memory usage of the UserSync instance exceeds the threshold (95% of the maximum memory) for 10 consecutive times. This alarm is cleared when the heap memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45281 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Heap memory overflow may cause service breakdown. + +Possible Causes +--------------- + +The heap memory usage of the UserSync instance is high or the heap memory is improperly allocated. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45281 UserSync Heap Memory Usage Exceeds the Threshold**. Check the location information of the alarm and view the host name of the instance for which the alarm is generated. + +#. .. _alm-45281__li58624704: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. Select the role corresponding to the host name of the instance for which the alarm is generated. 
Click the drop-down list in the upper right corner of the chart area and choose **Customize** > **CPU and Memory** > **UserSync Heap Memory Usage**. Click **OK**. + +#. Check whether the heap memory used by UserSync reaches the threshold (95% of the maximum heap memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-45281__li11521246145513: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **UserSync** > **Instance Configuration**. Click **All Configurations**, and choose **UserSync** > **System**. Increase the value of **-Xmx** in the **GC_OPTS** parameter based on the site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the heap memory configured for UserSync cannot meet the heap memory required by the UserSync process. You are advised to change the **-Xmx** value of **GC_OPTS** to twice that of the heap memory used by UserSync. You can change the value based on the actual service scenario. For details about how to check the UserSync heap memory usage, see :ref:`2 `. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45281__li42224042151734: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0293246731.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45282_usersync_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45282_usersync_direct_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..8449d7a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45282_usersync_direct_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-45282.html + +.. _ALM-45282: + +ALM-45282 UserSync Direct Memory Usage Exceeds the Threshold +============================================================ + +Description +----------- + +The system checks the direct memory usage of the UserSync service every 60 seconds. This alarm is generated when the direct memory usage of the UserSync instance exceeds the threshold (80% of the maximum memory) for five consecutive times. This alarm is cleared when the UserSync direct memory usage is less than or equal to the threshold. 
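+
+This check is performed by FusionInsight Manager itself: every 60 seconds it compares the direct memory used by the UserSync instance with 80% of the configured maximum, and the alarm is generated only after five consecutive samples exceed that percentage. The sketch below only illustrates the percentage arithmetic of a single sample with hypothetical numbers; it is not part of the product.
+
+.. code-block:: bash
+
+   # Illustration with hypothetical values: is one sample above the 80% threshold?
+   used_mb=900    # hypothetical direct memory currently used by UserSync
+   max_mb=1024    # hypothetical -XX:MaxDirectMemorySize value
+   awk -v u="$used_mb" -v m="$max_mb" 'BEGIN {
+       pct = u / m * 100
+       printf "direct memory usage: %.1f%% of the maximum -> %s\n", pct, (pct > 80 ? "above the 80% threshold" : "within the threshold")
+   }'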
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45282 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Direct memory overflow may cause service breakdown. + +Possible Causes +--------------- + +The direct memory of the UserSync instance is overused or the direct memory is inappropriately allocated. As a result, the memory usage exceeds the threshold. + +Procedure +--------- + +**Check the direct memory usage.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45282 UserSync Direct Memory Usage Exceeds the Threshold**. Check the location information of the alarm. Check the name of the instance host for which the alarm is generated. + +#. .. _alm-45282__li7677390: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. Select the role corresponding to the host name of the instance for which the alarm is generated. Click the drop-down list in the upper right corner of the chart area and choose **Customize** > **CPU and Memory** > **UserSync Direct Memory Usage**. Click **OK**. + +#. Check whether the direct memory used by the UserSync reaches the threshold (80% of the maximum direct memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-45282__li10450762161055: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **UserSync** > **Instance Configuration**. Click **All Configurations**, and choose **UserSync** > **System**. Increase the value of **-XX:MaxDirectMemorySize** in the **GC_OPTS** parameter based on the site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the direct memory configured for UserSync cannot meet the direct memory required by the UserSync process. You are advised to check the direct memory usage of UserSync and change the value of **-XX:MaxDirectMemorySize** in **GC_OPTS** to the twice of the direct memory used by UserSync. You can change the value based on the actual service scenario. For details, see :ref:`2 `. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45282__d0e43963: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. 
Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0293247048.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45283_usersync_non_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45283_usersync_non_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..31d4f31 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45283_usersync_non_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,98 @@ +:original_name: ALM-45283.html + +.. _ALM-45283: + +ALM-45283 UserSync Non Heap Memory Usage Exceeds the Threshold +============================================================== + +Description +----------- + +The system checks the non-heap memory usage of the UserSync service every 60 seconds. This alarm is generated when the non-heap memory usage of the UserSync instance exceeds the threshold (80% of the maximum memory) for five consecutive times. This alarm is cleared when the non-heap memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45283 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Non-heap memory overflow may cause service breakdown. + +Possible Causes +--------------- + +The non-heap memory of the UserSync process is overused or the non-heap memory is inappropriately allocated. + +Procedure +--------- + +**Check non-heap memory usage.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45283 UserSync Non Heap Memory Usage Exceeds the Threshold**. Check the location information of the alarm and view the host name of the instance for which the alarm is generated. + +#. On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. 
Select the role corresponding to the host name of the instance for which the alarm is generated. Click the drop-down list in the upper right corner of the chart area and choose **Customize** > **CPU and Memory** > **UserSync Non Heap Memory Usage**. Click **OK**. + +#. Check whether the non-heap memory used by UserSync reaches the threshold (80% of the maximum non-heap memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-45283__li29985659161559: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **UserSync** > **Instance Configuration**. Click **All Configurations**, and choose **UserSync** > **System**. Set **-XX:MaxPermSize** in the **GC_OPTS** parameter to a larger value based on site requirements and click **Save** to save the configuration. + + .. note:: + + If this alarm is generated, the non-heap memory size configured for the UserSync instance cannot meet the non-heap memory required by the UserSync process. You are advised to change the **-XX:MaxPermSize** value of **GC_OPTS** to twice that of the current non-heap memory size or change the value based on the site requirements. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45283__d0e44186: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0293247437.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45284_usersync_garbage_collection_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45284_usersync_garbage_collection_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..976ae9d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45284_usersync_garbage_collection_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,96 @@ +:original_name: ALM-45284.html + +.. _ALM-45284: + +ALM-45284 UserSync Garbage Collection (GC) Time Exceeds the Threshold +===================================================================== + +Description +----------- + +The system checks the GC duration of the UserSync process every 60 seconds. This alarm is generated when the GC duration of the UserSync process exceeds the threshold (12 seconds by default) for five consecutive times. This alarm is cleared when the GC duration is less than the threshold. 
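+
+As a supplementary check (outside the standard procedure below), the GC time of the UserSync process can also be sampled directly on the alarm node with the standard JDK **jstat** tool. The process-lookup pattern and the placeholder PID are assumptions; adapt them to the actual environment.
+
+.. code-block::
+
+   # Find the PID of the UserSync process (the grep pattern is an assumption).
+   ps -ef | grep -i [u]sersync
+   # Sample GC statistics every 60 seconds, five times. The GCT column is the
+   # cumulative GC time in seconds; compare its increase between samples with
+   # the 12-second threshold.
+   jstat -gcutil <UserSync PID> 60000 5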
+ +Attributes +---------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +45284 Major Yes +======== ============== ===================== + +Parameters +---------- + +================= ========================================= +Name Meaning +================= ========================================= +Source Cluster for which the alarm is generated. +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +Trigger Condition Threshold for triggering the alarm. +================= ========================================= + +Impact on the System +-------------------- + +UserSync responds slowly. + +Possible Causes +--------------- + +The heap memory of the UserSync instance is overused or the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC time.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45284 UserSync Garbage Collection (GC) Time Exceeds the Threshold**. Check the location information of the alarm and view the host name of the instance for which the alarm is generated. + +#. .. _alm-45284__li43047473: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. Select the role corresponding to the host name of the instance for which the alarm is generated and click the drop-down list in the upper right corner of the chart area. Choose **Customize** > **GC** > **UserSync GC Duration**. Click **OK**. + +#. Check whether the GC duration of the UserSync process collected every minute exceeds the threshold (12 seconds by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-45284__d0e44388: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **UserSync** > **Instance Configuration**. Click **All Configurations**, and choose **UserSync** > **System**. Increase the value of **-Xmx** in the **GC_OPTS** parameter based on the site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the heap memory configured for UserSync cannot meet the heap memory required by the UserSync process. You are advised to change the **-Xmx** value of **GC_OPTS** to twice that of the heap memory used by UserSync. You can change the value based on the actual service scenario. For details about how to check the UserSync heap memory usage, see :ref:`2 `. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45284__d0e44409: + + On FusionInsight Manager, choose **O&M** > **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +After the fault that triggers the alarm is rectified, the alarm is automatically cleared. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0293267262.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45285_tagsync_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45285_tagsync_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..b3223da --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45285_tagsync_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,98 @@ +:original_name: ALM-45285.html + +.. _ALM-45285: + +ALM-45285 TagSync Heap Memory Usage Exceeds the Threshold +========================================================= + +Description +----------- + +The system checks the heap memory usage of the TagSync service every 60 seconds. This alarm is generated when the heap memory usage of the TagSync instance exceeds the threshold (95% of the maximum memory) for 10 consecutive times. This alarm is cleared when the heap memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45285 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Heap memory overflow may cause service breakdown. + +Possible Causes +--------------- + +The heap memory usage of the TagSync instance is high or the heap memory is improperly allocated. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45285 TagSync Heap Memory Usage Exceeds the Threshold**. Check the location information of the alarm and view the host name of the instance for which the alarm is generated. + +#. .. _alm-45285__li58624704: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. Select the role corresponding to the host name of the instance for which the alarm is generated. Click the drop-down list in the upper right corner of the chart area and choose **Customize** > **CPU and Memory** > **TagSync Heap Memory Usage**. Click **OK**. + +#. Check whether the heap memory used by TagSync reaches the threshold (95% of the maximum heap memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. 
_alm-45285__li11521246145513: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **TagSync** > **Instance Configuration**. Click **All Configurations** and choose **TagSync** > **System**. Increase the value of **-Xmx** in the **GC_OPTS** parameter based on the site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the heap memory configured for TagSync cannot meet the heap memory required by the TagSync process. You are advised to change the **-Xmx** value of **GC_OPTS** to twice that of the heap memory used by TagSync. You can change the value based on the actual service scenario. For details about how to check the TagSync heap memory usage, see :ref:`2 `. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45285__li42224042151734: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0293268152.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45286_tagsync_direct_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45286_tagsync_direct_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..e1f2c04 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45286_tagsync_direct_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-45286.html + +.. _ALM-45286: + +ALM-45286 TagSync Direct Memory Usage Exceeds the Threshold +=========================================================== + +Description +----------- + +The system checks the direct memory usage of the TagSync service every 60 seconds. This alarm is generated when the direct memory usage of the TagSync instance exceeds the threshold (80% of the maximum memory) for five consecutive times. This alarm is cleared when the TagSync direct memory usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45286 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Direct memory overflow may cause service breakdown. + +Possible Causes +--------------- + +The direct memory of the TagSync instance is overused or the direct memory is inappropriately allocated. As a result, the memory usage exceeds the threshold. + +Procedure +--------- + +**Check the direct memory usage.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45286 TagSync Direct Memory Usage Exceeds the Threshold**. Check the location information of the alarm and view the host name of the instance for which the alarm is generated. + +#. .. _alm-45286__li7677390: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. Select the role corresponding to the host name of the instance for which the alarm is generated. Click the drop-down list in the upper right corner of the chart area and choose **Customize** > **CPU and Memory** > **TagSync Direct Memory Usage**. Click **OK**. + +#. Check whether the direct memory used by TagSync reaches the threshold (80% of the maximum direct memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-45286__li10450762161055: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **TagSync** > **Instance Configuration**. Click **All Configurations** and choose **TagSync** > **System**. Increase the value of **-XX:MaxDirectMemorySize** in the **GC_OPTS** parameter based on the site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the direct memory configured for TagSync cannot meet the direct memory required by the TagSync process. You are advised to check the direct memory usage of TagSync and change the value of **-XX:MaxDirectMemorySize** in **GC_OPTS** to twice the direct memory used by TagSync. You can change the value based on the actual service scenario. For details, see :ref:`2 `. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45286__d0e43963: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0293269028.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45287_tagsync_non_heap_memory_usage_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45287_tagsync_non_heap_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..6cbe80f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45287_tagsync_non_heap_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,98 @@ +:original_name: ALM-45287.html + +.. _ALM-45287: + +ALM-45287 TagSync Non Heap Memory Usage Exceeds the Threshold +============================================================= + +Description +----------- + +The system checks the non-heap memory usage of the TagSync service every 60 seconds. This alarm is generated when the non-heap memory usage of the TagSync instance exceeds the threshold (80% of the maximum memory) for five consecutive times. This alarm is cleared when the non-heap memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45287 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Non-heap memory overflow may cause service breakdown. + +Possible Causes +--------------- + +The non-heap memory of the TagSync process is overused or the non-heap memory is inappropriately allocated. + +Procedure +--------- + +**Check non-heap memory usage.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45287 TagSync Non Heap Memory Usage Exceeds the Threshold**. Check the location information of the alarm and view the host name of the instance for which the alarm is generated. + +#. On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. Select the role corresponding to the host name of the instance for which the alarm is generated. Click the drop-down list in the upper right corner of the chart area and choose **Customize** > **CPU and Memory** > **TagSync Non Heap Memory Usage**. Click **OK**. + +#. Check whether the non-heap memory used by TagSync reaches the threshold (80% of the maximum non-heap memory by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. 
_alm-45287__li29985659161559: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **TagSync** > **Instance Configuration**. Click **All Configurations** and choose **TagSync** > **System**. Set **-XX:MaxPermSize** in the **GC_OPTS** parameter to a larger value based on site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the non-heap memory size configured for the TagSync instance cannot meet the non-heap memory required by the TagSync process. You are advised to change the **-XX:MaxPermSize** value of **GC_OPTS** to twice that of the current non-heap memory size or change the value based on the site requirements. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45287__d0e44186: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0293269047.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45288_tagsync_garbage_collection_gc_time_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45288_tagsync_garbage_collection_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..e01fc76 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45288_tagsync_garbage_collection_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,100 @@ +:original_name: ALM-45288.html + +.. _ALM-45288: + +ALM-45288 TagSync Garbage Collection (GC) Time Exceeds the Threshold +==================================================================== + +Description +----------- + +The system checks the GC duration of the TagSync process every 60 seconds. This alarm is generated when the GC duration of the TagSync process exceeds the threshold (12 seconds by default) for five consecutive times. This alarm is cleared when the GC duration is less than the threshold. + +Attributes +---------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45288 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Name | Meaning | ++===================+=========================================================+ +| Source | Specifies the cluster for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +TagSync responds slowly. + +Possible Causes +--------------- + +The heap memory of the TagSync instance is overused or the heap memory is inappropriately allocated. As a result, GCs occur frequently. + +Procedure +--------- + +**Check the GC duration.** + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45288 TagSync Garbage Collection (GC) Time Exceeds the Threshold**. Check the location information of the alarm and view the host name of the instance for which the alarm is generated. + +#. .. _alm-45288__li43047473: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance**. Select the role corresponding to the host name of the instance for which the alarm is generated and click the drop-down list in the upper right corner of the chart area. Choose **Customize** > **GC** > **TagSync GC Duration**. Click **OK**. + +#. Check whether the GC duration of the TagSync process collected every minute exceeds the threshold (12 seconds by default). + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`6 `. + +#. .. _alm-45288__d0e44388: + + On FusionInsight Manager, choose **Cluster** > **Services** > **Ranger** > **Instance** > **TagSync** > **Instance Configuration**. Click **All Configurations** and choose **TagSync** > **System**. Increase the value of **-Xmx** in the **GC_OPTS** parameter based on the site requirements and save the configuration. + + .. note:: + + If this alarm is generated, the heap memory configured for TagSync cannot meet the heap memory required by the TagSync process. You are advised to change the **-Xmx** value of **GC_OPTS** to twice that of the heap memory used by TagSync. You can change the value based on the actual service scenario. For details about how to check the TagSync heap memory usage, see :ref:`2 `. + +#. Restart the affected services or instances and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +**Collect the fault information.** + +6. .. _alm-45288__d0e44409: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +7. Expand the **Service** drop-down list, and select **Ranger** for the target cluster. + +8. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click **Download**. + +9. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0293269551.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45425_clickhouse_service_unavailable.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45425_clickhouse_service_unavailable.rst new file mode 100644 index 0000000..7bc762f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45425_clickhouse_service_unavailable.rst @@ -0,0 +1,141 @@ +:original_name: ALM-45425.html + +.. _ALM-45425: + +ALM-45425 ClickHouse Service Unavailable +======================================== + +Description +----------- + +The alarm module checks the ClickHouse instance status every 60 seconds. This alarm is generated when the alarm module detects that all ClickHouse instances are abnormal. + +This alarm is cleared when the system detects that any ClickHouse instance is restored and the alarm is cleared. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +45425 Critical Yes +======== ============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The ClickHouse service is abnormal. You cannot use FusionInsight Manager to perform cluster operations on the ClickHouse service. The ClickHouse service function is unavailable. + +Possible Causes +--------------- + +The configuration information in the **metrika.xml** file in the component configuration directory of the faulty ClickHouse instance node is inconsistent with that of the corresponding ClickHouse instance in the ZooKeeper. + +Procedure +--------- + +**Check whether the configuration in metrika.xml of the ClickHouse instance is correct.** + +#. Log in to FusionInsight Manager, choose **Cluster** > **Services** > **ClickHouse** > **Instance**, and locate the abnormal ClickHouse instance based on the alarm information. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`9 `. + +#. .. _alm-45425__li237743710398: + + Log in to the host where the ClickHouse service is abnormal and ping the IP address of another normal ClickHouse instance node to check whether the network connection is normal. + + - If yes, go to :ref:`3 `. + - If no, contact the network administrator to repair the network. + +3. .. 
_alm-45425__li156597363713: + + Choose **Cluster** > **Services** > **ClickHouse** > **Instance**, click the abnormal instance name in the **Role** column, click **Configurations**, search for **macros.id** in the search box, and find the value of **macros.id** of the current instance. + +4. Log in to the host where the ZooKeeper client is located and log in to the ZooKeeper client. + + Switch to the client installation directory. + + Example: **cd /opt/client** + + Run the following command to configure environment variables: + + **source bigdata_env** + + Run the following command to authenticate the user (skip this step in common mode): + + **kinit** *Component service user* + + Run the following command to log in to the client tool: + + **zkCli.sh -server** *service IP address of the node where the ZooKeeper role instance locates*\ **:**\ *client port* + +5. .. _alm-45425__li1377133713911: + + Run the following command to check whether the ClickHouse cluster topology information can be obtained. + + **get /clickhouse/config/**\ *value of* **macros.id** *in* :ref:`3 `/metrika.xml + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`9 `. + +6. .. _alm-45425__li1462431320505: + + Log in to the host where the ClickHouse instance is abnormal and go to the configuration directory of the ClickHouse instance. + + **cd** ${BIGDATA_HOME}\ **/FusionInsight_ClickHouse\_**\ *Version*\ **/**\ x_x\ **\_ClickHouseServer/etc** + + **cat metrika.xml** + +7. Check whether the cluster topology information on ZooKeeper obtained in :ref:`5 ` is the same as that in the **metrika.xml** file in the component configuration directory in :ref:`6 `. + + - If yes, check whether the alarm is cleared. If the alarm persists, go to :ref:`9 `. + - If no, go to :ref:`8 `. + +8. .. _alm-45425__li113661428132312: + + On FusionInsight Manager, choose **Cluster** > **Services** > **ClickHouse**, click **More**, and select **Synchronize Configuration**. Then, check whether the service status is normal and whether the alarm is cleared 5 minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Collect the fault information.** + +9. .. _alm-45425__li62779304563: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +10. Expand the **Service** drop-down list, and select **ClickHouse** for the target cluster. + +11. Choose the corresponding host form the host list. + +12. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +13. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0295554634.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45426_clickhouse_service_quantity_quota_usage_in_zookeeper_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45426_clickhouse_service_quantity_quota_usage_in_zookeeper_exceeds_the_threshold.rst new file mode 100644 index 0000000..cac3ee1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45426_clickhouse_service_quantity_quota_usage_in_zookeeper_exceeds_the_threshold.rst @@ -0,0 +1,146 @@ +:original_name: ALM-45426.html + +.. _ALM-45426: + +ALM-45426 ClickHouse Service Quantity Quota Usage in ZooKeeper Exceeds the Threshold +==================================================================================== + +Description +----------- + +The alarm module checks the quota usage of the ClickHouse service in the ZooKeeper every 60 seconds. This alarm is generated when the alarm module detects that the usage exceeds the threshold (90%). + +This alarm is cleared when the system detects that the usage is lower than the threshold and the alarm is cleared. + +Attribute +--------- + +======== =============== ========== +Alarm ID Alarm Severity Auto Clear +======== =============== ========== +45426 Major (default) Yes +======== =============== ========== + +Parameters +---------- + ++-------------+-------------------------------------------------------------------+ +| Name | Meaning | ++=============+===================================================================+ +| Source | Specifies the cluster or system for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+-------------------------------------------------------------------+ + +Impact on the System +-------------------- + +After the ZooKeeper quantity quota of the ClickHouse service exceeds the threshold, you cannot perform cluster operations on the ClickHouse service on FusionInsight Manager. As a result, the ClickHouse service cannot be used. + +Possible Causes +--------------- + +- When table data is created, inserted, or deleted, the ClickHouse creates znodes on ZooKeeper nodes. As the service volume increases, the number of znodes may exceed the configured threshold. +- No quota limit is set for the metadata directory **/clickhouse** of ClickHouse in ZooKeeper. + +Procedure +--------- + +**Check the number of znodes created by ClickHouse on ZooKeeper.** + +#. .. _alm-45426__li19429132053415: + + Log in to the host where the ZooKeeper client is located and log in to the ZooKeeper client. + + Switch to the client installation directory. 
+ + Example: **cd /opt/client** + + Run the following command to configure environment variables: + + **source bigdata_env** + + Run the following command to authenticate the user (skip this step in common mode): + + **kinit** *Component service user* + + Run the following command to log in to the client tool: + + **zkCli.sh -server** *service IP address of the node where the ZooKeeper role instance locates*\ **:**\ *client port* + +#. Run the following command to check the quota used by the ClickHouse in the ZooKeeper and check whether the quota information is correctly set: + + **listquota /clickhouse** + + .. code-block:: + + absolute path is /zookeeper/quota/clickhouse + Quota for path /clickhouse does not exist. + + If the preceding information indicates that the quota configuration is incorrect, go to :ref:`3 `. + + If no, go to :ref:`5 `. + +#. .. _alm-45426__li17669171018349: + + Log in to FusionInsight Manager and choose **Cluster** > **Services** > **ZooKeeper**. On the displayed page, click **Configurations** and click **All Configurations**. On this sub-tab page, search for **quotas.auto.check.enable** to check whether its value is **true**. + + If the value is not **true**, change the value to **true** and click **Save**. + +#. On FusionInsight Manager, choose **Cluster** > **Services** > **ClickHouse**, click **More**, and select **Synchronize Configuration**. After the synchronization is successful, go to :ref:`1 `. + +#. .. _alm-45426__li169092437415: + + Run the following command and check whether the ratio of the **count** value of **Output stat** to the **count** value of **Output quota** in the command output is greater than **0.9**: + + **listquota /clickhouse** + + .. code-block:: + + absolute path is /zookeeper/quota/clickhouse + Output quota for /clickhouse count=200000,bytes=1000000000 + Output stat for /clickhouse count=2667,bytes=60063 + + In the preceding information, the **count** value of **Output stat** is **2667**, and the **count** value of **Output quota** is **200000**. + + - If yes, go to :ref:`6 `. + - If no, check whether the alarm is cleared 5 minutes later. If the alarm persists, go to :ref:`8 `. + +#. .. _alm-45426__li2547532101515: + + On FusionInsight Manager, choose **Cluster** > **Services** > **ClickHouse** > **Configurations** > **All Configurations**, search for the **clickhouse.zookeeper.quota.node.count** parameter, and change the value of this parameter to twice the **count** value of **Output stat** in :ref:`5 `. + +#. Restart the ClickHouse instance for which the alarm is generated, and check whether the alarm is cleared 5 minutes later. + + - If yes, no further action is required. + - If no, perform :ref:`6 ` again, and check whether the alarm is cleared 5 minutes later. If the alarm persists, go to :ref:`8 `. + +**Collect the fault information.** + +8. .. _alm-45426__li93321856123314: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +9. Expand the **Service** drop-down list, and select **ClickHouse** for the target cluster. + +10. Choose the corresponding host form the host list. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. 
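+
+For reference, the following sketch shows how the new **clickhouse.zookeeper.quota.node.count** value in the preceding procedure is derived from the **listquota** output; the figures are hypothetical and are not taken from a real cluster.
+
+.. code-block::
+
+   # Hypothetical output of "listquota /clickhouse" when the alarm is raised:
+   #   Output quota for /clickhouse count=200000,bytes=1000000000
+   #   Output stat for /clickhouse count=185000,bytes=60063
+   # Usage ratio: 185000 / 200000 = 0.925, which is greater than 0.9.
+   # New value of clickhouse.zookeeper.quota.node.count: 2 x 185000 = 370000.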
+ +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0295310731.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45427_clickhouse_service_capacity_quota_usage_in_zookeeper_exceeds_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45427_clickhouse_service_capacity_quota_usage_in_zookeeper_exceeds_the_threshold.rst new file mode 100644 index 0000000..ae15706 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45427_clickhouse_service_capacity_quota_usage_in_zookeeper_exceeds_the_threshold.rst @@ -0,0 +1,142 @@ +:original_name: ALM-45427.html + +.. _ALM-45427: + +ALM-45427 ClickHouse Service Capacity Quota Usage in ZooKeeper Exceeds the Threshold +==================================================================================== + +Description +----------- + +The alarm module checks the capacity quota usage of the ClickHouse service in ZooKeeper every 60 seconds. This alarm is generated when the alarm module detects that the usage exceeds the threshold (90%). + +This alarm is cleared when the system detects that the usage falls below the threshold. + +Attribute +--------- + +======== =============== ========== +Alarm ID Alarm Severity Auto Clear +======== =============== ========== +45427 Major (default) Yes +======== =============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +After the ZooKeeper capacity quota usage of the ClickHouse service exceeds the threshold, you cannot perform cluster operations on the ClickHouse service on FusionInsight Manager. As a result, the ClickHouse service cannot be used. + +Possible Causes +--------------- + +- When table data is created, inserted, or deleted, ClickHouse creates znodes on ZooKeeper nodes. As the service volume increases, the capacity of znodes may exceed the configured threshold. +- No quota limit is set for the metadata directory **/clickhouse** of ClickHouse in ZooKeeper. + +Procedure +--------- + +**Check the znode capacity used by ClickHouse in ZooKeeper.** + +#. .. _alm-45427__li19429132053415: + + Log in to the host where the ZooKeeper client is located and log in to the ZooKeeper client. + + Switch to the client installation directory. + + Example: **cd /opt/client** + + Run the following command to configure environment variables: + + **source bigdata_env** + + Run the following command to authenticate the user (skip this step in common mode): + + **kinit** *Component service user* + + Run the following command to log in to the client tool: + + **zkCli.sh -server** *service IP address of the node where the ZooKeeper role instance is located*\ **:**\ *client port* + +#. 
Run the following command to check the quota used by the ClickHouse in the ZooKeeper and check whether the quota information is correctly set: + + **listquota /clickhouse** + + .. code-block:: + + absolute path is /zookeeper/quota/clickhouse + Quota for path /clickhouse does not exist. + + - If the preceding information indicates that the quota configuration is incorrect, go to :ref:`3 `. + - If not, go to :ref:`5 `. + +#. .. _alm-45427__li17669171018349: + + Log in to FusionInsight Manager and choose **Cluster** > **Services** > **ZooKeeper**. On the displayed page, click **Configurations** and click **All Configurations**. On this sub-tab page, search for **quotas.auto.check.enable** to check whether its value is **true**. + + If the value is not **true**, change the value to **true** and click **Save**. + +#. On FusionInsight Manager, choose **Cluster** > **Services** > **ClickHouse**, click **More**, and select **Synchronize Configuration**. After the synchronization is successful, go to :ref:`1 `. + +#. .. _alm-45427__li10833143016438: + + Run the following command and check whether the ratio of the **bytes** value of **Output stat** to the **bytes** value of **Output quota** in the command output is greater than **0.9**: + + **listquota /clickhouse** + + .. code-block:: + + absolute path is /zookeeper/quota/clickhouse + Output quota for /clickhouse count=200000,bytes=1000000000 + Output stat for /clickhouse count=2667,bytes=60063 + + In the preceding information, the **bytes** value of **Output stat** is **60063**, and the **bytes** value of **Output quota** is **1000000000**. + + - If yes, go to :ref:`6 `. + - If no, check whether the alarm is cleared 5 minutes later. If the alarm persists, go to :ref:`8 `. + +#. .. _alm-45427__li157515124315: + + On FusionInsight Manager, choose **Cluster** > **Services** > **ClickHouse** > **Configurations** > **All Configurations**, search for the **clickhouse.zookeeper.quota.size** parameter, and change the value of this parameter to twice the **bytes** value of **Output stat** in :ref:`5 `. + +#. Restart the ClickHouse instance for which the alarm is generated, and check whether the alarm is cleared 5 minutes later. + + - If yes, no further action is required. + - If no, perform :ref:`6 ` again, and check whether the alarm is cleared 5 minutes later. If the alarm persists, go to :ref:`8 `. + +**Collect the fault information.** + +8. .. _alm-45427__li1460181994211: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +9. Expand the **Service** drop-down list, and select **ClickHouse** for the target cluster. + +10. Choose the corresponding host form the host list. + +11. Click |image1| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +12. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Related Information +------------------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0295706662.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45428_clickhouse_disk_i_o_exception.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45428_clickhouse_disk_i_o_exception.rst new file mode 100644 index 0000000..28d6ba8 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/alm-45428_clickhouse_disk_i_o_exception.rst @@ -0,0 +1,99 @@ +:original_name: ALM-45428.html + +.. _ALM-45428: + +ALM-45428 ClickHouse Disk I/O Exception +======================================= + +Description +----------- + +This alarm is generated when the alarm module detects EIO or EROFS errors during ClickHouse read and write every 60 seconds. + +Attribute +--------- + +======== =============== ========== +Alarm ID Alarm Severity Auto Clear +======== =============== ========== +45428 Major (default) No +======== =============== ========== + +Parameters +---------- + +=========== ======================================================= +Name Meaning +=========== ======================================================= +Source Specifies the cluster for which the alarm is generated. +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +- ClickHouse fails to read and write data. The INSERT, SELECT, and CREATE operations on the local tables may be abnormal. Distributed tables are not affected. +- Services are affected, and I/Os fail. + +Possible Causes +--------------- + +The disk is aged or has bad sectors. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Alarm** > **Alarms** > **ALM-45428 ClickHouse Disk I/O Exception**. Check the role name and the IP address of the host where the alarm is generated in **Location**. + +#. Use PuTTY to log in to the node for which the fault is generated as user **root**. + +#. Run the **df -h** command to check the mount directory and find the disk mounted to the faulty directory. + +#. Run the **smartctl -a /dev/sd\*** command to check disks. + + - If **SMART Health Status: OK** is displayed, as shown in the following figure, the disk is healthy. In this case, go to :ref:`7 `. + + |image1| + + - If the number following **Elements in grown defect list** is not 0, as shown in the following figure, the disk may have bad sectors. If **SMART Health Status: FAILURE** is displayed, the disk is in the sub-health state. In this case, go to :ref:`5 `. + + |image2| + +#. .. _alm-45428__li838232512215: + + Rectify the fault by following the instructions provided in "Hard Disk Mounted to the ClickHouse Partition Directory Is Faulty" in . + +#. After the fault is rectified, manually clear the alarm on FusionInsight Manager and check whether the alarm is generated again during the periodic check. + + - If yes, go to :ref:`7 `. + - If no, no further action is required. + +**Collect the fault information.** + +7. .. _alm-45428__li186532017115614: + + On FusionInsight Manager, choose **O&M**. In the navigation pane on the left, choose **Log** > **Download**. + +8. Expand the **Service** drop-down list, and select **ClickHouse** for the target cluster. + +9. 
Choose the corresponding host form the host list. + +10. Click |image3| in the upper right corner, and set **Start Date** and **End Date** for log collection to 1 hour ahead of and after the alarm generation time, respectively. Then, click **Download**. + +11. Contact O&M personnel and provide the collected logs. + +Alarm Clearing +-------------- + +If the alarm has no impact, manually clear the alarm. + +Related Information +------------------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001194201259.png +.. |image2| image:: /_static/images/en-us_image_0000001194201487.png +.. |image3| image:: /_static/images/en-us_image_0000001194317737.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/index.rst new file mode 100644 index 0000000..1f4d5a8 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/alarm_reference_applicable_to_mrs_3.x/index.rst @@ -0,0 +1,504 @@ +:original_name: mrs_01_1298.html + +.. _mrs_01_1298: + +Alarm Reference (Applicable to MRS 3.\ *x*) +=========================================== + +- :ref:`ALM-12001 Audit Log Dumping Failure ` +- :ref:`ALM-12004 OLdap Resource Abnormal ` +- :ref:`ALM-12005 OKerberos Resource Abnormal ` +- :ref:`ALM-12006 Node Fault ` +- :ref:`ALM-12007 Process Fault ` +- :ref:`ALM-12010 Manager Heartbeat Interruption Between the Active and Standby Nodes ` +- :ref:`ALM-12011 Manager Data Synchronization Exception Between the Active and Standby Nodes ` +- :ref:`ALM-12014 Partition Lost ` +- :ref:`ALM-12015 Partition Filesystem Readonly ` +- :ref:`ALM-12016 CPU Usage Exceeds the Threshold ` +- :ref:`ALM-12017 Insufficient Disk Capacity ` +- :ref:`ALM-12018 Memory Usage Exceeds the Threshold ` +- :ref:`ALM-12027 Host PID Usage Exceeds the Threshold ` +- :ref:`ALM-12028 Number of Processes in the D State on a Host Exceeds the Threshold ` +- :ref:`ALM-12033 Slow Disk Fault ` +- :ref:`ALM-12034 Periodical Backup Failure ` +- :ref:`ALM-12035 Unknown Data Status After Recovery Task Failure ` +- :ref:`ALM-12038 Monitoring Indicator Dumping Failure ` +- :ref:`ALM-12039 Active/Standby OMS Databases Not Synchronized ` +- :ref:`ALM-12040 Insufficient System Entropy ` +- :ref:`ALM-12041 Incorrect Permission on Key Files ` +- :ref:`ALM-12042 Incorrect Configuration of Key Files ` +- :ref:`ALM-12045 Read Packet Dropped Rate Exceeds the Threshold ` +- :ref:`ALM-12046 Write Packet Dropped Rate Exceeds the Threshold ` +- :ref:`ALM-12047 Read Packet Error Rate Exceeds the Threshold ` +- :ref:`ALM-12048 Write Packet Error Rate Exceeds the Threshold ` +- :ref:`ALM-12049 Network Read Throughput Rate Exceeds the Threshold ` +- :ref:`ALM-12050 Network Write Throughput Rate Exceeds the Threshold ` +- :ref:`ALM-12051 Disk Inode Usage Exceeds the Threshold ` +- :ref:`ALM-12052 TCP Temporary Port Usage Exceeds the Threshold ` +- :ref:`ALM-12053 Host File Handle Usage Exceeds the Threshold ` +- :ref:`ALM-12054 Invalid Certificate File ` +- :ref:`ALM-12055 The Certificate File Is About to Expire ` +- :ref:`ALM-12057 Metadata Not Configured with the Task to Periodically Back Up Data to a Third-Party Server ` +- :ref:`ALM-12061 Process Usage Exceeds the Threshold ` +- :ref:`ALM-12062 OMS Parameter Configurations Mismatch with the Cluster Scale ` +- :ref:`ALM-12063 Unavailable Disk ` +- :ref:`ALM-12064 Host Random Port Range Conflicts with Cluster 
Used Port ` +- :ref:`ALM-12066 Trust Relationships Between Nodes Become Invalid ` +- :ref:`ALM-12067 Tomcat Resource Is Abnormal ` +- :ref:`ALM-12068 ACS Resource Exception ` +- :ref:`ALM-12069 AOS Resource Exception ` +- :ref:`ALM-12070 Controller Resource Is Abnormal ` +- :ref:`ALM-12071 Httpd Resource Is Abnormal ` +- :ref:`ALM-12072 FloatIP Resource Is Abnormal ` +- :ref:`ALM-12073 CEP Resource Is Abnormal ` +- :ref:`ALM-12074 FMS Resource Is Abnormal ` +- :ref:`ALM-12075 PMS Resource Is Abnormal ` +- :ref:`ALM-12076 GaussDB Resource Is Abnormal ` +- :ref:`ALM-12077 User omm Expired ` +- :ref:`ALM-12078 Password of User omm Expired ` +- :ref:`ALM-12079 User omm Is About to Expire ` +- :ref:`ALM-12080 Password of User omm Is About to Expire ` +- :ref:`ALM-12081 User ommdba Expired ` +- :ref:`ALM-12082 User ommdba Is About to Expire ` +- :ref:`ALM-12083 Password of User ommdba Is About to Expire ` +- :ref:`ALM-12084 Password of User ommdba Expired ` +- :ref:`ALM-12085 Service Audit Log Dump Failure ` +- :ref:`ALM-12087 System Is in the Upgrade Observation Period ` +- :ref:`ALM-12089 Inter-Node Network Is Abnormal ` +- :ref:`ALM-12101 AZ Unhealthy ` +- :ref:`ALM-12102 AZ HA Component Is Not Deployed Based on DR Requirements ` +- :ref:`ALM-12110 Failed to get ECS temporary AK/SK ` +- :ref:`ALM-12180 Suspended Disk I/O ` +- :ref:`ALM-13000 ZooKeeper Service Unavailable ` +- :ref:`ALM-13001 Available ZooKeeper Connections Are Insufficient ` +- :ref:`ALM-13002 ZooKeeper Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-13003 GC Duration of the ZooKeeper Process Exceeds the Threshold ` +- :ref:`ALM-13004 ZooKeeper Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-13005 Failed to Set the Quota of Top Directories of ZooKeeper Components ` +- :ref:`ALM-13006 Znode Number or Capacity Exceeds the Threshold ` +- :ref:`ALM-13007 Available ZooKeeper Client Connections Are Insufficient ` +- :ref:`ALM-13008 ZooKeeper Znode Usage Exceeds the Threshold ` +- :ref:`ALM-13009 ZooKeeper Znode Capacity Usage Exceeds the Threshold ` +- :ref:`ALM-13010 Znode Usage of a Directory with Quota Configured Exceeds the Threshold ` +- :ref:`ALM-14000 HDFS Service Unavailable ` +- :ref:`ALM-14001 HDFS Disk Usage Exceeds the Threshold ` +- :ref:`ALM-14002 DataNode Disk Usage Exceeds the Threshold ` +- :ref:`ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold ` +- :ref:`ALM-14006 Number of HDFS Files Exceeds the Threshold ` +- :ref:`ALM-14007 NameNode Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-14008 DataNode Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-14009 Number of Dead DataNodes Exceeds the Threshold ` +- :ref:`ALM-14010 NameService Service Is Abnormal ` +- :ref:`ALM-14011 DataNode Data Directory Is Not Configured Properly ` +- :ref:`ALM-14012 JournalNode Is Out of Synchronization ` +- :ref:`ALM-14013 Failed to Update the NameNode FsImage File ` +- :ref:`ALM-14014 NameNode GC Time Exceeds the Threshold ` +- :ref:`ALM-14015 DataNode GC Time Exceeds the Threshold ` +- :ref:`ALM-14016 DataNode Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-14017 NameNode Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-14018 NameNode Non-heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-14019 DataNode Non-heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-14020 Number of Entries in the HDFS Directory Exceeds the Threshold ` +- :ref:`ALM-14021 NameNode Average RPC Processing Time Exceeds the Threshold ` +- :ref:`ALM-14022 NameNode Average RPC Queuing Time Exceeds the 
Threshold ` +- :ref:`ALM-14023 Percentage of Total Reserved Disk Space for Replicas Exceeds the Threshold ` +- :ref:`ALM-14024 Tenant Space Usage Exceeds the Threshold ` +- :ref:`ALM-14025 Tenant File Object Usage Exceeds the Threshold ` +- :ref:`ALM-14026 Blocks on DataNode Exceed the Threshold ` +- :ref:`ALM-14027 DataNode Disk Fault ` +- :ref:`ALM-14028 Number of Blocks to Be Supplemented Exceeds the Threshold ` +- :ref:`ALM-14029 Number of Blocks in a Replica Exceeds the Threshold ` +- :ref:`ALM-14030 HDFS Allows Write of Single-Replica Data ` +- :ref:`ALM-16000 Percentage of Sessions Connected to the HiveServer to Maximum Number Allowed Exceeds the Threshold ` +- :ref:`ALM-16001 Hive Warehouse Space Usage Exceeds the Threshold ` +- :ref:`ALM-16002 Hive SQL Execution Success Rate Is Lower Than the Threshold ` +- :ref:`ALM-16003 Background Thread Usage Exceeds the Threshold ` +- :ref:`ALM-16004 Hive Service Unavailable ` +- :ref:`ALM-16005 The Heap Memory Usage of the Hive Process Exceeds the Threshold ` +- :ref:`ALM-16006 The Direct Memory Usage of the Hive Process Exceeds the Threshold ` +- :ref:`ALM-16007 Hive GC Time Exceeds the Threshold ` +- :ref:`ALM-16008 Non-Heap Memory Usage of the Hive Process Exceeds the Threshold ` +- :ref:`ALM-16009 Map Number Exceeds the Threshold ` +- :ref:`ALM-16045 Hive Data Warehouse Is Deleted ` +- :ref:`ALM-16046 Hive Data Warehouse Permission Is Modified ` +- :ref:`ALM-16047 HiveServer Has Been Deregistered from ZooKeeper ` +- :ref:`ALM-16048 Tez or Spark Library Path Does Not Exist ` +- :ref:`ALM-17003 Oozie Service Unavailable ` +- :ref:`ALM-17004 Oozie Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-17005 Oozie Non Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-17006 Oozie Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-17007 Garbage Collection (GC) Time of the Oozie Process Exceeds the Threshold ` +- :ref:`ALM-18000 Yarn Service Unavailable ` +- :ref:`ALM-18002 NodeManager Heartbeat Lost ` +- :ref:`ALM-18003 NodeManager Unhealthy ` +- :ref:`ALM-18008 Heap Memory Usage of ResourceManager Exceeds the Threshold ` +- :ref:`ALM-18009 Heap Memory Usage of JobHistoryServer Exceeds the Threshold ` +- :ref:`ALM-18010 ResourceManager GC Time Exceeds the Threshold ` +- :ref:`ALM-18011 NodeManager GC Time Exceeds the Threshold ` +- :ref:`ALM-18012 JobHistoryServer GC Time Exceeds the Threshold ` +- :ref:`ALM-18013 ResourceManager Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-18014 NodeManager Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-18015 JobHistoryServer Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-18016 Non Heap Memory Usage of ResourceManager Exceeds the Threshold ` +- :ref:`ALM-18017 Non Heap Memory Usage of NodeManager Exceeds the Threshold ` +- :ref:`ALM-18018 NodeManager Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-18019 Non Heap Memory Usage of JobHistoryServer Exceeds the Threshold ` +- :ref:`ALM-18020 Yarn Task Execution Timeout ` +- :ref:`ALM-18021 Mapreduce Service Unavailable ` +- :ref:`ALM-18022 Insufficient Yarn Queue Resources ` +- :ref:`ALM-18023 Number of Pending Yarn Tasks Exceeds the Threshold ` +- :ref:`ALM-18024 Pending Yarn Memory Usage Exceeds the Threshold ` +- :ref:`ALM-18025 Number of Terminated Yarn Tasks Exceeds the Threshold ` +- :ref:`ALM-18026 Number of Failed Yarn Tasks Exceeds the Threshold ` +- :ref:`ALM-19000 HBase Service Unavailable ` +- :ref:`ALM-19006 HBase Replication Sync Failed ` +- :ref:`ALM-19007 HBase GC Time Exceeds the Threshold ` +- 
:ref:`ALM-19008 Heap Memory Usage of the HBase Process Exceeds the Threshold ` +- :ref:`ALM-19009 Direct Memory Usage of the HBase Process Exceeds the Threshold ` +- :ref:`ALM-19011 RegionServer Region Number Exceeds the Threshold ` +- :ref:`ALM-19012 HBase System Table Directory or File Lost ` +- :ref:`ALM-19013 Duration of Regions in transaction State Exceeds the Threshold ` +- :ref:`ALM-19014 Capacity Quota Usage on ZooKeeper Exceeds the Threshold Severely ` +- :ref:`ALM-19015 Quantity Quota Usage on ZooKeeper Exceeds the Threshold ` +- :ref:`ALM-19016 Quantity Quota Usage on ZooKeeper Exceeds the Threshold Severely ` +- :ref:`ALM-19017 Capacity Quota Usage on ZooKeeper Exceeds the Threshold ` +- :ref:`ALM-19018 HBase Compaction Queue Size Exceeds the Threshold ` +- :ref:`ALM-19019 Number of HBase HFiles to Be Synchronized Exceeds the Threshold ` +- :ref:`ALM-19020 Number of HBase WAL Files to Be Synchronized Exceeds the Threshold ` +- :ref:`ALM-20002 Hue Service Unavailable ` +- :ref:`ALM-24000 Flume Service Unavailable ` +- :ref:`ALM-24001 Flume Agent Exception ` +- :ref:`ALM-24003 Flume Client Connection Interrupted ` +- :ref:`ALM-24004 Exception Occurs When Flume Reads Data ` +- :ref:`ALM-24005 Exception Occurs When Flume Transmits Data ` +- :ref:`ALM-24006 Heap Memory Usage of Flume Server Exceeds the Threshold ` +- :ref:`ALM-24007 Flume Server Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-24008 Flume Server Non Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-24009 Flume Server Garbage Collection (GC) Time Exceeds the Threshold ` +- :ref:`ALM-24010 Flume Certificate File Is Invalid or Damaged ` +- :ref:`ALM-24011 Flume Certificate File Is About to Expire ` +- :ref:`ALM-24012 Flume Certificate File Has Expired ` +- :ref:`ALM-24013 Flume MonitorServer Certificate File Is Invalid or Damaged ` +- :ref:`ALM-24014 Flume MonitorServer Certificate Is About to Expire ` +- :ref:`ALM-24015 Flume MonitorServer Certificate File Has Expired ` +- :ref:`ALM-25000 LdapServer Service Unavailable ` +- :ref:`ALM-25004 Abnormal LdapServer Data Synchronization ` +- :ref:`ALM-25005 nscd Service Exception ` +- :ref:`ALM-25006 Sssd Service Exception ` +- :ref:`ALM-25500 KrbServer Service Unavailable ` +- :ref:`ALM-26051 Storm Service Unavailable ` +- :ref:`ALM-26052 Number of Available Supervisors of the Storm Service Is Less Than the Threshold ` +- :ref:`ALM-26053 Storm Slot Usage Exceeds the Threshold ` +- :ref:`ALM-26054 Nimbus Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-27001 DBService Service Unavailable ` +- :ref:`ALM-27003 DBService Heartbeat Interruption Between the Active and Standby Nodes ` +- :ref:`ALM-27004 Data Inconsistency Between Active and Standby DBServices ` +- :ref:`ALM-27005 Database Connections Usage Exceeds the Threshold ` +- :ref:`ALM-27006 Disk Space Usage of the Data Directory Exceeds the Threshold ` +- :ref:`ALM-27007 Database Enters the Read-Only Mode ` +- :ref:`ALM-38000 Kafka Service Unavailable ` +- :ref:`ALM-38001 Insufficient Kafka Disk Capacity ` +- :ref:`ALM-38002 Kafka Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-38004 Kafka Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-38005 GC Duration of the Broker Process Exceeds the Threshold ` +- :ref:`ALM-38006 Percentage of Kafka Partitions That Are Not Completely Synchronized Exceeds the Threshold ` +- :ref:`ALM-38007 Status of Kafka Default User Is Abnormal ` +- :ref:`ALM-38008 Abnormal Kafka Data Directory Status ` +- :ref:`ALM-38009 Busy Broker Disk I/Os (Applicable to 
Versions Later Than MRS 3.1.0) ` +- :ref:`ALM-38010 Topics with Single Replica ` +- :ref:`ALM-43001 Spark2x Service Unavailable ` +- :ref:`ALM-43006 Heap Memory Usage of the JobHistory2x Process Exceeds the Threshold ` +- :ref:`ALM-43007 Non-Heap Memory Usage of the JobHistory2x Process Exceeds the Threshold ` +- :ref:`ALM-43008 The Direct Memory Usage of the JobHistory2x Process Exceeds the Threshold ` +- :ref:`ALM-43009 JobHistory2x Process GC Time Exceeds the Threshold ` +- :ref:`ALM-43010 Heap Memory Usage of the JDBCServer2x Process Exceeds the Threshold ` +- :ref:`ALM-43011 Non-Heap Memory Usage of the JDBCServer2x Process Exceeds the Threshold ` +- :ref:`ALM-43012 Direct Heap Memory Usage of the JDBCServer2x Process Exceeds the Threshold ` +- :ref:`ALM-43013 JDBCServer2x Process GC Time Exceeds the Threshold ` +- :ref:`ALM-43017 JDBCServer2x Process Full GC Number Exceeds the Threshold ` +- :ref:`ALM-43018 JobHistory2x Process Full GC Number Exceeds the Threshold ` +- :ref:`ALM-43019 Heap Memory Usage of the IndexServer2x Process Exceeds the Threshold ` +- :ref:`ALM-43020 Non-Heap Memory Usage of the IndexServer2x Process Exceeds the Threshold ` +- :ref:`ALM-43021 Direct Memory Usage of the IndexServer2x Process Exceeds the Threshold ` +- :ref:`ALM-43022 IndexServer2x Process GC Time Exceeds the Threshold ` +- :ref:`ALM-43023 IndexServer2x Process Full GC Number Exceeds the Threshold ` +- :ref:`ALM-44004 Presto Coordinator Resource Group Queuing Tasks Exceed the Threshold ` +- :ref:`ALM-44005 Presto Coordinator Process GC Time Exceeds the Threshold ` +- :ref:`ALM-44006 Presto Worker Process GC Time Exceeds the Threshold ` +- :ref:`ALM-45175 Average Time for Calling OBS Metadata APIs Is Greater than the Threshold ` +- :ref:`ALM-45176 Success Rate of Calling OBS Metadata APIs Is Lower than the Threshold ` +- :ref:`ALM-45177 Success Rate of Calling OBS Data Read APIs Is Lower than the Threshold ` +- :ref:`ALM-45178 Success Rate of Calling OBS Data Write APIs Is Lower Than the Threshold ` +- :ref:`ALM-45179 Number of Failed OBS readFully API Calls Exceeds the Threshold ` +- :ref:`ALM-45180 Number of Failed OBS read API Calls Exceeds the Threshold ` +- :ref:`ALM-45181 Number of Failed OBS write API Calls Exceeds the Threshold ` +- :ref:`ALM-45182 Number of Throttled OBS API Calls Exceeds the Threshold ` +- :ref:`ALM-45275 Ranger Service Unavailable ` +- :ref:`ALM-45276 Abnormal RangerAdmin Status ` +- :ref:`ALM-45277 RangerAdmin Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-45278 RangerAdmin Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-45279 RangerAdmin Non Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-45280 RangerAdmin GC Duration Exceeds the Threshold ` +- :ref:`ALM-45281 UserSync Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-45282 UserSync Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-45283 UserSync Non Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-45284 UserSync Garbage Collection (GC) Time Exceeds the Threshold ` +- :ref:`ALM-45285 TagSync Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-45286 TagSync Direct Memory Usage Exceeds the Threshold ` +- :ref:`ALM-45287 TagSync Non Heap Memory Usage Exceeds the Threshold ` +- :ref:`ALM-45288 TagSync Garbage Collection (GC) Time Exceeds the Threshold ` +- :ref:`ALM-45425 ClickHouse Service Unavailable ` +- :ref:`ALM-45426 ClickHouse Service Quantity Quota Usage in ZooKeeper Exceeds the Threshold ` +- :ref:`ALM-45427 ClickHouse Service Capacity Quota Usage in ZooKeeper Exceeds 
the Threshold ` +- :ref:`ALM-45428 ClickHouse Disk I/O Exception ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + alm-12001_audit_log_dumping_failure + alm-12004_oldap_resource_abnormal + alm-12005_okerberos_resource_abnormal + alm-12006_node_fault + alm-12007_process_fault + alm-12010_manager_heartbeat_interruption_between_the_active_and_standby_nodes + alm-12011_manager_data_synchronization_exception_between_the_active_and_standby_nodes + alm-12014_partition_lost + alm-12015_partition_filesystem_readonly + alm-12016_cpu_usage_exceeds_the_threshold + alm-12017_insufficient_disk_capacity + alm-12018_memory_usage_exceeds_the_threshold + alm-12027_host_pid_usage_exceeds_the_threshold + alm-12028_number_of_processes_in_the_d_state_on_a_host_exceeds_the_threshold + alm-12033_slow_disk_fault + alm-12034_periodical_backup_failure + alm-12035_unknown_data_status_after_recovery_task_failure + alm-12038_monitoring_indicator_dumping_failure + alm-12039_active_standby_oms_databases_not_synchronized + alm-12040_insufficient_system_entropy + alm-12041_incorrect_permission_on_key_files + alm-12042_incorrect_configuration_of_key_files + alm-12045_read_packet_dropped_rate_exceeds_the_threshold + alm-12046_write_packet_dropped_rate_exceeds_the_threshold + alm-12047_read_packet_error_rate_exceeds_the_threshold + alm-12048_write_packet_error_rate_exceeds_the_threshold + alm-12049_network_read_throughput_rate_exceeds_the_threshold + alm-12050_network_write_throughput_rate_exceeds_the_threshold + alm-12051_disk_inode_usage_exceeds_the_threshold + alm-12052_tcp_temporary_port_usage_exceeds_the_threshold + alm-12053_host_file_handle_usage_exceeds_the_threshold + alm-12054_invalid_certificate_file + alm-12055_the_certificate_file_is_about_to_expire + alm-12057_metadata_not_configured_with_the_task_to_periodically_back_up_data_to_a_third-party_server + alm-12061_process_usage_exceeds_the_threshold + alm-12062_oms_parameter_configurations_mismatch_with_the_cluster_scale + alm-12063_unavailable_disk + alm-12064_host_random_port_range_conflicts_with_cluster_used_port + alm-12066_trust_relationships_between_nodes_become_invalid + alm-12067_tomcat_resource_is_abnormal + alm-12068_acs_resource_exception + alm-12069_aos_resource_exception + alm-12070_controller_resource_is_abnormal + alm-12071_httpd_resource_is_abnormal + alm-12072_floatip_resource_is_abnormal + alm-12073_cep_resource_is_abnormal + alm-12074_fms_resource_is_abnormal + alm-12075_pms_resource_is_abnormal + alm-12076_gaussdb_resource_is_abnormal + alm-12077_user_omm_expired + alm-12078_password_of_user_omm_expired + alm-12079_user_omm_is_about_to_expire + alm-12080_password_of_user_omm_is_about_to_expire + alm-12081user_ommdba_expired + alm-12082_user_ommdba_is_about_to_expire + alm-12083_password_of_user_ommdba_is_about_to_expire + alm-12084_password_of_user_ommdba_expired + alm-12085_service_audit_log_dump_failure + alm-12087_system_is_in_the_upgrade_observation_period + alm-12089_inter-node_network_is_abnormal + alm-12101_az_unhealthy + alm-12102_az_ha_component_is_not_deployed_based_on_dr_requirements + alm-12110_failed_to_get_ecs_temporary_ak_sk + alm-12180_suspended_disk_i_o + alm-13000_zookeeper_service_unavailable + alm-13001_available_zookeeper_connections_are_insufficient + alm-13002_zookeeper_direct_memory_usage_exceeds_the_threshold + alm-13003_gc_duration_of_the_zookeeper_process_exceeds_the_threshold + alm-13004_zookeeper_heap_memory_usage_exceeds_the_threshold + alm-13005_failed_to_set_the_quota_of_top_directories_of_zookeeper_components + 
alm-13006_znode_number_or_capacity_exceeds_the_threshold + alm-13007_available_zookeeper_client_connections_are_insufficient + alm-13008_zookeeper_znode_usage_exceeds_the_threshold + alm-13009_zookeeper_znode_capacity_usage_exceeds_the_threshold + alm-13010_znode_usage_of_a_directory_with_quota_configured_exceeds_the_threshold + alm-14000_hdfs_service_unavailable + alm-14001_hdfs_disk_usage_exceeds_the_threshold + alm-14002_datanode_disk_usage_exceeds_the_threshold + alm-14003_number_of_lost_hdfs_blocks_exceeds_the_threshold + alm-14006_number_of_hdfs_files_exceeds_the_threshold + alm-14007_namenode_heap_memory_usage_exceeds_the_threshold + alm-14008_datanode_heap_memory_usage_exceeds_the_threshold + alm-14009_number_of_dead_datanodes_exceeds_the_threshold + alm-14010_nameservice_service_is_abnormal + alm-14011_datanode_data_directory_is_not_configured_properly + alm-14012_journalnode_is_out_of_synchronization + alm-14013_failed_to_update_the_namenode_fsimage_file + alm-14014_namenode_gc_time_exceeds_the_threshold + alm-14015_datanode_gc_time_exceeds_the_threshold + alm-14016_datanode_direct_memory_usage_exceeds_the_threshold + alm-14017_namenode_direct_memory_usage_exceeds_the_threshold + alm-14018_namenode_non-heap_memory_usage_exceeds_the_threshold + alm-14019_datanode_non-heap_memory_usage_exceeds_the_threshold + alm-14020_number_of_entries_in_the_hdfs_directory_exceeds_the_threshold + alm-14021_namenode_average_rpc_processing_time_exceeds_the_threshold + alm-14022_namenode_average_rpc_queuing_time_exceeds_the_threshold + alm-14023_percentage_of_total_reserved_disk_space_for_replicas_exceeds_the_threshold + alm-14024_tenant_space_usage_exceeds_the_threshold + alm-14025_tenant_file_object_usage_exceeds_the_threshold + alm-14026_blocks_on_datanode_exceed_the_threshold + alm-14027_datanode_disk_fault + alm-14028_number_of_blocks_to_be_supplemented_exceeds_the_threshold + alm-14029_number_of_blocks_in_a_replica_exceeds_the_threshold + alm-14030_hdfs_allows_write_of_single-replica_data + alm-16000_percentage_of_sessions_connected_to_the_hiveserver_to_maximum_number_allowed_exceeds_the_threshold + alm-16001_hive_warehouse_space_usage_exceeds_the_threshold + alm-16002_hive_sql_execution_success_rate_is_lower_than_the_threshold + alm-16003_background_thread_usage_exceeds_the_threshold + alm-16004_hive_service_unavailable + alm-16005_the_heap_memory_usage_of_the_hive_process_exceeds_the_threshold + alm-16006_the_direct_memory_usage_of_the_hive_process_exceeds_the_threshold + alm-16007_hive_gc_time_exceeds_the_threshold + alm-16008_non-heap_memory_usage_of_the_hive_process_exceeds_the_threshold + alm-16009_map_number_exceeds_the_threshold + alm-16045_hive_data_warehouse_is_deleted + alm-16046_hive_data_warehouse_permission_is_modified + alm-16047_hiveserver_has_been_deregistered_from_zookeeper + alm-16048_tez_or_spark_library_path_does_not_exist + alm-17003_oozie_service_unavailable + alm-17004_oozie_heap_memory_usage_exceeds_the_threshold + alm-17005_oozie_non_heap_memory_usage_exceeds_the_threshold + alm-17006_oozie_direct_memory_usage_exceeds_the_threshold + alm-17007_garbage_collection_gc_time_of_the_oozie_process_exceeds_the_threshold + alm-18000_yarn_service_unavailable + alm-18002_nodemanager_heartbeat_lost + alm-18003_nodemanager_unhealthy + alm-18008_heap_memory_usage_of_resourcemanager_exceeds_the_threshold + alm-18009_heap_memory_usage_of_jobhistoryserver_exceeds_the_threshold + alm-18010_resourcemanager_gc_time_exceeds_the_threshold + 
alm-18011_nodemanager_gc_time_exceeds_the_threshold + alm-18012_jobhistoryserver_gc_time_exceeds_the_threshold + alm-18013_resourcemanager_direct_memory_usage_exceeds_the_threshold + alm-18014_nodemanager_direct_memory_usage_exceeds_the_threshold + alm-18015_jobhistoryserver_direct_memory_usage_exceeds_the_threshold + alm-18016_non_heap_memory_usage_of_resourcemanager_exceeds_the_threshold + alm-18017_non_heap_memory_usage_of_nodemanager_exceeds_the_threshold + alm-18018_nodemanager_heap_memory_usage_exceeds_the_threshold + alm-18019_non_heap_memory_usage_of_jobhistoryserver_exceeds_the_threshold + alm-18020_yarn_task_execution_timeout + alm-18021_mapreduce_service_unavailable + alm-18022_insufficient_yarn_queue_resources + alm-18023_number_of_pending_yarn_tasks_exceeds_the_threshold + alm-18024_pending_yarn_memory_usage_exceeds_the_threshold + alm-18025_number_of_terminated_yarn_tasks_exceeds_the_threshold + alm-18026_number_of_failed_yarn_tasks_exceeds_the_threshold + alm-19000_hbase_service_unavailable + alm-19006_hbase_replication_sync_failed + alm-19007_hbase_gc_time_exceeds_the_threshold + alm-19008_heap_memory_usage_of_the_hbase_process_exceeds_the_threshold + alm-19009_direct_memory_usage_of_the_hbase_process_exceeds_the_threshold + alm-19011_regionserver_region_number_exceeds_the_threshold + alm-19012_hbase_system_table_directory_or_file_lost + alm-19013_duration_of_regions_in_transaction_state_exceeds_the_threshold + alm-19014_capacity_quota_usage_on_zookeeper_exceeds_the_threshold_severely + alm-19015_quantity_quota_usage_on_zookeeper_exceeds_the_threshold + alm-19016_quantity_quota_usage_on_zookeeper_exceeds_the_threshold_severely + alm-19017_capacity_quota_usage_on_zookeeper_exceeds_the_threshold + alm-19018_hbase_compaction_queue_size_exceeds_the_threshold + alm-19019_number_of_hbase_hfiles_to_be_synchronized_exceeds_the_threshold + alm-19020_number_of_hbase_wal_files_to_be_synchronized_exceeds_the_threshold + alm-20002_hue_service_unavailable + alm-24000_flume_service_unavailable + alm-24001_flume_agent_exception + alm-24003_flume_client_connection_interrupted + alm-24004_exception_occurs_when_flume_reads_data + alm-24005_exception_occurs_when_flume_transmits_data + alm-24006_heap_memory_usage_of_flume_server_exceeds_the_threshold + alm-24007_flume_server_direct_memory_usage_exceeds_the_threshold + alm-24008_flume_server_non_heap_memory_usage_exceeds_the_threshold + alm-24009_flume_server_garbage_collection_gc_time_exceeds_the_threshold + alm-24010_flume_certificate_file_is_invalid_or_damaged + alm-24011_flume_certificate_file_is_about_to_expire + alm-24012_flume_certificate_file_has_expired + alm-24013_flume_monitorserver_certificate_file_is_invalid_or_damaged + alm-24014_flume_monitorserver_certificate_is_about_to_expire + alm-24015_flume_monitorserver_certificate_file_has_expired + alm-25000_ldapserver_service_unavailable + alm-25004_abnormal_ldapserver_data_synchronization + alm-25005_nscd_service_exception + alm-25006_sssd_service_exception + alm-25500_krbserver_service_unavailable + alm-26051_storm_service_unavailable + alm-26052_number_of_available_supervisors_of_the_storm_service_is_less_than_the_threshold + alm-26053_storm_slot_usage_exceeds_the_threshold + alm-26054_nimbus_heap_memory_usage_exceeds_the_threshold + alm-27001_dbservice_service_unavailable + alm-27003_dbservice_heartbeat_interruption_between_the_active_and_standby_nodes + alm-27004_data_inconsistency_between_active_and_standby_dbservices + alm-27005_database_connections_usage_exceeds_the_threshold + 
alm-27006_disk_space_usage_of_the_data_directory_exceeds_the_threshold + alm-27007_database_enters_the_read-only_mode + alm-38000_kafka_service_unavailable + alm-38001_insufficient_kafka_disk_capacity + alm-38002_kafka_heap_memory_usage_exceeds_the_threshold + alm-38004_kafka_direct_memory_usage_exceeds_the_threshold + alm-38005_gc_duration_of_the_broker_process_exceeds_the_threshold + alm-38006_percentage_of_kafka_partitions_that_are_not_completely_synchronized_exceeds_the_threshold + alm-38007_status_of_kafka_default_user_is_abnormal + alm-38008_abnormal_kafka_data_directory_status + alm-38009_busy_broker_disk_i_os_applicable_to_versions_later_than_mrs_3.1.0 + alm-38010_topics_with_single_replica + alm-43001_spark2x_service_unavailable + alm-43006_heap_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold + alm-43007_non-heap_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold + alm-43008_the_direct_memory_usage_of_the_jobhistory2x_process_exceeds_the_threshold + alm-43009_jobhistory2x_process_gc_time_exceeds_the_threshold + alm-43010_heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold + alm-43011_non-heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold + alm-43012_direct_heap_memory_usage_of_the_jdbcserver2x_process_exceeds_the_threshold + alm-43013_jdbcserver2x_process_gc_time_exceeds_the_threshold + alm-43017_jdbcserver2x_process_full_gc_number_exceeds_the_threshold + alm-43018_jobhistory2x_process_full_gc_number_exceeds_the_threshold + alm-43019_heap_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold + alm-43020_non-heap_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold + alm-43021_direct_memory_usage_of_the_indexserver2x_process_exceeds_the_threshold + alm-43022_indexserver2x_process_gc_time_exceeds_the_threshold + alm-43023_indexserver2x_process_full_gc_number_exceeds_the_threshold + alm-44004_presto_coordinator_resource_group_queuing_tasks_exceed_the_threshold + alm-44005_presto_coordinator_process_gc_time_exceeds_the_threshold + alm-44006_presto_worker_process_gc_time_exceeds_the_threshold + alm-45175_average_time_for_calling_obs_metadata_apis_is_greater_than_the_threshold + alm-45176_success_rate_of_calling_obs_metadata_apis_is_lower_than_the_threshold + alm-45177_success_rate_of_calling_obs_data_read_apis_is_lower_than_the_threshold + alm-45178_success_rate_of_calling_obs_data_write_apis_is_lower_than_the_threshold + alm-45179_number_of_failed_obs_readfully_api_calls_exceeds_the_threshold + alm-45180_number_of_failed_obs_read_api_calls_exceeds_the_threshold + alm-45181_number_of_failed_obs_write_api_calls_exceeds_the_threshold + alm-45182_number_of_throttled_obs_api_calls_exceeds_the_threshold + alm-45275_ranger_service_unavailable + alm-45276_abnormal_rangeradmin_status + alm-45277_rangeradmin_heap_memory_usage_exceeds_the_threshold + alm-45278_rangeradmin_direct_memory_usage_exceeds_the_threshold + alm-45279_rangeradmin_non_heap_memory_usage_exceeds_the_threshold + alm-45280_rangeradmin_gc_duration_exceeds_the_threshold + alm-45281_usersync_heap_memory_usage_exceeds_the_threshold + alm-45282_usersync_direct_memory_usage_exceeds_the_threshold + alm-45283_usersync_non_heap_memory_usage_exceeds_the_threshold + alm-45284_usersync_garbage_collection_gc_time_exceeds_the_threshold + alm-45285_tagsync_heap_memory_usage_exceeds_the_threshold + alm-45286_tagsync_direct_memory_usage_exceeds_the_threshold + alm-45287_tagsync_non_heap_memory_usage_exceeds_the_threshold + 
alm-45288_tagsync_garbage_collection_gc_time_exceeds_the_threshold + alm-45425_clickhouse_service_unavailable + alm-45426_clickhouse_service_quantity_quota_usage_in_zookeeper_exceeds_the_threshold + alm-45427_clickhouse_service_capacity_quota_usage_in_zookeeper_exceeds_the_threshold + alm-45428_clickhouse_disk_i_o_exception diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/audit/configuring_audit_log_dumping.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/audit/configuring_audit_log_dumping.rst new file mode 100644 index 0000000..30ebb7b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/audit/configuring_audit_log_dumping.rst @@ -0,0 +1,71 @@ +:original_name: admin_guide_000086.html + +.. _admin_guide_000086: + +Configuring Audit Log Dumping +============================= + +Scenario +-------- + +The audit logs of FusionInsight Manager are stored in the database by default. If the audit logs are retained for a long time, the disk space of the data directory may be insufficient. To store audit logs to another archive server, administrators can set the required dump parameters to automatically dump these logs. This facilitates the management of audit logs. + +If you do not configure the audit log dumping, the system automatically saves the audit logs to a file when the number of audit logs reaches 100,000 pieces. The save path is **${BIGDATA_DATA_HOME} /dbdata_om/dumpData/iam/operatelog** on the active management node. The file name format is **OperateLog_store\_**\ *YY_MM_DD_HH_MM_SS*\ **.csv**. The maximum number of historical audit log files is 50. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Audit** > **Configuration**. + +#. Click the switch on the right of **Audit Log Dumping Flag**. + + **Audit Log Dump** is disabled by default. If |image1| is displayed, **Audit Log Dump** is enabled. + +#. Set the dump parameters based on information provided in :ref:`Table 1 ` + + .. _admin_guide_000086__table61365090: + + .. table:: **Table 1** Audit log dump parameters + + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+ + | Parameter | Description | Value | + +=======================+============================================================================================================================================================================================================================================================+=============================================+ + | SFTP IP Mode | Mode of the destination IP address. The value can be **IPv4** or **IPv6**. | IPv4 | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+ + | SFTP IP | SFTP server for storing dumped audit logs. You are advised to use the SFTP service based on SSH v2 to prevent security risks. 
| **192.168.10.51** (example value) | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+ + | SFTP Port | Connection port of the SFTP server for storing dumped audit logs | **22** (example value) | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+ + | Save Path | Path for storing audit logs on the SFTP server | **/opt/omm/oms/auditLog** (example value) | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+ + | SFTP Username | User name for logging in to the SFTP server | **root** (example value) | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+ + | SFTP Password | Password for logging in to the SFTP server | *Password for logging into the SFTP server* | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+ + | SFTP Public key | Specifies the public key of the SFTP server. This parameter is optional. You are advised to set the public key of the SFTP server. Otherwise, security risks may exist. | ``-`` | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+ + | Dumping Mode | Dump mode. Value options are as follows: | - By Quantity | + | | | - By Time | + | | - **By Quantity**: If the number of pieces of logs reaches the value of this parameter (**100000** by default), the logs are dumped. | | + | | - **By Time**: specifies the date when logs are dumped. The dumping frequency is once a year. | | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+ + | Dumping Date | This parameter is available only when **Dumping Mode** is set to **By time**. After you select a dump date, the system starts dumping on this date. 
The logs to be dumped include all the audit logs generated before January 1 00:00 of the current year. | 11-06 | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+ + + .. note:: + + If the SFTP public key is empty, the system displays a security risk warning. Evaluate the security risk and then save the configuration. + +#. Click **OK** to complete the settings. + + .. note:: + + Key fields in the audit log dump file are as follows: + + - **USERTYPE** indicates the user type. Value **0** indicates a human-machine user, and value **1** indicates a machine-machine user. + - **LOGLEVEL** indicates the security level. Value **0** indicates Critical, value **1** indicates Major, value **2** indicates Minor, and value **3** indicates Warning. + - **OPERATERESULT** indicates the operation result. Value **0** indicates that the operation is successful, and value **1** indicates that the operation is failed. + +.. |image1| image:: /_static/images/en-us_image_0263899218.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/audit/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/audit/index.rst new file mode 100644 index 0000000..0153645 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/audit/index.rst @@ -0,0 +1,16 @@ +:original_name: admin_guide_000084.html + +.. _admin_guide_000084: + +Audit +===== + +- :ref:`Overview ` +- :ref:`Configuring Audit Log Dumping ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview + configuring_audit_log_dumping diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/audit/overview.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/audit/overview.rst new file mode 100644 index 0000000..2299912 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/audit/overview.rst @@ -0,0 +1,35 @@ +:original_name: admin_guide_000085.html + +.. _admin_guide_000085: + +Overview +======== + +Scenario +-------- + +The **Audit** page displays the user operations on Manager. On this page, administrators can view historical user operations on Manager. For details about the audit information, see :ref:`Audit Logs `. + + +Overview +-------- + +Log in to FusionInsight Manager and choose **Audit**. The **Audit** page displays the operation type, risk level, start time, end time, user, source, host name, service, instance, and operation result. + +- You can select audit logs at the **Critical**, **Major**, **Minor**, or **Notice** level from the **All risk levels** drop-down list. +- In **Advanced Search**, you can set filter criteria to query audit logs. + + #. You can query audit logs by user management, cluster, service, and health in the **Operation Type** column. + #. In the **Service** column, you can select a service to query corresponding audit logs. + + .. note:: + + You can select **--** to search for audit logs using all other search criteria except services. + + #. You can query audit logs by operation result. Value options are **All**, **Successful**, **Failed**, and **Unknown**. + +- You can click |image1| to manually refresh the current page or click |image2| to filter the columns displayed in the page. 
+- Click **Export All** to export all audit information at a time. The audit information can be exported in **TXT** or **CSV** format. + +.. |image1| image:: /_static/images/en-us_image_0263899268.png +.. |image2| image:: /_static/images/en-us_image_0000001369925585.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_clickhouse_metadata.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_clickhouse_metadata.rst new file mode 100644 index 0000000..7ee2d46 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_clickhouse_metadata.rst @@ -0,0 +1,77 @@ +:original_name: admin_guide_000348.html + +.. _admin_guide_000348: + +Backing Up ClickHouse Metadata +============================== + +Scenario +-------- + +To ensure ClickHouse metadata security or before a major operation on ZooKeeper (such as upgrade or migration), you need to back up ClickHouse metadata. The backup data can be used to recover the system if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impacts on services. + +You can create a backup task on FusionInsight Manager to back up ClickHouse metadata. Both automatic and manual backup tasks are supported. + +.. important:: + + This function is supported only by MRS 3.1.0 or later. + +Prerequisites +------------- + +- If data needs to be backed up to the remote HDFS, you have prepared a standby cluster for data backup. The authentication mode of the standby cluster is the same as that of the active cluster. For other backup modes, you do not need to prepare the standby cluster. + +- The backup type, period, policy, and other specifications have been planned based on the service requirements and you have checked whether *Data storage path*\ **/LocalBackup/** has sufficient space on the active and standby management nodes. +- If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. + +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. Click **Create**. + +#. Set **Name** to the name of the backup task. + +#. Select the cluster to be operated from **Backup Object**. + +#. Set **Mode** to the type of the backup task. **Periodic** indicates that the backup task is periodically executed. **Manual** indicates that the backup task is manually executed. + + To create a periodic backup task, set the following parameters: + + - **Started**: indicates the time when the task is started for the first time. + - **Period**: indicates the task execution interval. The options include **Hours** and **Days**. + - **Backup Policy**: Only **Full backup every time** is supported. + +#. In **Configuration**, select **ClickHouse** under **Metadata and other data**. + +#. Set **Path Type** of **ClickHouse** to a backup directory type. 
+ + The following backup directory types are supported: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node and the standby management node automatically synchronizes the backup files. + + The default storage directory is *Data storage path*\ **/LocalBackup/**, for example, **/srv/BigData/LocalBackup**. + + If you select this option, you need to set the maximum number of replicas to specify the number of backup file sets that can be retained in the backup directory. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + This value option is available only after you configure the environment by referring to :ref:`How Do I Configure the Environment When I Create a ClickHouse Backup Task on FusionInsight Manager and Set the Path Type to RemoteHDFS? `. + + You also need to configure the following parameters: + + - **Destination NameService Name**: indicates the NameService name of the standby cluster, for example, **hacluster**. You can obtain it from the **NameService Management** page of HDFS of the standby cluster. + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Target NameNode IP Address**: indicates the IP address of the NameNode service plane in the standby cluster. It can be of an active or standby node. + - **Target Path**: indicates the HDFS directory for storing standby cluster backup data. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + +#. Click **OK**. + +#. In the **Operation** column of the created task in the backup task list, click **More** and select **Back Up Now** to execute the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The format of the subdirectory name is *Backup task name*\ **\_**\ *Task creation time*, and the subdirectory is used to save data source backup files. The format of the backup file name is *Data source*\ **\_**\ *Task execution time*\ **.tar.gz**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_clickhouse_service_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_clickhouse_service_data.rst new file mode 100644 index 0000000..e4b80df --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_clickhouse_service_data.rst @@ -0,0 +1,121 @@ +:original_name: admin_guide_000349.html + +.. _admin_guide_000349: + +Backing Up ClickHouse Service Data +================================== + +Scenario +-------- + +To ensure ClickHouse service data security routinely or before a major operation on ClickHouse (such as upgrade or migration), you need to back up ClickHouse service data. The backup data can be used to recover the system if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impacts on services. 
+ +You can create a backup task on FusionInsight Manager to back up ClickHouse service data. Both automatic and manual backup tasks are supported. + +.. important:: + + This function is supported only by MRS 3.1.0 or later. + +Prerequisites +------------- + +- If data needs to be backed up to the remote HDFS, you have prepared a standby cluster for data backup. The authentication mode of the standby cluster is the same as that of the active cluster. For other backup modes, you do not need to prepare the standby cluster. +- If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. + +- You have planned the backup type, period, object, and directory based on service requirements. +- The HDFS in the standby cluster has sufficient space. You are advised to save backup files in a custom directory. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. Click **Create**. + +#. Set **Name** to the name of the backup task. + +#. Select the cluster to be operated from **Backup Object**. + +#. Set **Mode** to the type of the backup task. + + **Periodic** indicates that the backup task is executed by the system periodically. **Manual** indicates that the backup task is executed manually. + + .. table:: **Table 1** Periodic backup parameters + + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=================================================================================================================================================+ + | Started | Indicates the time when the task is started for the first time. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Period | Indicates the task execution interval. The options include **Hours** and **Days**. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Backup Policy | - **Full backup at the first time and incremental backup subsequently** | + | | - **Full backup every time** | + | | - **Full backup once every n times** | + | | | + | | .. note:: | + | | | + | | - Incremental backup is not supported when Manager data and component metadata are backed up. Only **Full backup every time** is supported. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. In **Configuration**, select **ClickHouse** under **Service Data**. + +#. Set **Path Type** of **ClickHouse** to a backup directory type. + + Currently, only the **RemoteHDFS** type is available. + + **RemoteHDFS**: indicates that backup files are stored in HDFS of the standby cluster. 
+ + This value option is available only after you configure the environment by referring to :ref:`How Do I Configure the Environment When I Create a ClickHouse Backup Task on FusionInsight Manager and Set the Path Type to RemoteHDFS? `. + + You also need to configure the following parameters: + + - **Destination NameService Name**: indicates the NameService name of the standby cluster, for example, **hacluster**. You can obtain it from the **NameService Management** page of HDFS of the standby cluster. + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Target NameNode IP Address**: indicates the IP address of the NameNode service plane in the standby cluster. It can be of an active or standby node. + - **Target Path**: indicates the HDFS directory for storing standby cluster backup data. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + +#. Set **Maximum Number of Recovery Points** to the number of snapshots that can be retained in the cluster. + +#. Set **Backup Content** to one or more ClickHouse tables to be backed up. + + You can select backup data using either of the following methods: + + - Adding a backup data file + + Click the name of a database in the navigation tree to show all the tables in the database, and select specified tables. + + - Selecting using regular expressions + + a. Click **Query Regular Expression**. + b. Enter the database where the ClickHouse tables are located in the first text box as prompted. The database must be an existing database, for example, **default**. + c. Enter a regular expression in the second text box. Standard regular expressions are supported. For example, to get all tables in the database, enter **([\\s\\S]*?)**. To get tables named in the format of letters and digits, for example, **tb1**, enter **tb\\d\***. + d. Click **Refresh** to view the displayed tables in **Directory Name**. + e. Click **Synchronize** to save the result. + + .. note:: + + - When entering regular expressions, click |image1| or |image2| to add or delete an expression. + - If the selected table or directory is incorrect, click **Clear Selected Node** to deselect it. + +#. Click **Verify** to check whether the backup task is configured correctly. + + The possible causes of the verification failure are as follows: + + - The target NameNode IP address is incorrect. + - The directory or table to be backed up does not exist. + - The name of the NameService is incorrect. + +#. Click **OK**. + +#. In the **Operation** column of the created task in the backup task list, click **More** and select **Back Up Now** to execute the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The format of the subdirectory name is *Data source_Task creation time*, and the subdirectory is used to save the latest data source backup files. + +..
|image1| image:: /_static/images/en-us_image_0000001197336255.png +.. |image2| image:: /_static/images/en-us_image_0000001151336594.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_dbservice_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_dbservice_data.rst new file mode 100644 index 0000000..3163e90 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_dbservice_data.rst @@ -0,0 +1,148 @@ +:original_name: admin_guide_000203.html + +.. _admin_guide_000203: + +Backing Up DBService Data +========================= + +Scenario +-------- + +To ensure DBService service data security routinely or before a major operation on DBService (such as upgrade or migration), you need to back up DBService data. The backup data can be used to recover the system if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impacts on services. + +You can create a backup task on FusionInsight Manager to back up DBService data. Both automatic and manual backup tasks are supported. + +Prerequisites +------------- + +- If data needs to be backed up to the remote HDFS, you have prepared a standby cluster for data backup. The authentication mode of the standby cluster is the same as that of the active cluster. For other backup modes, you do not need to prepare the standby cluster. +- If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. +- The backup type, period, policy, and other specifications have been planned based on the service requirements and you have checked whether *Data storage path*\ **/LocalBackup/** has sufficient space on the active and standby management nodes. +- If you want to back up data to NAS, you have deployed the NAS server in advance. +- If you want to back up data to OBS, you have connected the current cluster to OBS and have the permission to access OBS. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. Click **Create**. + +#. Set **Name** to the name of the backup task. + +#. Select the cluster to be operated from **Backup Object**. + +#. Set **Mode** to the type of the backup task. + + **Periodic** indicates that the backup task is executed by the system periodically. **Manual** indicates that the backup task is executed manually. + + .. 
table:: **Table 1** Periodic backup parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=======================================================================================================================================================================================================================================================================================+ + | Started | Indicates the time when the task is started for the first time. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Period | Indicates the task execution interval. The options include **Hours** and **Days**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Backup Policy | - **Full backup at the first time and incremental backup subsequently** | + | | - **Full backup every time** | + | | - **Full backup once every n times** | + | | | + | | .. note:: | + | | | + | | - Incremental backup is not supported when Manager data and component metadata are backed up. Only **Full backup every time** is supported. | + | | - If **Path Type** is set to **NFS** or **CIFS**, incremental backup cannot be used. When incremental backup is used for NFS or CIFS backup, the latest full backup data is updated each time the incremental backup is performed. Therefore, no new recovery point is generated. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. In **Configuration**, select **DBService**. + + .. note:: + + If there are multiple DBService services, all DBService services are backed up by default. You can click **Assign Service** to specify the services to be backed up. + +#. Set **Path Type** of **DBService** to a backup directory type. + + The following backup directory types are supported: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node and the standby management node automatically synchronizes the backup files. + + The default storage directory is *Data storage path*\ **/LocalBackup/**, for example, **/srv/BigData/LocalBackup**. + + If you select this option, you need to set the maximum number of replicas to specify the number of backup file sets that can be retained in the backup directory. + + - **LocalHDFS**: indicates that the backup files are stored in the HDFS directory of the current cluster. 
+ + If you select this option, set the following parameters: + + - **Target Path**: indicates the HDFS directory for storing the backup files. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select this option, set the following parameters: + + - **Destination NameService Name**: indicates the NameService name of the standby cluster. You can set it to the NameService name (**haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**) of the built-in remote cluster of the cluster, or the NameService name of a configured remote cluster. + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Target NameNode IP Address**: indicates the IP address of the NameNode service plane in the standby cluster. It can be of an active or standby node. + - **Target Path**: indicates the HDFS directory for storing standby cluster backup data. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the source cluster. + + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. 
+ - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Server Shared Path**: indicates the backup path on the SFTP server. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **OBS**: indicates that backup files are stored in OBS. + + If you select this option, set the following parameters: + + - **Target Path**: indicates the OBS directory for storing backup data. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + .. note:: + + Only MRS 3.1.0 or later supports data backup to OBS. + +#. Click **OK**. + +#. In the **Operation** column of the created task in the backup task list, click **More** and select **Back Up Now** to execute the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The format of the subdirectory name is *Backup task name*\ **\_**\ *Task creation time*, and the subdirectory is used to save data source backup files. + + The format of the backup file name is *Version*\ **\_**\ *Data source*\ **\_**\ *Task execution time*\ **.tar.gz**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hbase_metadata.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hbase_metadata.rst new file mode 100644 index 0000000..cea8006 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hbase_metadata.rst @@ -0,0 +1,145 @@ +:original_name: admin_guide_000204.html + +.. _admin_guide_000204: + +Backing Up HBase Metadata +========================= + +Scenario +-------- + +To ensure HBase metadata security (including tableinfo files and HFiles) or before a major operation on HBase system tables (such as upgrade or migration), you need to back up HBase metadata to prevent HBase service unavailability caused by HBase system table directory or file damages. The backup data can be used to recover the system if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impacts on services. + +You can create a backup task on FusionInsight Manager to back up HBase metadata. 
Both automatic and manual backup tasks are supported. + +Prerequisites +------------- + +- If data needs to be backed up to the remote HDFS, you have prepared a standby cluster for data backup. The authentication mode of the standby cluster is the same as that of the active cluster. For other backup modes, you do not need to prepare the standby cluster. + +- If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. + +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. + +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. + +- The backup type, period, policy, and other specifications have been planned based on the service requirements and you have checked whether *Data storage path*\ **/LocalBackup/** has sufficient space on the active and standby management nodes. + +- If you want to back up data to NAS, you have deployed the NAS server in advance. +- The **fs.defaultFS** parameter settings of HBase are the same as those of Yarn and HDFS. +- If HBase data is stored in the local HDFS, HBase metadata can be backed up to OBS. If HBase data is stored in OBS, data backup is not supported. +- If you want to back up data to OBS, you have connected the current cluster to OBS and have the permission to access OBS. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. Click **Create**. + +#. Set **Name** to the name of the backup task. + +#. Select the cluster to be operated from **Backup Object**. + +#. Set **Mode** to the type of the backup task. + + **Periodic** indicates that the backup task is executed by the system periodically. **Manual** indicates that the backup task is executed manually. + + .. table:: **Table 1** Periodic backup parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=======================================================================================================================================================================================================================================================================================+ + | Started | Indicates the time when the task is started for the first time. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Period | Indicates the task execution interval. The options include **Hours** and **Days**. 
| + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Backup Policy | - **Full backup at the first time and incremental backup subsequently** | + | | - **Full backup every time** | + | | - **Full backup once every n times** | + | | | + | | .. note:: | + | | | + | | - Incremental backup is not supported when Manager data and component metadata are backed up. Only **Full backup every time** is supported. | + | | - If **Path Type** is set to **NFS** or **CIFS**, incremental backup cannot be used. When incremental backup is used for NFS or CIFS backup, the latest full backup data is updated each time the incremental backup is performed. Therefore, no new recovery point is generated. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. In **Configuration**, select **HBase** under **Metadata and other data**. + + .. note:: + + If there are multiple HBase services, all HBase services are backed up by default. You can click **Assign Service** to specify the services to be backed up. + +#. Set **Path Type** of **HBase** to a backup directory type. + + The following backup directory types are supported: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node and the standby management node automatically synchronizes the backup files. + + The default storage directory is *Data storage path*\ **/LocalBackup/**, for example, **/srv/BigData/LocalBackup**. + + If you select this option, you need to set the maximum number of replicas to specify the number of backup file sets that can be retained in the backup directory. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select this option, set the following parameters: + + - **Destination NameService Name**: indicates the NameService name of the standby cluster. You can set it to the NameService name (**haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**) of the built-in remote cluster of the cluster, or the NameService name of a configured remote cluster. + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Target NameNode IP Address**: indicates the IP address of the NameNode service plane in the standby cluster. It can be of an active or standby node. + - **Target Path**: indicates the HDFS directory for storing standby cluster backup data. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the source cluster. 
+ + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Server Shared Path**: indicates the backup path on the SFTP server. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **OBS**: indicates that backup files are stored in OBS. + + If you select this option, set the following parameters: + + - **Target Path**: indicates the OBS directory for storing backup data. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + .. note:: + + Only MRS 3.1.0 or later supports data backup to OBS. + +#. Click **OK**. + +#. In the **Operation** column of the created task in the backup task list, click **More** and select **Back Up Now** to execute the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. 
The format of the subdirectory name is *Backup task name*\ **\_**\ *Task creation time*, and the subdirectory is used to save data source backup files. The format of the backup file name is *Version*\ **\_**\ *Data source*\ **\_**\ *Task execution time*\ **.tar.gz**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hbase_service_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hbase_service_data.rst new file mode 100644 index 0000000..82a4846 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hbase_service_data.rst @@ -0,0 +1,173 @@ +:original_name: admin_guide_000205.html + +.. _admin_guide_000205: + +Backing Up HBase Service Data +============================= + +Scenario +-------- + +To ensure HBase service data security routinely or before a major operation on HBase (such as upgrade or migration), you need to back up HBase service data. The backup data can be used to recover the system if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impacts on services. + +You can create a backup task on FusionInsight Manager to back up HBase service data. Both automatic and manual backup tasks are supported. + +The following situations may occur during the HBase service data backup: + +- When a user creates an HBase table, **KEEP_DELETED_CELLS** is set to **false** by default. When the user backs up this HBase table, deleted data will be backed up and junk data may exist after data restoration. This parameter can be set to **true** manually when an HBase table is created based on service requirements. +- When a user manually specifies the timestamp when writing data into an HBase table and the specified time is earlier than the last backup time of the HBase table, new data may not be backed up in incremental backup tasks. +- The HBase backup function cannot back up the access control lists (ACLs) for reading, writing, executing, creating, and managing HBase global or namespaces. After HBase data is restored, you need to reset the role permissions on FusionInsight Manager. +- If the backup data of the standby cluster is lost in an existing HBase backup task, the next incremental backup will fail, and you need to create an HBase backup task again. However, the next full backup task will be normal. + +Prerequisites +------------- + +- If data needs to be backed up to the remote HDFS, you have prepared a standby cluster for data backup. The authentication mode of the standby cluster is the same as that of the active cluster. For other backup modes, you do not need to prepare the standby cluster. + +- If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. 
+- Backup policies, including the backup task type, period, backup object, backup directory, and Yarn queue required by the backup task are planned based on service requirements. +- The HDFS in the standby cluster has sufficient space. You are advised to save backup files in a custom directory. +- On the HDFS client, you have executed the **hdfs lsSnapshottableDir** command as user **hdfs** to check the list of directories for which HDFS snapshots have been created in the current cluster and ensured that the HDFS parent directory or subdirectory where data files to be backed up are stored does not have HDFS snapshots. Otherwise, the backup task cannot be created. +- If you want to back up data to NAS, you have deployed the NAS server in advance. +- The **fs.defaultFS** parameter settings of HBase are the same as those of Yarn and HDFS. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. Click **Create**. + +#. Set **Name** to the name of the backup task. + +#. Select the cluster to be operated from **Backup Object**. + +#. Set **Mode** to the type of the backup task. + + **Periodic** indicates that the backup task is executed by the system periodically. **Manual** indicates that the backup task is executed manually. + + .. table:: **Table 1** Periodic backup parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=======================================================================================================================================================================================================================================================================================+ + | Started | Indicates the time when the task is started for the first time. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Period | Indicates the task execution interval. The options include **Hours** and **Days**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Backup Policy | - **Full backup at the first time and incremental backup subsequently** | + | | - **Full backup every time** | + | | - **Full backup once every n times** | + | | | + | | .. note:: | + | | | + | | - Incremental backup is not supported when Manager data and component metadata are backed up. Only **Full backup every time** is supported. | + | | - If **Path Type** is set to **NFS** or **CIFS**, incremental backup cannot be used. When incremental backup is used for NFS or CIFS backup, the latest full backup data is updated each time the incremental backup is performed. Therefore, no new recovery point is generated. 
| + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. In **Configuration**, choose **HBase** > **HBase** under **Service data**. + +#. Set **Path Type** of **HBase** to a backup directory type. + + The following backup directory types are supported: + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select this option, set the following parameters: + + - **Destination NameService Name**: indicates the NameService name of the standby cluster. You can set it to the NameService name (**haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**) of the built-in remote cluster of the cluster, or the NameService name of a configured remote cluster. + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Target NameNode IP Address**: indicates the IP address of the NameNode service plane in the standby cluster. It can be of an active or standby node. + - **Target Path**: indicates the HDFS directory for storing standby cluster backup data. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. 
+ + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Server Shared Path**: indicates the backup path on the SFTP server. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + +#. Set **Maximum Number of Recovery Points** to the number of snapshots that can be retained in the cluster. + +#. Set **Backup Content** to one or multiple HBase tables to be backed up. + + You can select backup data using either of the following methods: + + - Adding a backup data file + + Click the name of a database in the navigation tree to show all the tables in the database, and select specified tables. + + - Selecting using regular expressions + + a. Click **Query Regular Expression**. + b. Enter the namespace where the HBase tables are located in the first text box as prompted. The namespace must be the same as the existing namespace, for example, **default**. + c. Enter a regular expression in the second text box. 
Standard regular expressions are supported. For example, to get all tables in the namespace, enter **([\\s\\S]*?)**. To get tables whose names consist of letters and digits, for example, **tb1**, enter **tb\\d\***. + d. Click **Refresh** to view the displayed tables in **Directory Name**. + e. Click **Synchronize** to save the result. + + .. note:: + + - When entering regular expressions, click |image1| or |image2| to add or delete an expression. + - If the selected table or directory is incorrect, click **Clear Selected Node** to deselect it. + +#. Click **Verify** to check whether the backup task is configured correctly. + + The possible causes of the verification failure are as follows: + + - The target NameNode IP address is incorrect. + - The queue name is incorrect. + - The parent directory or subdirectory of the HDFS directory where HBase table data files to be backed up are stored has HDFS snapshots. + - The directory or table to be backed up does not exist. + +#. Click **OK**. + +#. In the **Operation** column of the created task in the backup task list, click **More** and select **Back Up Now** to execute the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The format of the subdirectory name is *Backup task name_Data source_Task creation time*, and the subdirectory is used to save latest data source backup files. All the backup file sets are stored in the related snapshot directories. + +.. |image1| image:: /_static/images/en-us_image_0263899673.png +.. |image2| image:: /_static/images/en-us_image_0263899411.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hdfs_service_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hdfs_service_data.rst new file mode 100644 index 0000000..2debb7f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hdfs_service_data.rst @@ -0,0 +1,174 @@ +:original_name: admin_guide_000209.html + +.. _admin_guide_000209: + +Backing Up HDFS Service Data +============================ + +Scenario +-------- + +To ensure HDFS service data security routinely or before a major operation on HDFS (such as upgrade or migration), you need to back up HDFS service data. The backup data can be used to recover the system if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impacts on services. + +You can create a backup task on FusionInsight Manager to back up HDFS service data. Both automatic and manual backup tasks are supported. + +.. note:: + + Encrypted directories cannot be backed up or restored. + +Prerequisites +------------- + +- If data needs to be backed up to the remote HDFS, you have prepared a standby cluster for data backup. The authentication mode of the standby cluster is the same as that of the active cluster. For other backup modes, you do not need to prepare the standby cluster. +- If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. 
+- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. + +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. + +- Backup policies, including the backup task type, period, backup object, backup directory, and Yarn queue required by the backup task are planned based on service requirements. +- The HDFS in the standby cluster has sufficient space. You are advised to save backup files in a custom directory. +- On the HDFS client, you have executed the **hdfs lsSnapshottableDir** command as user **hdfs** to check the list of directories for which HDFS snapshots have been created in the current cluster and ensured that the HDFS parent directory or subdirectory where data files to be backed up are stored does not have HDFS snapshots. Otherwise, the backup task cannot be created. +- If you want to back up data to NAS, you have deployed the NAS server in advance. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. Click **Create**. + +#. Set **Name** to the name of the backup task. + +#. Select the cluster to be operated from **Backup Object**. + +#. Set **Mode** to the type of the backup task. + + **Periodic** indicates that the backup task is executed by the system periodically. **Manual** indicates that the backup task is executed manually. + + .. table:: **Table 1** Periodic backup parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=======================================================================================================================================================================================================================================================================================+ + | Started | Indicates the time when the task is started for the first time. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Period | Indicates the task execution interval. The options include **Hours** and **Days**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Backup Policy | - **Full backup at the first time and incremental backup subsequently** | + | | - **Full backup every time** | + | | - **Full backup once every n times** | + | | | + | | .. note:: | + | | | + | | - Incremental backup is not supported when Manager data and component metadata are backed up. Only **Full backup every time** is supported. | + | | - If **Path Type** is set to **NFS** or **CIFS**, incremental backup cannot be used. 
When incremental backup is used for NFS or CIFS backup, the latest full backup data is updated each time the incremental backup is performed. Therefore, no new recovery point is generated. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. In **Configuration**, select **HDFS**. + +#. Set **Path Type** of **HDFS** to a backup directory type. + + The following backup directory types are supported: + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select this option, set the following parameters: + + - **Destination NameService Name**: indicates the NameService name of the standby cluster. You can set it to the NameService name (**haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**) of the built-in remote cluster of the cluster, or the NameService name of a configured remote cluster. + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Target NameNode IP Address**: indicates the IP address of the NameNode service plane in the standby cluster. It can be of an active or standby node. + - **Target Path**: indicates the HDFS directory for storing standby cluster backup data. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. 
+ - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Server Shared Path**: indicates the backup path on the SFTP server. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + +#. Set **Maximum Number of Recovery Points** to the number of snapshots that can be retained in the cluster. + +#. Set **Backup Content** to one or multiple HDFS directories to be backed up based on service requirements. 
+ + You can select backup data using either of the following methods: + + - Adding a backup data file + + Click the name of a directory in the navigation tree to show all its subdirectories, and select the specified directories. + + - Selecting using regular expressions + + a. Click **Query Regular Expression**. + b. Enter the full path of the parent directory in the first text box as prompted. The directory must be the same as the existing directory, for example, **/tmp**. + c. Enter a regular expression in the second text box. Standard regular expressions are supported. For example, to get all files or subdirectories in the parent directory, enter **([\\s\\S]*?)**. To get files whose names consist of letters and digits, for example, **file\ 1**, enter **file\\d\***. + d. Click **Refresh** to view the displayed directories in **Directory Name**. + e. Click **Synchronize** to save the result. + + .. note:: + + - When entering regular expressions, click |image1| or |image2| to add or delete an expression. + - If the selected table or directory is incorrect, click **Clear Selected Node** to deselect it. + - The backup directory cannot contain files that have been written for a long time. Otherwise, the backup task will fail. Therefore, you are not advised to perform operations on the top-level directory, such as **/user**, **/tmp**, and **/mr-history**. + +#. Click **Verify** to check whether the backup task is configured correctly. + + The possible causes of the verification failure are as follows: + + - The target NameNode IP address is incorrect. + - The queue name is incorrect. + - The parent directory or subdirectory of the HDFS directory where data files to be backed up are stored has HDFS snapshots. + - The directory or table to be backed up does not exist. + - The name of the NameService is incorrect. + +#. Click **OK**. + +#. In the **Operation** column of the created task in the backup task list, click **More** and select **Back Up Now** to execute the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The format of the subdirectory name is *Backup task name_Data source_Task creation time*, and the subdirectory is used to save the latest data source backup files. All the backup file sets are stored in the related snapshot directories. + +.. |image1| image:: /_static/images/en-us_image_0263899520.png +.. |image2| image:: /_static/images/en-us_image_0263899283.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hive_service_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hive_service_data.rst new file mode 100644 index 0000000..6fd297f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_hive_service_data.rst @@ -0,0 +1,167 @@ +:original_name: admin_guide_000210.html + +.. _admin_guide_000210: + +Backing Up Hive Service Data +============================ + +Scenario +-------- + +To ensure Hive service data security routinely or before a major operation on Hive (such as upgrade or migration), you need to back up Hive service data. The backup data can be used to recover the system if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impacts on services.
+ +You can create a backup task on FusionInsight Manager to back up Hive service data. Both automatic and manual backup tasks are supported. + +- Hive backup and restoration cannot identify the service and structure relationships of objects such as Hive tables, indexes, and views. When executing backup and restoration tasks, you need to manage unified restoration points based on service scenarios to ensure proper service running. +- Hive backup and restoration do not support Hive on RDB data tables. You need to back up and restore original data tables in external databases independently. +- If the backup data of the standby cluster is lost in an existing Hive backup task that contains Hive on HBase tables, the next incremental backup will fail, and you need to create a Hive backup task again. However, the next full backup task will be normal. +- After the backup function of FusionInsight Manager is used to back up the HDFS directories at the Hive table level, the Hive tables cannot be deleted and recreated. + +Prerequisites +------------- + +- If data needs to be backed up to the remote HDFS, you have prepared a standby cluster for data backup. The authentication mode of the standby cluster is the same as that of the active cluster. For other backup modes, you do not need to prepare the standby cluster. +- If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. + +- Backup policies, including the backup task type, period, backup object, backup directory, and Yarn queue required by the backup task are planned based on service requirements. +- The HDFS in the standby cluster has sufficient space. You are advised to save backup files in a custom directory. +- On the HDFS client, you have executed the **hdfs lsSnapshottableDir** command as user **hdfs** to check the list of directories for which HDFS snapshots have been created in the current cluster and ensured that the HDFS parent directory or subdirectory where data files to be backed up are stored does not have HDFS snapshots. Otherwise, the backup task cannot be created. +- If you want to back up data to NAS, you have deployed the NAS server in advance. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. Click **Create**. + +#. Set **Name** to the name of the backup task. + +#. Select the cluster to be operated from **Backup Object**. + +#. Set **Mode** to the type of the backup task. + + **Periodic** indicates that the backup task is executed by the system periodically. **Manual** indicates that the backup task is executed manually. + + .. 
table:: **Table 1** Periodic backup parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=======================================================================================================================================================================================================================================================================================+ + | Started | Indicates the time when the task is started for the first time. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Period | Indicates the task execution interval. The options include **Hours** and **Days**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Backup Policy | - **Full backup at the first time and incremental backup subsequently** | + | | - **Full backup every time** | + | | - **Full backup once every n times** | + | | | + | | .. note:: | + | | | + | | - Incremental backup is not supported when Manager data and component metadata are backed up. Only **Full backup every time** is supported. | + | | - If **Path Type** is set to **NFS** or **CIFS**, incremental backup cannot be used. When incremental backup is used for NFS or CIFS backup, the latest full backup data is updated each time the incremental backup is performed. Therefore, no new recovery point is generated. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. In **Configuration**, choose **Hive** > **Hive**. + +#. Set **Path Type** of **Hive** to a backup directory type. + + The following backup directory types are supported: + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. If you select this option, set the following parameters: + + - **Destination NameService Name**: indicates the NameService name of the standby cluster. You can set it to the NameService name (**haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**) of the built-in remote cluster of the cluster, or the NameService name of a configured remote cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Target NameNode IP Address**: indicates the IP address of the NameNode service plane in the standby cluster. It can be of an active or standby node. 
+ - **Target Path**: indicates the HDFS directory for storing standby cluster backup data. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. 
+ - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Server Shared Path**: indicates the backup path on the SFTP server. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + +#. Set **Maximum Number of Recovery Points** to the number of snapshots that can be retained in the cluster. + +#. Set **Backup Content** to one or multiple Hive tables to be backed up. + + You can select backup data using either of the following methods: + + - Adding a backup data file + + Click the name of a database in the navigation tree to show all the tables in the database, and select specified tables. + + - Selecting using regular expressions + + a. Click **Query Regular Expression**. + b. Enter the database where the Hive tables are located in the first text box as prompted. The database must be the same as the existing database, for example, **default**. + c. Enter a regular expression in the second text box. Standard regular expressions are supported. For example, to get all tables in the database, enter **([\\s\\S]*?)**. To get tables whose names consist of letters and digits, for example, **tb1**, enter **tb\\d\***. + d. Click **Refresh** to view the displayed tables in **Directory Name**. + e. Click **Synchronize** to save the result. + + .. note:: + + - When entering regular expressions, click |image1| or |image2| to add or delete an expression. + - If the selected table or directory is incorrect, click **Clear Selected Node** to deselect it. + +#. Click **Verify** to check whether the backup task is configured correctly. + + The possible causes of the verification failure are as follows: + + - The target NameNode IP address is incorrect. + - The queue name is incorrect. + - The parent directory or subdirectory of the HDFS directory where data files to be backed up are stored has HDFS snapshots. + - The directory or table to be backed up does not exist. 
+ - The name of the NameService is incorrect. + +#. Click **OK**. + +#. In the **Operation** column of the created task in the backup task list, click **More** and select **Back Up Now** to execute the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The format of the subdirectory name is *Backup task name_Data source_Task creation time*, and the subdirectory is used to save latest data source backup files. All the backup file sets are stored in the related snapshot directories. + +.. |image1| image:: /_static/images/en-us_image_0263899520.png +.. |image2| image:: /_static/images/en-us_image_0263899283.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_kafka_metadata.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_kafka_metadata.rst new file mode 100644 index 0000000..74a05e2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_kafka_metadata.rst @@ -0,0 +1,133 @@ +:original_name: admin_guide_000211.html + +.. _admin_guide_000211: + +Backing Up Kafka Metadata +========================= + +Scenario +-------- + +To ensure Kafka metadata security or before a major operation on ZooKeeper (such as upgrade or migration), you need to back up Kafka metadata. The backup data can be used to recover the system if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impacts on services. + +You can create a backup task on FusionInsight Manager to back up Kafka metadata. Both automatic and manual backup tasks are supported. + +Prerequisites +------------- + +- If data needs to be backed up to the remote HDFS, you have prepared a standby cluster for data backup. The authentication mode of the standby cluster is the same as that of the active cluster. For other backup modes, you do not need to prepare the standby cluster. +- If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. +- The backup type, period, policy, and other specifications have been planned based on the service requirements and you have checked whether *Data storage path*\ **/LocalBackup/** has sufficient space on the active and standby management nodes. + +- If you want to back up data to NAS, you have deployed the NAS server in advance. +- If you want to back up data to OBS, you have connected the current cluster to OBS and have the permission to access OBS. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. Click **Create**. + +#. Set **Name** to the name of the backup task. + +#. Select the cluster to be operated from **Backup Object**. + +#. Set **Mode** to the type of the backup task. 
+ + **Periodic** indicates that the backup task is executed by the system periodically. **Manual** indicates that the backup task is executed manually. + + .. table:: **Table 1** Periodic backup parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=======================================================================================================================================================================================================================================================================================+ + | Started | Indicates the time when the task is started for the first time. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Period | Indicates the task execution interval. The options include **Hours** and **Days**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Backup Policy | - **Full backup at the first time and incremental backup subsequently** | + | | - **Full backup every time** | + | | - **Full backup once every n times** | + | | | + | | .. note:: | + | | | + | | - Incremental backup is not supported when Manager data and component metadata are backed up. Only **Full backup every time** is supported. | + | | - If **Path Type** is set to **NFS** or **CIFS**, incremental backup cannot be used. When incremental backup is used for NFS or CIFS backup, the latest full backup data is updated each time the incremental backup is performed. Therefore, no new recovery point is generated. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. In **Configuration**, select **Kafka**. + + .. note:: + + If there are multiple Kafka services, all Kafka services are backed up by default. You can click **Assign Service** to specify the services to be backed up. + +#. Set **Path Type** of **Kafka** to a backup directory type. + + The following backup directory types are supported: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node and the standby management node automatically synchronizes the backup files. By default, the backup files are stored in *Data storage path*\ **/LocalBackup/**. + + If you select this option, you need to set the maximum number of replicas to specify the number of backup file sets that can be retained in the backup directory. + + - **LocalHDFS**: indicates that the backup files are stored in the HDFS directory of the current cluster. 
+ + If you select this option, set the following parameters: + + - **Target Path**: indicates the HDFS directory for storing the backup files. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select this option, set the following parameters: + + - **Destination NameService Name**: indicates the NameService name of the standby cluster. You can set it to the NameService name (**haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**) of the built-in remote cluster of the cluster, or the NameService name of a configured remote cluster. + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Target NameNode IP Address**: indicates the IP address of the NameNode service plane in the standby cluster. It can be of an active or standby node. + - **Target Path**: indicates the HDFS directory for storing standby cluster backup data. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. 
+ - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **OBS**: indicates that backup files are stored in OBS. + + If you select this option, set the following parameters: + + - **Target Path**: indicates the OBS directory for storing backup data. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + .. note:: + + Only MRS 3.1.0 or later supports data backup to OBS. + +#. Click **OK**. + +#. In the **Operation** column of the created task in the backup task list, click **More** and select **Back Up Now** to execute the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The format of the subdirectory name is *Backup task name*\ **\_**\ *Task creation time*, and the subdirectory is used to save data source backup files. The format of the backup file name is *Version*\ **\_**\ *Data source*\ **\_**\ *Task execution time*\ **.tar.gz**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_manager_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_manager_data.rst new file mode 100644 index 0000000..ecf1f2b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_manager_data.rst @@ -0,0 +1,144 @@ +:original_name: admin_guide_000202.html + +.. _admin_guide_000202: + +Backing Up Manager Data +======================= + +Scenario +-------- + +To ensure data security of FusionInsight Manager routinely or before and after a critical operation (such as capacity expansion and reduction) on FusionInsight Manager, you need to back up FusionInsight Manager data. The backup data can be used to recover the system if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impacts on services. + +You can create a backup task on FusionInsight Manager to back up Manager data. Both automatic and manual backup tasks are supported. + +Prerequisites +------------- + +- If data needs to be backed up to the remote HDFS, you have prepared a standby cluster for data backup. The authentication mode of the standby cluster is the same as that of the active cluster. For other backup modes, you do not need to prepare the standby cluster. +- If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. 
+- The backup type, period, policy, and other specifications have been planned based on the service requirements and you have checked whether *Data storage path*\ **/LocalBackup/** has sufficient space on the active and standby management nodes. +- If you want to back up data to NAS, you have deployed the NAS server in advance. +- If you want to back up data to OBS, you have connected the current cluster to OBS and have the permission to access OBS. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. Click **Create**. + +#. Set **Name** to the name of the backup task. + +#. Set **Backup Object** to **OMS**. + +#. Set **Mode** to the type of the backup task. + + **Periodic** indicates that the backup task is executed by the system periodically. **Manual** indicates that the backup task is executed manually. + + .. table:: **Table 1** Periodic backup parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=======================================================================================================================================================================================================================================================================================+ + | Started | Indicates the time when the task is started for the first time. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Period | Indicates the task execution interval. The options include **Hours** and **Days**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Backup Policy | - **Full backup at the first time and incremental backup subsequently** | + | | - **Full backup every time** | + | | - **Full backup once every n times** | + | | | + | | .. note:: | + | | | + | | - Incremental backup is not supported when Manager data and component metadata are backed up. Only **Full backup every time** is supported. | + | | - If **Path Type** is set to **NFS** or **CIFS**, incremental backup cannot be used. When incremental backup is used for NFS or CIFS backup, the latest full backup data is updated each time the incremental backup is performed. Therefore, no new recovery point is generated. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. In **Configuration**, select **OMS**. + +#. Set **Path Type** of **OMS** to a backup directory type. 
+ + The following backup directory types are supported: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node and the standby management node automatically synchronizes the backup files. + + The default storage directory is *Data storage path*\ **/LocalBackup/**, for example, **/srv/BigData/LocalBackup**. + + If you select this option, you need to set the maximum number of replicas to specify the number of backup file sets that can be retained in the backup directory. + + - **LocalHDFS**: indicates that the backup files are stored in the HDFS directory of the current cluster. + + If you select this option, set the following parameters: + + - **Target Path**: indicates the HDFS directory for storing the backup files. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Cluster for Backup**: Enter the cluster name mapping to the backup directory. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select this option, set the following parameters: + + - **Destination NameService Name**: indicates the NameService name of the standby cluster. You can set it to the NameService name (**haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**) of the built-in remote cluster of the cluster, or the NameService name of a configured remote cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Target NameNode IP Address**: indicates the IP address of the NameNode service plane in the standby cluster. It can be of an active or standby node. + - **Target Path**: indicates the HDFS directory for storing standby cluster backup data. The storage path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory, such as **/hbase** or **/user/hbase/backup**. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **Source Cluster**: Select the cluster of the Yarn queue used by the backup data. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the source cluster. + + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) 
+ - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Server Shared Path**: indicates the backup path on the SFTP server. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **OBS**: indicates that backup files are stored in OBS. + + If you select this option, set the following parameters: + + - **Target Path**: indicates the OBS directory for storing backup data. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + .. note:: + + Only MRS 3.1.0 or later supports data backup to OBS. + +#. Click **OK**. + +#. In the **Operation** column of the created task in the backup task list, click **More** and select **Back Up Now** to execute the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The format of the subdirectory name is *Backup task name*\ **\_**\ *Task creation time*, and the subdirectory is used to save data source backup files. + + The format of the backup file name is *Version*\ **\_**\ *Data source*\ **\_**\ *Task execution time*\ **.tar.gz**. 
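The subdirectory and file naming convention above makes it straightforward to locate the newest Manager backup outside of FusionInsight Manager, for example when periodically transferring backup files off the cluster. The following is a minimal sketch, not part of the product: it assumes a hypothetical local backup root such as **/srv/BigData/LocalBackup** and a hypothetical task subdirectory name, and simply reports the newest **.tar.gz** archive by modification time.

.. code-block:: python

   import glob
   import os

   # Example values only: the backup root depends on the configured data storage
   # path, and the subdirectory follows the "Backup task name_Task creation time" pattern.
   BACKUP_ROOT = "/srv/BigData/LocalBackup"
   TASK_SUBDIR = "default-oms_20240101010101"

   def latest_backup_archive(root, subdir):
       """Return the newest Version_DataSource_Time.tar.gz archive in a task subdirectory."""
       archives = glob.glob(os.path.join(root, subdir, "*.tar.gz"))
       # Sorting by modification time avoids assuming a specific timestamp format
       # inside the file name.
       return max(archives, key=os.path.getmtime) if archives else None

   print(latest_backup_archive(BACKUP_ROOT, TASK_SUBDIR))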
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_namenode_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_namenode_data.rst new file mode 100644 index 0000000..7dd45ab --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/backing_up_namenode_data.rst @@ -0,0 +1,132 @@ +:original_name: admin_guide_000208.html + +.. _admin_guide_000208: + +Backing Up NameNode Data +======================== + +Scenario +-------- + +To ensure NameNode service data security routinely or before a major operation on NameNode (such as upgrade or migration), you need to back up NameNode data. The backup data can be used to recover the system if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impacts on services. + +You can create a backup task on FusionInsight Manager to back up NameNode data. Both automatic and manual backup tasks are supported. + +Prerequisites +------------- + +- If data needs to be backed up to the remote HDFS, you have prepared a standby cluster for data backup. The authentication mode of the standby cluster is the same as that of the active cluster. For other backup modes, you do not need to prepare the standby cluster. +- If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. +- The backup type, period, policy, and other specifications have been planned based on the service requirements and you have checked whether *Data storage path*\ **/LocalBackup/** has sufficient space on the active and standby management nodes. + +- If you want to back up data to NAS, you have deployed the NAS server in advance. +- If you want to back up data to OBS, you have connected the current cluster to OBS and have the permission to access OBS. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. Click **Create**. + +#. Set **Name** to the name of the backup task. + +#. Select the cluster to be operated from **Backup Object**. + +#. Set **Mode** to the type of the backup task. + + **Periodic** indicates that the backup task is executed by the system periodically. **Manual** indicates that the backup task is executed manually. + + .. 
table:: **Table 1** Periodic backup parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=======================================================================================================================================================================================================================================================================================+ + | Started | Indicates the time when the task is started for the first time. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Period | Indicates the task execution interval. The options include **Hours** and **Days**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Backup Policy | Only **Full backup every time** is supported. | + | | | + | | .. note:: | + | | | + | | - Incremental backup is not supported when Manager data and component metadata are backed up. Only **Full backup every time** is supported. | + | | - If **Path Type** is set to **NFS** or **CIFS**, incremental backup cannot be used. When incremental backup is used for NFS or CIFS backup, the latest full backup data is updated each time the incremental backup is performed. Therefore, no new recovery point is generated. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. In **Configuration**, select **NameNode**. + +#. Set **Path Type** of **NameNode** to a backup directory type. + + The following backup directory types are supported: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node and the standby management node automatically synchronizes the backup files. By default, the backup files are stored in *Data storage path*\ **/LocalBackup/**. + + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. If you select this option, set the following parameters: + + - **Destination NameService Name**: indicates the NameService name of the standby cluster. 
You can set it to the NameService name (**haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**) of the built-in remote cluster of the cluster, or the NameService name of a configured remote cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Target NameNode IP Address**: indicates the service plane IP address of the NameNode in the standby cluster. + - **Target Path**: indicates the path for storing backup files. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Server Shared Path**: indicates the configured shared directory of the NAS server. (The shared path of the server cannot be set to the root directory, and the user group and owner group of the shared path must be **nobody:nobody**.) + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select this option, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. 
+ - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Server Shared Path**: indicates the backup path on the SFTP server. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **OBS**: indicates that backup files are stored in OBS. + + If you select this option, set the following parameters: + + - **Target Path**: indicates the OBS directory for storing backup data. + - **Maximum Number of Backup Copies**: indicates the number of backup file sets that can be retained in the backup directory. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + .. note:: + + Only MRS 3.1.0 or later supports data backup to OBS. + +#. Click **OK**. + +#. In the **Operation** column of the created task in the backup task list, click **More** and select **Back Up Now** to execute the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The format of the subdirectory name is *Backup task name*\ **\_**\ *Task creation time*, and the subdirectory is used to save data source backup files. + + The format of the backup file name is *Version*\ **\_**\ *Data source*\ **\_**\ *Task execution time*\ **.tar.gz**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/index.rst new file mode 100644 index 0000000..b13b799 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/backing_up_data/index.rst @@ -0,0 +1,32 @@ +:original_name: admin_guide_000201.html + +.. _admin_guide_000201: + +Backing Up Data +=============== + +- :ref:`Backing Up Manager Data ` +- :ref:`Backing Up ClickHouse Metadata ` +- :ref:`Backing Up ClickHouse Service Data ` +- :ref:`Backing Up DBService Data ` +- :ref:`Backing Up HBase Metadata ` +- :ref:`Backing Up HBase Service Data ` +- :ref:`Backing Up NameNode Data ` +- :ref:`Backing Up HDFS Service Data ` +- :ref:`Backing Up Hive Service Data ` +- :ref:`Backing Up Kafka Metadata ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + backing_up_manager_data + backing_up_clickhouse_metadata + backing_up_clickhouse_service_data + backing_up_dbservice_data + backing_up_hbase_metadata + backing_up_hbase_service_data + backing_up_namenode_data + backing_up_hdfs_service_data + backing_up_hive_service_data + backing_up_kafka_metadata diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/enabling_cross-cluster_replication.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/enabling_cross-cluster_replication.rst new file mode 100644 index 0000000..38f1062 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/enabling_cross-cluster_replication.rst @@ -0,0 +1,53 @@ +:original_name: admin_guide_000200.html + +.. _admin_guide_000200: + +Enabling Cross-Cluster Replication +================================== + +Scenario +-------- + +DistCp is used to replicate data stored in HDFS from one cluster to another. DistCp depends on the cross-cluster replication function, which is disabled by default. You need to enable it for both clusters. + +This section describes how to modify parameters on FusionInsight Manager to enable the cross-cluster replication function. After this function is enabled, you can create a backup task for backing up data to the remote HDFS (RemoteHDFS). + +Impact on the System +-------------------- + +Yarn needs to be restarted to enable the cross-cluster replication function and cannot be accessed during the restart. + +Prerequisites +------------- + +- The **hadoop.rpc.protection** parameter of HDFS in the two clusters for data replication must use the same data transmission mode. The default value is **privacy**, indicating encrypted transmission. The value **authentication** indicates that transmission is not encrypted. + +- For clusters in security mode, you need to configure mutual trust between clusters. + +Procedure +--------- + +#. Log in to FusionInsight Manager of one of the two clusters. + +#. .. _admin_guide_000200__en-us_topic_0046736761_li45131484: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn** > **Configurations**, and click **All Configurations**. + +#. In the navigation pane, choose **Yarn** > **Distcp**. + +#. Modify **dfs.namenode.rpc-address**, set **haclusterX.remotenn1** to the service IP address and RPC port of one NameNode instance of the peer cluster, and set **haclusterX.remotenn2** to the service IP address and RPC port of the other NameNode instance of the peer cluster. + + **haclusterX.remotenn1** and **haclusterX.remotenn2** do not distinguish between active and standby NameNodes. The default NameNode RPC port is 8020 and cannot be modified on Manager. + + Examples of modified parameter values: **10.1.1.1:8020** and **10.1.1.2:8020**. + + .. note:: + + - If data of the current cluster needs to be backed up to the HDFS of multiple clusters, you can configure the corresponding NameNode RPC addresses for haclusterX1, haclusterX2, haclusterX3, and haclusterX4. + +#. Click **Save**. In the confirmation dialog box, click **OK**. + +#. .. _admin_guide_000200__en-us_topic_0046736761_li8920825: + + Restart the Yarn service. + +#. Log in to FusionInsight Manager of the other cluster and repeat :ref:`2 ` to :ref:`6 `.
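Because the procedure above points **haclusterX.remotenn1** and **haclusterX.remotenn2** at the NameNode RPC addresses of the peer cluster, a quick reachability check of those addresses can save a failed backup run later. The following is a minimal sketch under that assumption and is not part of the product; **10.1.1.1:8020** and **10.1.1.2:8020** are the example values used above and must be replaced with the actual service IP addresses and RPC port.

.. code-block:: python

   import socket

   # Example values for haclusterX.remotenn1 and haclusterX.remotenn2; replace
   # them with the peer cluster's NameNode service IP addresses and RPC port.
   REMOTE_NAMENODES = [("10.1.1.1", 8020), ("10.1.1.2", 8020)]

   def is_reachable(host, port, timeout=5):
       """Return True if a TCP connection to host:port can be established."""
       try:
           with socket.create_connection((host, port), timeout=timeout):
               return True
       except OSError:
           return False

   for host, port in REMOTE_NAMENODES:
       state = "reachable" if is_reachable(host, port) else "NOT reachable"
       print(f"{host}:{port} is {state}")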
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/how_do_i_configure_the_environment_when_i_create_a_clickhouse_backup_task_on_fusioninsight_manager_and_set_the_path_type_to_remotehdfs.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/how_do_i_configure_the_environment_when_i_create_a_clickhouse_backup_task_on_fusioninsight_manager_and_set_the_path_type_to_remotehdfs.rst new file mode 100644 index 0000000..df9d767 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/how_do_i_configure_the_environment_when_i_create_a_clickhouse_backup_task_on_fusioninsight_manager_and_set_the_path_type_to_remotehdfs.rst @@ -0,0 +1,36 @@ +:original_name: admin_guide_000357.html + +.. _admin_guide_000357: + +How Do I Configure the Environment When I Create a ClickHouse Backup Task on FusionInsight Manager and Set the Path Type to RemoteHDFS? +======================================================================================================================================= + +.. note:: + + This section applies only to MRS 3.1.0. + +Question +-------- + +How do I configure the environment when I create a ClickHouse backup task on FusionInsight Manager and set the path type to RemoteHDFS? + +Answer +------ + +#. Log in to FusionInsight Manager of the standby cluster. + +#. Choose **Cluster** > **Services** > **HDFS** and choose **More** > **Download Client**. Set **Select Client Type** to **Configuration Files Only**, select **x86_64** for x86 or **aarch64** for ARM based on the type of the node where the client is to be installed, and click **OK**. + +#. After the client file package is generated, download the client to the local PC as prompted and decompress the package. + + For example, if the client file package is **FusionInsight_Cluster_1_HDFS_Client.tar**, decompress it to obtain **FusionInsight_Cluster_1_HDFS_ClientConfig_ConfigFiles.tar**, and then decompress **FusionInsight_Cluster_1_HDFS_ClientConfig_ConfigFiles.tar** to the **D:\\FusionInsight_Cluster_1_HDFS_ClientConfig_ConfigFiles** directory on the local PC. The directory name cannot contain spaces. + +#. .. _admin_guide_000357__li73411714152720: + + Go to the **FusionInsight_Cluster_1_HDFS_ClientConfig_ConfigFiles\\** client directory and obtain the **hosts** file. + +#. Log in to FusionInsight Manager of the source cluster. + +#. Choose **Cluster** > **Services** > **ClickHouse**, click **Instance**, and view the instance IP address of **ClickHouseServer**. + +#. Log in to the host nodes of the ClickHouseServer instances as user **root** and check whether the **/etc/hosts** file contains the host information in :ref:`4 `. If not, add the host information in :ref:`4 ` to the **/etc/hosts** file. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/index.rst new file mode 100644 index 0000000..8c9aa13 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/index.rst @@ -0,0 +1,28 @@ +:original_name: admin_guide_000198.html + +.. 
_admin_guide_000198: + +Backup and Recovery Management +============================== + +- :ref:`Introduction ` +- :ref:`Backing Up Data ` +- :ref:`Recovering Data ` +- :ref:`Enabling Cross-Cluster Replication ` +- :ref:`Managing Local Quick Restoration Tasks ` +- :ref:`Modifying a Backup Task ` +- :ref:`Viewing Backup and Restoration Tasks ` +- :ref:`How Do I Configure the Environment When I Create a ClickHouse Backup Task on FusionInsight Manager and Set the Path Type to RemoteHDFS? ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + introduction + backing_up_data/index + recovering_data/index + enabling_cross-cluster_replication + managing_local_quick_restoration_tasks + modifying_a_backup_task + viewing_backup_and_restoration_tasks + how_do_i_configure_the_environment_when_i_create_a_clickhouse_backup_task_on_fusioninsight_manager_and_set_the_path_type_to_remotehdfs diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/introduction.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/introduction.rst new file mode 100644 index 0000000..26b29f0 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/introduction.rst @@ -0,0 +1,222 @@ +:original_name: admin_guide_000399.html + +.. _admin_guide_000399: + +Introduction +============ + +Overview +-------- + +FusionInsight Manager provides the backup and restoration of system data and user data by component. The system can back up Manager data, component metadata, and service data. + +Data can be backed up to local disks (LocalDir), local HDFS (LocalHDFS), remote HDFS (RemoteHDFS), NAS (NFS/CIFS), Object Storage Service (OBS), and SFTP server (SFTP). For details, see :ref:`Backing Up Data `. + +For a component that supports multiple services, multiple instances of a service can be backed up and restored. The backup and restoration operations are consistent with those of a service instance. + +.. note:: + + Only MRS 3.1.0 or later supports data backup to OBS. + +Backup and restoration tasks are performed in the following scenarios: + +- Routine backup is performed to ensure the data security of the system and components. +- If the system is faulty, the data backup can be used to recover the system. +- If the active cluster is completely faulty, a mirrored cluster identical to the active cluster needs to be created. You can use the backup data to restore the active cluster. + +.. table:: **Table 1** Manager configuration data to be backed up + + +-----------------------+---------------------------------------------------------------------------------------------------------+-----------------------+ + | Backup Type | Backup Content | Backup Directory Type | + +=======================+=========================================================================================================+=======================+ + | OMS | Database data (excluding alarm data) and configuration data in the cluster management system by default | - LocalDir | + | | | - LocalHDFS | + | | | - RemoteHDFS | + | | | - NFS | + | | | - CIFS | + | | | - SFTP | + | | | - OBS | + +-----------------------+---------------------------------------------------------------------------------------------------------+-----------------------+ + +.. 
table:: **Table 2** Component metadata or other data to be backed up + + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Backup Type | Backup Content | Backup Directory Type | + +=======================+=====================================================================================================================================================================================================================+=======================+ + | DBService | Metadata of the components (including Loader, Hive, Spark, Oozie, and Hue) managed by DBService. For a cluster with multiple services installed, back up the metadata of multiple Hive and Spark service instances. | - LocalDir | + | | | - LocalHDFS | + | | | - RemoteHDFS | + | | | - NFS | + | | | - CIFS | + | | | - SFTP | + | | | - OBS | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Kafka | Kafka metadata. | - LocalDir | + | | | - LocalHDFS | + | | | - RemoteHDFS | + | | | - NFS | + | | | - CIFS | + | | | - OBS | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | NameNode | HDFS metadata. After multiple NameServices are added, backup and restoration are supported for all of them and the operations are consistent with those of the default hacluster instance. | - LocalDir | + | | | - RemoteHDFS | + | | | - NFS | + | | | - CIFS | + | | | - SFTP | + | | | - OBS | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Yarn | Information about the Yarn service resource pool. | | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | HBase | **tableinfo** files and data files of HBase system tables. | | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + +.. 
table:: **Table 3** Service data of specific components to be backed up + + +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Backup Type | Backup Content | Backup Directory Type | + +=======================+==========================================================================================================================================================================================================================================================+=======================+ + | HBase | Table-level user data. For a cluster with multiple services installed, backup and restoration are supported for multiple HBase service instances and the backup and restoration operations are consistent with those of a single HBase service instance. | - RemoteHDFS | + | | | - NFS | + | | | - CIFS | + | | | - SFTP | + +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | HDFS | Directories or files of user services. | | + | | | | + | | .. note:: | | + | | | | + | | Encrypted directories cannot be backed up or restored. | | + +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Hive | Table-level user data. For a cluster with multiple services installed, backup and restoration are supported for multiple Hive service instances and the backup and restoration operations are consistent with those of a single Hive service instance. | | + +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + +Note that some components do not provide data backup or restoration: + +- Kafka supports replicas and allows multiple replicas to be specified when a topic is created. +- MapReduce and Yarn data is stored in HDFS. Therefore, they rely on the backup and restoration provided by HDFS. +- Backup and restoration of service data in ZooKeeper are performed by their own upper-layer components. + +Principles +---------- + +**Task** + +Before backup or restoration, you need to create a backup or restoration task and set task parameters, such as the task name, backup data source, and type of the directory for storing backup files. Then you can execute the tasks to back up or restore data. When Manager is used to restore the data of HDFS, HBase, Hive, and NameNode, the cluster cannot be accessed. + +Each backup task can back up data of different data sources and generate an independent backup file for each data source. All the backup files generated in a backup task form a backup file set, which can be used in restoration tasks. Backup data can be stored on Linux local disks, local cluster HDFS, and standby cluster HDFS. 
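For backup data kept on Linux local disks, it is worth confirming that the local backup partition has enough free space before tasks are scheduled. The following is a minimal sketch, not part of the product: the path **/srv/BigData/LocalBackup** and the 20 GB threshold are example values that depend on the configured data storage path and the planned backup volume.

.. code-block:: python

   import shutil

   # Example values: the backup path is "Data storage path/LocalBackup/" on the
   # management node, and the required margin depends on the planned backups.
   BACKUP_PATH = "/srv/BigData/LocalBackup"
   REQUIRED_FREE_GB = 20

   free_gb = shutil.disk_usage(BACKUP_PATH).free / 1024 ** 3
   if free_gb < REQUIRED_FREE_GB:
       print(f"Only {free_gb:.1f} GB free under {BACKUP_PATH}; backup tasks may fail to start.")
   else:
       print(f"{free_gb:.1f} GB free under {BACKUP_PATH}.")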
+ +Backup tasks support full backup and incremental backup policies. Cloud data backup tasks do not support incremental backup. If the backup directory type is NFS or CIFS, incremental backup is not recommended. When incremental backup is used for NFS or CIFS backup, the latest full backup data is updated each time the incremental backup is performed. Therefore, no new recovery point is generated. + +.. note:: + + Task execution rules: + + - If a task is being executed, the task cannot be executed repeatedly and other tasks cannot be started at the same time. + - The interval at which a periodic task is automatically executed must be greater than 120s. Otherwise, the task is postponed and will be executed in the next period. Manual tasks can be executed at any interval. + - When a periodic task is to be automatically executed, the current time cannot be 120s later than the task start time. Otherwise, the task is postponed and executed in the next period. + - When a periodic task is locked, it cannot be automatically executed and needs to be manually unlocked. + - Before an OMS, DBService, Kafka, or NameNode backup task starts, ensure that the LocalBackup partition on the active management node has not less than 20 GB of available space. Otherwise, the backup task cannot be started. + +When planning backup and restoration tasks, select the data to be backed up or restored strictly based on the service logic, data store structure, and database or table association. By default, the system creates periodic backup tasks **default-oms** and **default-cluster ID** at an interval of one hour. OMS metadata and cluster metadata, such as DBService and NameNode, can be fully backed up to local disks. + +**Snapshot** + +The system uses the snapshot technology to quickly back up data. Snapshots include HBase and HDFS snapshots. + +- HBase snapshots + + An HBase snapshot is a backup file of HBase tables at a specified time point. This backup file does not replicate service data or affect the RegionServer. The HBase snapshot replicates table metadata, including table descriptor, region info, and HFile reference information. The metadata can be used to restore data before the snapshot creation time. + +- HDFS snapshots + + An HDFS snapshot is a read-only backup of HDFS at a specified time point. The snapshot is used in data backup, misoperation protection, and disaster recovery scenarios. + + The snapshot function can be enabled for any HDFS directory to create the related snapshot file. Before creating a snapshot for a directory, the system automatically enables the snapshot function for the directory. Creating a snapshot does not affect any HDFS operation. A maximum of 65,536 snapshots can be created for each HDFS directory. + + When a snapshot is being created for an HDFS directory, the directory cannot be deleted or modified before the snapshot is created. Snapshots cannot be created for the upper-layer directories or subdirectories of the directory. + +**DistCp** + +Distributed copy (DistCp) is a tool used to replicate a large amount of data in HDFS in a cluster or between the HDFSs of different clusters. In a backup or restoration task of HBase, HDFS, or Hive, if you back up the data to HDFS of the standby cluster, the system invokes DistCp to perform the operation. Install the MRS software of the same version for the active and standby clusters and install the cluster. + +DistCp uses MapReduce to implement data distribution, troubleshooting, restoration, and report. 
DistCp specifies different Map jobs for various source files and directories in the specified list. Each Map job copies the data in the partition that corresponds to the specified file in the list. + +If you use DistCp to replicate data between HDFSs of two clusters, configure the cross-cluster mutual trust (mutual trust does not need to be configured for clusters managed by the same FusionInsight Manager) and cross-cluster replication for both clusters. When backing up the cluster data to HDFS in another cluster, you need to install the Yarn component. Otherwise, the backup fails. + +**Local rapid restoration** + +After using DistCp to back up the HBase, HDFS, and Hive data of the local cluster to the HDFS of the standby cluster, the HDFS of the local cluster retains the backup data snapshots. You can create local rapid restoration tasks to restore data by using the snapshot files in the HDFS of the local cluster. + +**NAS** + +Network Attached Storage (NAS) is a dedicated data storage server which includes the storage components and embedded system software. It provides the cross-platform file sharing function. By using NFS (supporting NFSv3 and NFSv4) and CIFS (supporting SMBv2 and SMBv3), you can connect the service plane of MRS to the NAS server to back up data to the NAS or restore data from the NAS. + +.. note:: + + - Before data is backed up to the NAS, the system automatically mounts the NAS shared address to a local partition of the backup task execution node. After the backup is complete, the system unmounts the NAS shared partition from the backup task execution node. + - To prevent backup and restoration failures, do not access the shared address where the NAS server has been mounted to, for example, **/srv/BigData/LocalBackup/nas**, during data backup and restoration. + - When service data is backed up to the NAS, DistCp is used. + +Specifications +-------------- + +.. table:: **Table 4** Specifications of the backup and restoration feature + + ======================================================= ============= + Item Specification + ======================================================= ============= + Maximum number of backup or restoration tasks 100 + Number of concurrent tasks in a cluster 1 + Maximum number of waiting tasks 199 + Maximum size (GB) of backup files on a Linux local disk 600 + ======================================================= ============= + +.. note:: + + If service data is stored in the ZooKeeper upper-layer components, ensure that the number of znodes in a single backup or restoration task is not too large. Otherwise, the task will fail, and the ZooKeeper service performance will be affected. To check the number of znodes in a single backup or restoration task, perform the following operations: + + - Ensure that the number of znodes in a single backup or restoration task is smaller than the upper limit of OS file handles. Specifically: + + #. To check the upper limit at the system level, run the **cat /proc/sys/fs/file-max** command. + #. To check the upper limit at the user level, run the **ulimit -n** command. + + - If the number of znodes in the parent directory exceeds the upper limit, back up and restore data in its sub-directories in batches. To check the number of znodes using ZooKeeper client scripts, perform the following operations: + + #. On FusionInsight Manager, choose **Cluster**, click the name of the desired cluster, choose **Services** > **ZooKeeper** > **Instance**, and view the management IP address of each ZooKeeper role. 
+ + #. Log in to the node where the client is located and run the following command: + + **zkCli.sh -server** *ip*\ **:**\ *port*, where *ip* can be any management IP address, and the default port number is 2181. + + #. If the following information is displayed, the login to the ZooKeeper server is successful: + + .. code-block:: + + WatchedEvent state:SyncConnected type:None path:null + [zk: ip:port(CONNECTED) 0] + + #. Run the **getusage** command to check the number of znodes in the directory to be backed up. + + For example, **getusage /hbase/region**. In the command output, **Node count=xxxxxx** indicates the number of znodes stored in the **region** directory. + +.. table:: **Table 5** Specifications of the default task + + +---------------------------------+-----------------------------------------------------------------------------------+---------+--------+-----------+------------------------------+ + | Item | OMS | HBase | Kafka | DBService | NameNode | + +=================================+===================================================================================+=========+========+===========+==============================+ + | Backup period | 1 hour | | | | | + +---------------------------------+-----------------------------------------------------------------------------------+---------+--------+-----------+------------------------------+ + | Maximum number of backups | 168 (7-day historical data) | | | | 24 (one-day historical data) | + +---------------------------------+-----------------------------------------------------------------------------------+---------+--------+-----------+------------------------------+ + | Maximum size of a backup file | 10 MB | 10 MB | 512 MB | 100 MB | 20 GB | + +---------------------------------+-----------------------------------------------------------------------------------+---------+--------+-----------+------------------------------+ + | Maximum size of disk space used | 1.64 GB | 1.64 GB | 84 GB | 16.41 GB | 480 GB | + +---------------------------------+-----------------------------------------------------------------------------------+---------+--------+-----------+------------------------------+ + | Storage path of backup data | *Data storage path*\ **/LocalBackup/** of the active and standby management nodes | | | | | + +---------------------------------+-----------------------------------------------------------------------------------+---------+--------+-----------+------------------------------+ + +.. note:: + + - The backup data of the default backup task must be periodically transferred and saved outside the cluster based on the enterprise O&M requirements. + - Administrators can create DistCp backup tasks to save OMS, DBService, and NameNode data to external clusters. + - The execution time of a cluster data backup task can be calculated using the following formula: Task execution time = Volume of data to be backed up/Network bandwidth between the cluster and the backup device. In practice, you are advised to multiply the calculated time by 1.5 to get the reference value of the task execution time. A worked example is provided after this note. + - Executing a data backup task affects the maximum I/O performance of the cluster. Therefore, you are advised to execute a backup task during off-peak hours.
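+
+As a worked example of the formula above (the figures are illustrative assumptions only): backing up about 1 TB of service data over a link that sustains roughly 1 GB/s between the cluster and the backup device takes about 1,024 seconds, or roughly 17 minutes; multiplying by 1.5 gives a reference task execution time of approximately 26 minutes.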
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/managing_local_quick_restoration_tasks.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/managing_local_quick_restoration_tasks.rst new file mode 100644 index 0000000..170eb99 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/managing_local_quick_restoration_tasks.rst @@ -0,0 +1,49 @@ +:original_name: admin_guide_000229.html + +.. _admin_guide_000229: + +Managing Local Quick Restoration Tasks +====================================== + +Scenario +-------- + +When DistCp is used to back up data, the backup snapshot is saved to HDFS of the active cluster. FusionInsight Manager supports using the local snapshot for quick data restoration, requiring less time than restoring data from the standby cluster. + +Use FusionInsight Manager and the snapshots on HDFS of the active cluster to create a local quick restoration task and execute the task. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. In the backup task list, locate a created task and click **Restore** in the **Operation** column. + +#. Check whether the system displays "No data is available for quick restoration. Create a task on the restoration management page to restore data". + + - If yes, click **OK** to close the dialog box. No backup data snapshot is created in the active cluster, and no further action is required. + - If no, go to :ref:`Step 4 ` to create a local quick restoration task. + + .. note:: + + Metadata does not support quick restoration. + +#. .. _admin_guide_000229__en-us_topic_0046736777_q_rec_set: + + Set **Name** to the name of the local quick restoration task. + +#. Set **Configuration** to a data source. + +#. Set **Recovery Point List** to a recovery point that contains the backup data. + +#. Set **Queue Name** to the name of the Yarn queue used in the task execution. The name must be the same as the name of the queue that is running properly in the cluster. + +#. Set **Data Configuration** to the object to be recovered. + +#. Click **Verify**, and wait for the system to display "The restoration task configuration is verified successfully." + +#. Click **OK**. + +#. In the restoration task list, locate a created task and click **Start** in the **Operation** column to execute the restoration task. + + After the task is complete, **Task Status** of the task is displayed as **Successful**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/modifying_a_backup_task.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/modifying_a_backup_task.rst new file mode 100644 index 0000000..dc42d4d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/modifying_a_backup_task.rst @@ -0,0 +1,47 @@ +:original_name: admin_guide_000230.html + +.. _admin_guide_000230: + +Modifying a Backup Task +======================= + +Scenario +-------- + +This section describes how to modify the parameters of a created backup task on FusionInsight Manager to meet changing service requirements. The parameters of restoration tasks can only be viewed but cannot be modified. 
+ +Impact on the System +-------------------- + +After a backup task is modified, the new parameters take effect when the task is executed next time. + +Prerequisites +------------- + +- A backup task has been created. +- A new backup task policy has been planned based on the actual situation. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. In the task list, locate a specified task, click **Configure** in the **Operation** column to go to the configuration modification page. + + On the displayed page, modify the following parameters: + + - Started + - Period + - Destination NameService Name + - Target NameNode IP Address + - Target Path + - Max Number of Backup Copies + - Maximum Number of Recovery Points + - Maximum Number of Maps + - Maximum Bandwidth of a Map + + .. note:: + + After the **Target Path** parameter of a backup task is modified, this task will be performed as a full backup task for the first time by default. + +#. Click **OK** to save the settings. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/index.rst new file mode 100644 index 0000000..a83fb75 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/index.rst @@ -0,0 +1,32 @@ +:original_name: admin_guide_000215.html + +.. _admin_guide_000215: + +Recovering Data +=============== + +- :ref:`Restoring Manager Data ` +- :ref:`Restoring ClickHouse Metadata ` +- :ref:`Restoring ClickHouse Service Data ` +- :ref:`Restoring DBService data ` +- :ref:`Restoring HBase Metadata ` +- :ref:`Restoring HBase Service Data ` +- :ref:`Restoring NameNode Data ` +- :ref:`Restoring HDFS Service Data ` +- :ref:`Restoring Hive Service Data ` +- :ref:`Restoring Kafka Metadata ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + restoring_manager_data + restoring_clickhouse_metadata + restoring_clickhouse_service_data + restoring_dbservice_data + restoring_hbase_metadata + restoring_hbase_service_data + restoring_namenode_data + restoring_hdfs_service_data + restoring_hive_service_data + restoring_kafka_metadata diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_clickhouse_metadata.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_clickhouse_metadata.rst new file mode 100644 index 0000000..a752369 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_clickhouse_metadata.rst @@ -0,0 +1,87 @@ +:original_name: admin_guide_000358.html + +.. _admin_guide_000358: + +Restoring ClickHouse Metadata +============================= + +Scenario +-------- + +ClickHouse metadata needs to be restored in the following scenarios: Data is modified or deleted unexpectedly and needs to be restored. After a user performs major operations (such as upgrade and migration) on ZooKeeper, an exception occurs or the expected result is not achieved. The ClickHouse component is faulty and becomes unavailable. Data is migrated to a new cluster. + +Users can create a ClickHouse restoration task on FusionInsight Manager. Only manual restoration tasks are supported. + +.. 
important:: + + - This function is supported only by MRS 3.1.0 or later. + - Data restoration can be performed only when the system version is consistent with that during data backup. + - To restore ClickHouse metadata when the service is running properly, you are advised to manually back up the latest ClickHouse metadata before restoration. Otherwise, the ClickHouse metadata that is generated after the data backup and before the data restoration will be lost. + - ClickHouse metadata restoration and service data restoration cannot be performed at the same time. Otherwise, service data restoration fails. You are advised to restore service data after metadata restoration is complete. + +Impact on the System +-------------------- + +- After the metadata is restored, the data generated after the data backup and before the data restoration is lost. +- After the metadata is restored, the ClickHouse upper-layer applications need to be started. + +Prerequisites +------------- + +- You have checked the path for storing ClickHouse metadata backup files. +- If you need to restore data from a remote HDFS, prepare a standby cluster. If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust must be configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. In the **Operation** column of the specified task in the task list, choose **More** > **View History**. + + In the window that is displayed, select a success record and click **View** in the **Backup Path** column to view its backup path information and find the following information: + + - **Backup Object**: indicates the backup data source. + + - **Backup Path**: indicates the full path where backup files are stored. + + Select the correct path, and manually copy the full path of backup files in **Backup Path**. + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Restoration Management**. + +#. Click **Create**. + +#. Set **Task Name** to the name of the restoration task. + +#. Select the cluster to be operated from **Recovery Object**. + +#. In **Restoration Configuration**, select **ClickHouse** under **Metadata and other data**. + +#. Set **Path Type** of **ClickHouse** to a restoration directory type. + + The configurations vary based on backup directory types: + + - **LocalDir**: indicates that data is restored from the local disk of the active management node. + + If you select this value, you also need to configure the following parameters: + + - **Source Path**: backup file to be restored, for example, *Backup task name*\ **\_**\ *Data source*\ **\_**\ *Task execution time*\ **.tar.gz**. + - **Logical Cluster**: Enter the ClickHouse logical cluster whose data has been backed up. + + - **RemoteHDFS**: indicates that data is restored from the HDFS directory of the standby cluster. + + If you select this value option, you also need to configure the following parameters: + + - **Source NameService Name**: indicates the NameService name of the backup data cluster, for example, **hacluster**. You can obtain it from the **NameService Management** page of HDFS of the standby cluster. + - **IP Mode**: indicates the mode of the target IP address. 
The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Source NameNode IP Address**: indicates the IP address of the NameNode service plane in the standby cluster. It can be of an active or standby node. + - **Source Path**: indicates the full path of HDFS directory for storing backup data of the standby cluster, for example, *Backup path/Backup task name_Data source_Task creation time/Data source_Task execution time*\ **.tar.gz**. + +#. Click **OK**. + +#. In the restoration task list, locate the row where the created task is located, and click **Start** in the **Operation** column. In the displayed dialog box, click **OK** to start the restoration task. + + - After the restoration is successful, the progress bar is in green. + - After the restoration is successful, the restoration task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and click **Retry** to execute the task again. + +#. Choose **Cluster** > **Services** and start the ClickHouse service. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_clickhouse_service_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_clickhouse_service_data.rst new file mode 100644 index 0000000..b6d83eb --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_clickhouse_service_data.rst @@ -0,0 +1,86 @@ +:original_name: admin_guide_000359.html + +.. _admin_guide_000359: + +Restoring ClickHouse Service Data +================================= + +Scenario +-------- + +ClickHouse data needs to be restored in the following scenarios: Data is modified or deleted unexpectedly and needs to be restored. After a user performs major operations (such as upgrade and migration) on ClickHouse, an exception occurs or the expected result is not achieved. All modules are faulty and become unavailable. Data is migrated to a new cluster. + +Users can create a ClickHouse restoration task on FusionInsight Manager to restore data. Only manual restoration tasks are supported. + +The ClickHouse backup and restoration functions cannot identify the service and structure relationships of objects such as ClickHouse tables, indexes, and views. When executing backup and restoration tasks, you need to manage unified restoration points based on service scenarios to ensure proper service running. + +.. important:: + + - This function is supported only by MRS 3.1.0 or later. + - Data restoration can be performed only when the system version is consistent with that during data backup. + - To restore the data when services are normal, manually back up the latest management data first and then restore the data. Otherwise, the ClickHouse data that is generated after the data backup and before the data restoration will be lost. + - ClickHouse metadata restoration and service data restoration cannot be performed at the same time. Otherwise, service data restoration fails. You are advised to restore service data after metadata restoration is complete. + +Impact on the System +-------------------- + +- During data restoration, user authentication stops and users cannot create new connections. +- After the data is restored, the data generated after the data backup and before the data restoration is lost. 
+- After the data is restored, the ClickHouse upper-layer applications need to be started. + +Prerequisites +------------- + +- If you need to restore data from a remote HDFS, prepare a standby cluster. If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust must be configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. + +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. +- The database for storing restored data tables, the HDFS save path of data tables, and the list of users who can access restored data are planned. +- The ClickHouse backup file save path is correct. +- The ClickHouse upper-layer applications are stopped. +- You have logged in to FusionInsight Manager. For details, see :ref:`Logging In to FusionInsight Manager `. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. In the row where the specified backup task is located, choose **More** > **View History** in the **Operation** column to display the historical execution records of the backup task. + + In the window that is displayed, select a success record and click **View** in the **Backup Path** column to view its backup path information and find the following information: + + - **Backup Object**: indicates the backup data source. + + - **Backup Path**: indicates the full path where backup files are stored. + + Select the correct path, and manually copy the full path of backup files in **Backup Path**. + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Restoration Management**. + +#. Click **Create**. + +#. Set **Task Name** to the name of the restoration task. + +#. Select the cluster to be operated from **Recovery Object**. + +#. In **Restoration Configuration**, select **ClickHouse** under **Service data**. + +#. Set **Path Type** of **ClickHouse** to a backup directory type. + + Currently, the backup directory supports only the **RemoteHDFS** type. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. If you select this value option, you also need to configure the following parameters: + + - **Source NameService Name**: indicates the NameService name of the backup data cluster, for example, **hacluster**. You can obtain it from the **NameService Management** page of HDFS of the standby cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Source NameNode IP Address**: indicates the IP address of the NameNode service plane in the standby cluster. It can be of an active or standby node. + - Source Path: indicates the full path of HDFS directory for storing backup data of the standby cluster. For details, see the **Backup Path** obtained in step 2, for example, *Backup path/Backup task name_Data source_Task creation time*. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + +#. Click **OK**. + +#. 
In the restoration task list, locate the row where the created task is located, and click **Start** in the **Operation** column. + + - After the restoration is successful, the progress bar is in green. + - After the restoration is successful, the restoration task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and click **Retry** to execute the task again. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_dbservice_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_dbservice_data.rst new file mode 100644 index 0000000..457cab2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_dbservice_data.rst @@ -0,0 +1,139 @@ +:original_name: admin_guide_000217.html + +.. _admin_guide_000217: + +Restoring DBService data +======================== + +Scenario +-------- + +DBService data needs to be recovered in the following scenarios: data is modified or deleted unexpectedly and needs to be restored. After an administrator performs critical data adjustment in DBService, an exception occurs or the operation has not achieved the expected result. All modules are faulty and become unavailable. Data is migrated to a new cluster. + +System administrators can create a recovery task in FusionInsight Manager to recover DBService data. Only manual restoration tasks are supported. + +.. important:: + + - Data restoration can be performed only when the system version is consistent with that of data backup. + - To recover data when the service is running properly, you are advised to manually back up the latest management data before recovering data. Otherwise, the DBService data that is generated after the data backup and before the data recovery will be lost. + - By default, MRS clusters use DBServices to store metadata of Hive, Hue, Loader, Spark, and Oozie. Restoring DBService data will restore the metadata of all these components. + +Impact on the System +-------------------- + +- After the data is restored, the data generated after the data backup and before the data restoration is lost. +- After the data is restored, the configurations of the components that depend on DBService may expire and these components need to be restarted. + +Prerequisites +------------- + +- To restore data from a remote HDFS, you need to prepare a standby cluster. If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. + +- The status of the active and standby DBService instances is normal. If the status is abnormal, data restoration cannot be performed. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. 
In the **Operation** column of a specified task in the task list, choose **More** > **View History** to view the historical backup task execution records. + + In the displayed window, locate a specified success record and click **View** in the **Backup Path** column to view the backup path information of the task and find the following information: + + - **Backup Object** specifies the data source of the backup data. + + - **Backup Path** specifies the full path where the backup files are saved. + + Select the correct item, and manually copy the full path of backup files in **Backup Path**. + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Restoration Management**. + +#. Click **Create**. + +#. Set **Task Name** to the name of the restoration task. + +#. Select the cluster to be operated from **Recovery Object**. + +#. In the **Restoration Configuration** area, select **DBService**. + + .. note:: + + If multiple DBServices are installed, select the DBServices to be restored. + +#. Set **Path Type** of **DBService** to a backup directory type. + + The settings vary according to backup directory types: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node. + + If you select **LocalDir**, you also need to set **Source Path** to select the backup file to be restored, for example, *Version_Data source_Task execution time*\ **.tar.gz**. + + - **LocalHDFS**: indicates that the backup files are stored in the HDFS directory of the current cluster. + + If you select **LocalHDFS**, set the following parameters: + + - **Source Path**: indicates the full path of the backup file in the HDFS, for example, *Backup path/Backup task name_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Source NameService Name**: indicates the NameService name that corresponds to the backup directory when a restoration task is executed. The default value is **hacluster**. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select **RemoteHDFS**, set the following parameters: + + - **Source NameService Name**: indicates the NameService name of the backup data cluster. You can enter the built-in NameService name of the remote cluster, for example, **haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**. You can also enter a configured NameService name of the remote cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Source NameNode IP Address**: indicates the NameNode service plane IP address of the standby cluster, supporting the active node or standby node. + - **Source Path**: indicates the full path of HDFS directory for storing backup data of the standby cluster, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. + + If you select **NFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. 
The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. + + If you select **CIFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select **SFTP**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Source Path**: indicates the full path of the backup file on the backup server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + - **OBS**: indicates that backup files are stored in OBS. + + If you select **OBS**, set the following parameters: + + - **Source Path**: indicates the full OBS path of a backup file, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + .. note:: + + Only MRS 3.1.0 or later supports saving backup files in OBS. + +#. Click **OK**. + +#. In the restoration task list, locate a created task and click **Start** in the **Operation** column to execute the restoration task. + + - After the restoration is successful, the progress bar is in green. + - After the restoration is successful, the restoration task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and click **Retry** to execute the task again. 
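+
+The prerequisites above require that time be consistent between the active and standby clusters and that both use the same NTP time source. As a rough spot check only (assuming the nodes run **ntpd**; nodes using **chrony** would run **chronyc sources** instead), you can compare the NTP peers and the current time on one node of each cluster:
+
+.. code-block::
+
+   # List the NTP peers the node synchronizes with; both clusters should point to the same time source.
+   ntpq -np
+
+   # Print the current UTC time for a quick cross-cluster comparison.
+   date -u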
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hbase_metadata.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hbase_metadata.rst new file mode 100644 index 0000000..352a7e3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hbase_metadata.rst @@ -0,0 +1,138 @@ +:original_name: admin_guide_000218.html + +.. _admin_guide_000218: + +Restoring HBase Metadata +======================== + +Scenario +-------- + +To ensure HBase metadata security (including tableinfo files and HFiles) or before a major operation on HBase system tables (such as upgrade or migration), you need to back up HBase metadata to prevent HBase service unavailability caused by HBase system table directory or file damages. The backup data can be used to recover the system if an exception occurs or the operation has not achieved the expected result, minimizing the adverse impacts on services. + +System administrators can create a recovery task in FusionInsight Manager to recover HBase metadata. Only manual restoration tasks are supported. + +.. important:: + + - Data restoration can be performed only when the system version is consistent with that during data backup. + + - To recover data when the service is running properly, you are advised to manually back up the latest management data before recovering data. Otherwise, the HBase data that is generated after the data backup and before the data recovery will be lost. + + - It is recommended that a data restoration task restore the metadata of only one component to prevent the data restoration of other components from being affected by stopping a service or instance. If data of multiple components is restored at the same time, data restoration may fail. + + HBase metadata cannot be restored at the same time as NameNode metadata. As a result, data restoration fails. + +Impact on the System +-------------------- + +- Before restoring the metadata, you need to stop the HBase service, during which the HBase upper-layer applications are unavailable. +- After the metadata is restored, the data generated after the data backup and before the data restoration is lost. +- After the metadata is restored, the upper-layer applications of HBase need to be started. + +Prerequisites +------------- + +- If you need to restore data from a remote HDFS, prepare a standby cluster. If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. + +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- You have checked the path for storing HBase metadata backup files. +- The HBase service has been stopped before its metadata is restored. +- You have logged in to FusionInsight Manager. For details, see :ref:`Logging In to FusionInsight Manager `. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. 
In the **Operation** column of a specified task in the task list, choose **More** > **View History** to view the historical backup task execution records. + + In the displayed window, locate a specified success record and click **View** in the **Backup Path** column to view the backup path information of the task and find the following information: + + - **Backup Object** specifies the data source of the backup data. + + - **Backup Path** specifies the full path where the backup files are saved. + + Select the correct item, and manually copy the full path of backup files in **Backup Path**. + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Restoration Management**. + +#. Click **Create**. + +#. Set **Task Name** to the name of the restoration task. + +#. Select the cluster to be operated from **Recovery Object**. + +#. In **Restoration Configuration**, select **HBase** under **Metadata and other data**. + + .. note:: + + If multiple HBase services are installed, select the HBase services to be restored. + +#. Set **Path Type** of **HBase** to a backup directory type. + + The settings vary according to backup directory types: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node. + + If you select **LocalDir**, you also need to set **Source Path** to select the backup file to be restored, for example, *Version_Data source_Task execution time*\ **.tar.gz**. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select **RemoteHDFS**, set the following parameters: + + - **Source NameService Name**: indicates the NameService name of the backup data cluster. You can enter the built-in NameService name of the remote cluster, for example, **haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**. You can also enter a configured NameService name of the remote cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Source NameNode IP Address**: indicates the NameNode service plane IP address of the standby cluster, supporting the active node or standby node. + - **Source Path**: indicates the full path of HDFS directory for storing backup data of the standby cluster, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. + + If you select **NFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + - **CIFS**: indicates that backup files are stored in NAS using the CIFS protocol. 
+ + If you select **CIFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select **SFTP**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Source Path**: indicates the full path of the backup file on the backup server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + - **OBS**: indicates that backup files are stored in OBS. + + If you select **OBS**, set the following parameters: + + - **Source Path**: indicates the full OBS path of a backup file, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + .. note:: + + Only MRS 3.1.0 or later supports saving backup files in OBS. + +#. Click **OK**. + +#. In the restoration task list, locate a created task and click **Start** in the **Operation** column to execute the restoration task. + + - After the restoration is successful, the progress bar is in green. + - After the restoration is successful, the restoration task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and click **Retry** to execute the task again. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hbase_service_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hbase_service_data.rst new file mode 100644 index 0000000..b9fddb0 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hbase_service_data.rst @@ -0,0 +1,141 @@ +:original_name: admin_guide_000219.html + +.. _admin_guide_000219: + +Restoring HBase Service Data +============================ + +Scenario +-------- + +HBase data needs to be recovered in the following scenarios: data is modified or deleted unexpectedly and needs to be restored. 
After an administrator performs critical data adjustment in HBase, an exception occurs or the operation has not achieved the expected result. All modules are faulty and become unavailable. Data is migrated to a new cluster. + +System administrators can create a recovery task in FusionInsight Manager to recover HBase data. Only manual restoration tasks are supported. + +.. important:: + + - Data restoration can be performed only when the system version is consistent with that during data backup. + - To recover data when the service is running properly, you are advised to manually back up the latest management data before recovering data. Otherwise, the HBase data that is generated after the data backup and before the data recovery will be lost. + +Impact on the System +-------------------- + +- During the data recovery process, the system disables the HBase table to be recovered and the table cannot be accessed in this moment. The data recovery process takes several minutes, during which the HBase upper-layer applications are unavailable. +- During data restoration, user authentication stops and users cannot create new connections. +- After the data is restored, the data generated after the data backup and before the data restoration is lost. +- After the data is recovered, the HBase upper-layer applications need to be started. + +Prerequisites +------------- + +- If you need to restore data from a remote HDFS, prepare a standby cluster. If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. + +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. +- The directory for saving the backup file has been checked. +- The HBase upper-layer applications have been stopped. +- You have logged in to FusionInsight Manager. For details, see :ref:`Logging In to FusionInsight Manager `. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. In the **Operation** column of a specified task in the task list, choose **More** > **View History** to view historical backup task execution records. + + In the displayed window, locate a specified success record and click **View** in the **Backup Path** column to view the backup path information of the task and find the following information: + + - **Backup Object** specifies the data source of the backup data. + + - **Backup Path** specifies the full path where the backup files are saved. + + Select the correct item, and manually copy the full path of backup files in **Backup Path**. + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Restoration Management**. + +#. Click **Create**. + +#. Set **Task Name** to the name of the restoration task. + +#. Select the cluster to be operated from **Recovery Object**. + +#. In **Restoration Configuration**, select **HBase** under **Service Data**. + +#. Set **Path Type** of **HBase** to a backup directory type. 
+ + The following backup directory types are supported: + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. If you select **RemoteHDFS**, set the following parameters: + + - **Source NameService Name**: indicates the NameService name of the backup data cluster. You can enter the built-in NameService name of the remote cluster, for example, **haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**. You can also enter a configured NameService name of the remote cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Source NameNode IP Address**: indicates the NameNode service plane IP address of the standby cluster, supporting the active node or standby node. + - **Source Path**: indicates the full path of the backup file in the HDFS, for example, *Backup path/Backup task name_Data source_Task creation time*. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. + - **Recovery Point List**: Click **Refresh** and select an HDFS directory that has been backed up in the standby cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. If you select **NFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time*. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. + - **Recovery Point List**: Click **Refresh** and select an HDFS directory that has been backed up in the standby cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. If you select **CIFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time*. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. 
+ - **Recovery Point List**: Click **Refresh** and select an HDFS directory that has been backed up in the standby cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select **SFTP**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Source Path**: indicates the full path of the backup file on the backup server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. + - **Recovery Point List**: Click **Refresh** and select an HDFS directory that has been backed up in the standby cluster. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + +#. In the **Backup Data** column of **Data Configuration**, select one or more backup data sources to be recovered. In the **Target Namespace** column, specify the target namespace after backup data recovery. + + You are advised to set **Target Namespace** to a location that is different from the backup namespace. + +#. Set **Force recovery** to **true** to forcibly recover all backup data even when a data table with the same name already exists. If the data table contains new data added after the backup, the new data will be lost after the data recovery. If you set the parameter to **false**, the restoration task is not executed if a data table with the same name exists. + +#. Click **Verify** to check whether the restoration task is configured correctly. + + - If the queue name is incorrect, the verification fails. + - If the specified namespace does not exist, the verification fails. + - If the forcible replacement conditions are not met, the verification fails. + +#. Click **OK** to save the settings. + +#. In the restoration task list, locate a created task and click **Start** in the **Operation** column to execute the restoration task. + + - After the restoration is successful, the progress bar is in green. + - After the restoration is successful, the restoration task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and click **Retry** to execute the task again. + +#. Check whether HBase data is restored in an environment where HBase is newly installed or reinstalled. + + - If yes, the administrator needs to set new permissions for roles on FusionInsight Manager based on the original service plan. + - If no, no further operation is required.
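+
+After the restoration task succeeds, you can spot-check the restored tables from an HBase client before reopening the service to upper-layer applications. The following is a minimal sketch only; the client installation directory, user, namespace, and table names are placeholder assumptions:
+
+.. code-block::
+
+   # Load the client environment (assuming the client is installed in /opt/client); in security mode, authenticate first.
+   source /opt/client/bigdata_env
+   kinit hbase_check_user
+
+   # In the HBase shell, confirm that the restored table exists in the target namespace and contains data.
+   hbase shell
+   list 'restored_ns:.*'
+   describe 'restored_ns:example_table'
+   count 'restored_ns:example_table'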
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hdfs_service_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hdfs_service_data.rst new file mode 100644 index 0000000..07f1922 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hdfs_service_data.rst @@ -0,0 +1,139 @@ +:original_name: admin_guide_000223.html + +.. _admin_guide_000223: + +Restoring HDFS Service Data +=========================== + +Scenario +-------- + +HDFS data needs to be recovered in the following scenarios: data is modified or deleted unexpectedly and needs to be restored. After an administrator performs critical data adjustment in the HDFS, an exception occurs or the operation has not achieved the expected result. All modules are faulty and become unavailable. Data is migrated to a new cluster. + +System administrators can create a recovery task in FusionInsight Manager to recover HDFS data. Only manual restoration tasks are supported. + +.. important:: + + - Data restoration can be performed only when the system version is consistent with that during data backup. + - To recover data when the service is running properly, you are advised to manually back up the latest management data before recovering data. Otherwise, the HDFS data that is generated after the data backup and before the data recovery will be lost. + - The HDFS restoration operation cannot be performed for the directories used by running Yarn tasks, for example, **/tmp/logs**, **/tmp/archived**, and **/tmp/hadoop-yarn/staging**. Otherwise, data restoration using Distcp tasks fails due to file loss. + +Impact on the System +-------------------- + +- During data restoration, user authentication stops and users cannot create new connections. +- After the data is restored, the data generated after the data backup and before the data restoration is lost. +- After the data is recovered, the HDFS upper-layer applications need to be started. + +Prerequisites +------------- + +- If you need to restore data from a remote HDFS, prepare a standby cluster. If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. + +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. +- The HDFS backup file save path is correct. +- The HDFS upper-layer applications are stopped. +- You have logged in to FusionInsight Manager. For details, see :ref:`Logging In to FusionInsight Manager `. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. In the **Operation** column of a specified task in the task list, choose **More** > **View History** to view historical backup task execution records. 
+ + In the displayed window, locate a specified success record and click **View** in the **Backup Path** column to view the backup path information of the task and find the following information: + + - **Backup Object** specifies the data source of the backup data. + + - **Backup Path** specifies the full path where the backup files are saved. + + Select the correct item, and manually copy the full path of backup files in **Backup Path**. + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Restoration Management**. + +#. Click **Create**. + +#. Set **Task Name** to the name of the restoration task. + +#. Select the cluster to be operated from **Recovery Object**. + +#. In **Restoration Configuration**, select **HDFS** under **Service Data**. + +#. Set **Path Type** of **HDFS** to a backup directory type. + + The following backup directory types are supported: + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select **RemoteHDFS**, set the following parameters: + + - **Source NameService Name**: indicates the NameService name of the backup data cluster. You can enter the built-in NameService name of the remote cluster, for example, **haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**. You can also enter a configured NameService name of the remote cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Source NameNode IP Address**: indicates the NameNode service plane IP address of the standby cluster, supporting the active node or standby node. + - **Source Path**: indicates the full path of HDFS directory for storing backup data of the standby cluster, for example, *Backup path/Backup task name_Data source_Task creation time*. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. + - **Recovery Point List**: Click **Refresh** and select an HDFS directory that has been backed up in the standby cluster. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **NFS**: indicates that backup files are stored in NAS using the NFS protocol. If you select **NFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time*. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. + - **Recovery Point List**: Click **Refresh** and select an HDFS directory that has been backed up in the standby cluster. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. 
+ - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **CIFS**: indicates that backup files are stored in NAS using the CIFS protocol. If you select **CIFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time*. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. + - **Recovery Point List**: Click **Refresh** and select an HDFS directory that has been backed up in the standby cluster. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select **SFTP**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Source Path**: indicates the full path of the backup file on the backup server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. + - **Recovery Point List**: Click **Refresh** and select an HDFS directory that has been backed up in the standby cluster. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + +#. In the **Backup Data** column of the **Data Configuration** page, select one or more pieces of backup data that needs to be restored based on service requirements. In the **Target Path** column, specify the target location after backup data restoration. + + You are advised to set **Target Path** to a new path that is different from the backup path. + +#. Click **Verify** to check whether the restoration task is configured correctly. 
+ + - If the queue name is incorrect, the verification fails. + - If the specified directory to be restored does not exist, the verification fails. + +#. Click **OK**. + +#. In the restoration task list, locate a created task and click **Start** in the **Operation** column to execute the restoration task. + + - After the restoration is successful, the progress bar is in green. + - After the restoration is successful, the restoration task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and click **Retry** to execute the task again. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hive_service_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hive_service_data.rst new file mode 100644 index 0000000..2049916 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_hive_service_data.rst @@ -0,0 +1,147 @@ +:original_name: admin_guide_000224.html + +.. _admin_guide_000224: + +Restoring Hive Service Data +=========================== + +Scenario +-------- + +Hive data needs to be recovered in the following scenarios: data is modified or deleted unexpectedly and needs to be restored. After an administrator performs critical data adjustment in the Hive, an exception occurs or the operation has not achieved the expected result. All modules are faulty and become unavailable. Data is migrated to a new cluster. + +System administrators can create a recovery task in FusionInsight Manager to recover Hive data. Only manual restoration tasks are supported. + +Hive backup and restoration cannot identify the service and structure relationships of objects such as Hive tables, indexes, and views. When executing backup and restoration tasks, you need to manage unified restoration points based on service scenarios to ensure proper service running. + +.. important:: + + - Data restoration can be performed only when the system version is consistent with that during data backup. + - To recover data when the service is running properly, you are advised to manually back up the latest management data before recovering data. Otherwise, the Hive data that is generated after the data backup and before the data recovery will be lost. + +Impact on the System +-------------------- + +- During data restoration, user authentication stops and users cannot create new connections. +- After the data is restored, the data generated after the data backup and before the data restoration is lost. +- After the data is recovered, the Hive upper-layer applications need to be started. + +Prerequisites +------------- + +- If you need to restore data from a remote HDFS, prepare a standby cluster. If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. + +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. 
+- The database for storing restored data tables, the HDFS save path of data tables, and the list of users who can access restored data are planned. +- The Hive backup file save path is correct. +- The Hive upper-layer applications are stopped. +- You have logged in to FusionInsight Manager. For details, see :ref:`Logging In to FusionInsight Manager `. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. In the **Operation** column of a specified task in the task list, choose **More** > **View History** to view historical backup task execution records. + + In the displayed window, locate a specified success record and click **View** in the **Backup Path** column to view the backup path information of the task and find the following information: + + - **Backup Object** specifies the data source of the backup data. + + - **Backup Path** specifies the full path where the backup files are saved. + + Select the correct item, and manually copy the full path of backup files in **Backup Path**. + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Restoration Management**. + +#. Click **Create**. + +#. Set **Task Name** to the name of the restoration task. + +#. Select the cluster to be operated from **Recovery Object**. + +#. In the **Restoration Configuration** area, select **Hive**. + +#. Set **Path Type** of **Hive** to a backup directory type. + + The following backup directory types are supported: + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. If you select **RemoteHDFS**, set the following parameters: + + - **Source NameService Name**: indicates the NameService name of the backup data cluster. You can enter the built-in NameService name of the remote cluster, for example, **haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**. You can also enter a configured NameService name of the remote cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Source NameNode IP Address**: indicates the NameNode service plane IP address of the standby cluster, supporting the active node or standby node. + - **Source Path**: indicates the full path of HDFS directory for storing backup data of the standby cluster, for example, *Backup path/Backup task name_Data source_Task creation time*. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. + - **Recovery Point List**: Click **Refresh** and select a Hive backup file set that has been backed up in the standby cluster. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **NFS**: indicates that backup files are stored in NAS using the NFS protocol. If you select **NFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. 
+ - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time*. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. + - **Recovery Point List**: Click **Refresh** and select a Hive backup file set that has been backed up in the standby cluster. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **CIFS**: indicates that backup files are stored in NAS using the CIFS protocol. If you select **CIFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time*. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. + - **Recovery Point List**: Click **Refresh** and select a Hive backup file set that has been backed up in the standby cluster. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. + - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select **SFTP**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Source Path**: indicates the full path of the backup file on the backup server, for example, *Backup path/Backup task name_Data source_Task creation time*. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. + - **Recovery Point List**: Click **Refresh** and select an HDFS directory that has been backed up in the standby cluster. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + - **Maximum Number of Maps**: indicates the maximum number of maps in a MapReduce task. The default value is **20**. 
+ - **Maximum Bandwidth of a Map (MB/s)**: indicates the maximum bandwidth of a map. The default value is **100**. + +#. In the **Backup Data** column of **Data Configuration**, select one or multiple backup data sources to be restored based on service requirements. In the **Target Database** and **Target Path** columns, specify the target database and file save path after the backup data is restored. + + Configuration restrictions: + + - Data can be restored to the original database, but data tables must be stored in a new path that is different from the backup path. + - To restore Hive index tables, select the Hive data tables that correspond to the Hive index tables to be restored. + - If a new restoration directory is selected to avoid affecting the current data, HDFS permissions must be manually granted so that users who have permissions on the backup tables can access this directory. + - Data can be restored to other databases. In this case, HDFS permissions must be manually granted so that users who have permissions on the backup tables can access the HDFS directory that corresponds to the database. + +#. Set **Force recovery** to **true** to forcibly restore all backup data when a data table with the same name already exists. If the data table contains new data added after the backup, the new data will be lost after the data restoration. If you set the parameter to **false**, the restoration task is not executed if a data table with the same name exists. + +#. Click **Verify** to check whether the restoration task is configured correctly. + + - If the queue name is incorrect, the verification fails. + - If the specified directory to be restored does not exist, the verification fails. + - If the forcible replacement conditions are not met, the verification fails. + +#. Click **OK**. + +#. In the restoration task list, locate a created task and click **Start** in the **Operation** column to execute the restoration task. + + - After the restoration is successful, the progress bar is in green. + - After the restoration is successful, the restoration task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and click **Retry** to execute the task again. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_kafka_metadata.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_kafka_metadata.rst new file mode 100644 index 0000000..eb20115 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_kafka_metadata.rst @@ -0,0 +1,139 @@ +:original_name: admin_guide_000225.html + +.. _admin_guide_000225: + +Restoring Kafka Metadata +======================== + +Scenario +-------- + +Kafka data needs to be restored in the following scenarios: data is modified or deleted unexpectedly and needs to be restored; an exception occurs or an operation does not achieve the expected result after an administrator performs a critical data adjustment in ZooKeeper; all Kafka modules are faulty and become unavailable; or data is migrated to a new cluster. + +System administrators can create a recovery task in FusionInsight Manager to recover Kafka data. Only manual restoration tasks are supported. + +.. important:: + + - Data restoration can be performed only when the system version is consistent with that during data backup.
+ - To restore Kafka metadata when the service is running properly, you are advised to manually back up the latest Kafka metadata before restoration. Otherwise, the Kafka metadata that is generated after the data backup and before the data restoration will be lost. + +Impact on the System +-------------------- + +- After the metadata is restored, the data generated after the data backup and before the data restoration is lost. +- After the metadata is restored, the offset information stored on ZooKeeper by Kafka consumers is rolled back, resulting in repeated consumption. + +Prerequisites +------------- + +- If you need to restore data from a remote HDFS, prepare a standby cluster. If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. + +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. +- The Kafka service is disabled first, and then enabled upon data restoration. +- You have logged in to FusionInsight Manager. For details, see :ref:`Logging In to FusionInsight Manager `. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. In the **Operation** column of a specified task in the task list, choose **More** > **View History** to view historical backup task execution records. + + In the displayed window, locate a specified success record and click **View** in the **Backup Path** column to view the backup path information of the task and find the following information: + + - **Backup Object** specifies the data source of the backup data. + + - **Backup Path** specifies the full path where the backup files are saved. + + Select the correct item, and manually copy the full path of backup files in **Backup Path**. + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Restoration Management**. + +#. Click **Create**. + +#. Set **Task Name** to the name of the restoration task. + +#. Select the cluster to be operated from **Recovery Object**. + +#. In the **Restoration Configuration** area, select **Kafka**. + + .. note:: + + If multiple Kafka services are installed, select the Kafka services to be restored. + +#. Set **Path Type** of **Kafka** to a backup directory type. + + The settings vary according to backup directory types: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node. + + If you select **LocalDir**, you also need to set **Source Path** to select the backup file to be restored, for example, *Version_Data source_Task execution time*\ **.tar.gz**. + + - **LocalHDFS**: indicates that the backup files are stored in the HDFS directory of the current cluster. + + If you select **LocalHDFS**, set the following parameters: + + - **Source Path**: indicates the full path of the backup file in the HDFS, for example, *Backup path/Backup task name_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. 
+ - **Source NameService Name**: indicates the NameService name that corresponds to the backup directory when a restoration task is executed. The default value is **hacluster**. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select **RemoteHDFS**, set the following parameters: + + - **Source NameService Name**: indicates the NameService name of the backup data cluster. You can enter the built-in NameService name of the remote cluster, for example, **haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**. You can also enter a configured NameService name of the remote cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Source NameNode IP Address**: indicates the NameNode service plane IP address of the standby cluster, supporting the active node or standby node. + - **Source Path**: indicates the full path of HDFS directory for storing backup data of the standby cluster, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + + - **NFS**: indicates that backup files are stored in NAS using the NFS protocol. + + If you select **NFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + - **CIFS**: indicates that backup files are stored in NAS using the CIFS protocol. + + If you select **CIFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + - **OBS**: indicates that backup files are stored in OBS. + + If you select **OBS**, set the following parameters: + + - **Source Path**: indicates the full OBS path of a backup file, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + .. note:: + + Only MRS 3.1.0 or later supports saving backup files in OBS. + +#. Click **OK**. + +#. In the restoration task list, locate a created task and click **Start** in the **Operation** column to execute the restoration task. 
+ + - After the restoration is successful, the progress bar is in green. + - After the restoration is successful, the restoration task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and click **Retry** to execute the task again. + + .. important:: + + - If the Kafka service is deleted after the backup is complete and is then reinstalled, the Broker instances may fail to start after the Kafka metadata is restored and the Kafka service is restarted. In this case, the **/var/log/Bigdata/kafka/broker/server.log** file contains an error. An error example is as follows: + + .. code-block:: + + ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)kafka.common.InconsistentClusterIdException: The Cluster ID kVSgfurUQFGGpHMTBqBPiw doesn't match stored clusterId Some(0Qftv9yBTAmf2iDPSlIk7g) in meta.properties. The broker is trying to join the wrong cluster. Configured zookeeper.connect may be wrong. at kafka.server.KafkaServer.startup(KafkaServer.scala:220) at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44) at kafka.Kafka$.main(Kafka.scala:84) at kafka.Kafka.main(Kafka.scala) + + Check the value of **log.dirs** in the Kafka Broker configuration file **${BIGDATA_HOME}/FusionInsight_Current/*Broker/etc/server.properties**. The value is the Kafka data directory. Go to the Kafka data directory and change the value **0Qftv9yBTAmf2iDPSlIk7g** of **cluster.id** in **meta.properties** to **kVSgfurUQFGGpHMTBqBPiw** (the latest value in the error log). + + - The preceding modification must be performed on each node where Broker is located. After the modification, restart the Kafka service. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_manager_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_manager_data.rst new file mode 100644 index 0000000..dd8078d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_manager_data.rst @@ -0,0 +1,155 @@ +:original_name: admin_guide_000216.html + +.. _admin_guide_000216: + +Restoring Manager Data +====================== + +Scenario +-------- + +Manager data needs to be restored in the following scenarios: data is modified or deleted unexpectedly and needs to be restored; an exception occurs or an operation does not achieve the expected result after an administrator performs a critical data adjustment in FusionInsight Manager; or all modules are faulty and become unavailable. + +System administrators can create a restoration task in FusionInsight Manager to recover Manager data. Only manual restoration tasks are supported. + +.. important:: + + - Data restoration can be performed only when the system version is consistent with that of data backup. + - To recover data when the service is running properly, you are advised to manually back up the latest management data before recovering data. Otherwise, the Manager data that is generated after the data backup and before the data restoration will be lost. + +Impact on the System +-------------------- + +- In the restoration process, the Controller needs to be restarted, and FusionInsight Manager cannot be logged in to or operated during the restart. +- In the restoration process, all clusters need to be restarted and cannot be accessed during the restart.
+- After data restoration, the data, such as system configuration, user information, alarm information, and audit information, that is generated after the data backup and before the data restoration will be lost. This may result in data query failure or cluster access failure. +- After the Manager data is recovered, the system forces the LdapServer of each cluster to synchronize data from the OLdap. + +Prerequisites +------------- + +- To restore data from a remote HDFS, you need to prepare a standby cluster. If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, system mutual trust needs to be configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. + +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. + +- The status of the OMS resources and the LdapServer instances of each cluster is normal. If the status is abnormal, data restoration cannot be performed. +- The status of the cluster hosts and services is normal. If the status is abnormal, data restoration cannot be performed. +- The cluster host topologies during data restoration and data backup are the same. If the topologies are different, data restoration cannot be performed and you need to back up data again. +- The services added to the cluster during data restoration and data backup are the same. If the services are different, data restoration cannot be performed and you need to back up data again. +- The upper-layer applications that depend on the cluster are stopped. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. In the **Operation** column of a specified task in the task list, choose **More** > **View History** to view the historical backup task execution records. + + In the displayed window, locate a specified success record and click **View** in the **Backup Path** column to view the backup path information of the task and find the following information: + + - **Backup Object** specifies the data source of the backup data. + + - **Backup Path** specifies the full path where the backup files are saved. + + Select the correct item, and manually copy the full path of backup files in **Backup Path**. + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Restoration Management**. On the displayed page, click **Create**. + +#. Set **Task Name** to the name of the restoration task. + +#. Set **Recovery Object** to **OMS**. + +#. Select **OMS**. + +#. Set **Path Type** of **OMS** to a backup directory type. + + The settings vary according to backup directory types: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node. + + If you select **LocalDir**, you also need to set **Source Path** to select the backup file to be restored, for example, *Version_Data source_Task execution time*\ **.tar.gz**. + + - **LocalHDFS**: indicates that the backup files are stored in the HDFS directory of the current cluster.
+ + If you select **LocalHDFS**, set the following parameters: + + - **Source Path**: indicates the full path of the backup file in the HDFS, for example, *Backup path/Backup task name_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Cluster for Restoration**: Enter the name of the cluster used during restoration task execution. + - **Source NameService Name**: indicates the NameService name that corresponds to the backup directory when a restoration task is executed. The default value is **hacluster**. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select **RemoteHDFS**, set the following parameters: + + - **Source NameService Name**: indicates the NameService name of the backup data cluster. You can enter the built-in NameService name of the remote cluster, for example, **haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**. You can also enter a configured NameService name of the remote cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Source NameNode IP Address**: indicates the NameNode service plane IP address of the standby cluster, supporting the active node or standby node. + - **Source Path**: indicates the full path of HDFS directory for storing backup data of the standby cluster, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Source Cluster**: Select the cluster of the Yarn queue used by the recovery data. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. If you select **NFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Source Path**: indicates the complete path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. If you select **CIFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. 
+ + If you select **SFTP**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Source Path**: indicates the full path of the backup file on the backup server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + - **OBS**: indicates that backup files are stored in OBS. + + If you select **OBS**, set the following parameters: + + - **Source Path**: indicates the full OBS path of a backup file, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + + .. note:: + + Only MRS 3.1.0 or later supports saving backup files in OBS. + +#. Click **OK**. + +#. In the restoration task list, locate a created task and click **Start** in the **Operation** column to execute the restoration task. + + - After the restoration is successful, the progress bar is in green. + - After the restoration is successful, the restoration task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and click **Retry** to execute the task again. + +#. Log in to the active and standby management nodes as user **omm** using PuTTY. + +#. Run the following command to restart OMS: + + **sh ${BIGDATA_HOME}/om-server/om/sbin/restart-oms.sh** + + The command is run successfully if the following information is displayed: + + .. code-block:: + + start HA successfully. + + Run **sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh** to check whether **HAAllResOK** of the management node is **Normal** and whether FusionInsight Manager can be logged in again. If yes, OMS is restarted successfully. + +#. On FusionInsight Manager, click **Cluster**, click the name of the target cluster, and choose **Services** > **KrbServer**. On the displayed page, choose **More** > **Synchronize Configuration**, click **OK**, and wait for the KrbServer configuration to be synchronized and the service to be restarted. + +#. Choose **Cluster**, click the name of the desired cluster, and choose **More** > **Synchronize Configurations**, click **OK**, and wait until the cluster configuration is synchronized successfully. + +#. On FusionInsight Manager, click **Cluster**, click the name of the target cluster, and choose **More** > **Restart**. On the displayed page, enter the password of the current login user, click **OK**, and wait for the cluster to be restarted. 
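+
+.. note::
+
+   After the cluster restart is complete, you can quickly confirm the OMS status again by re-running the status script used earlier in this procedure on the active management node, for example:
+
+   .. code-block::
+
+      sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh | grep HAAllResOK    # expect Normal
+
+   If **HAAllResOK** is **Normal** and FusionInsight Manager can be logged in to again, OMS is running properly.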
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_namenode_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_namenode_data.rst new file mode 100644 index 0000000..9a84c21 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/recovering_data/restoring_namenode_data.rst @@ -0,0 +1,143 @@ +:original_name: admin_guide_000222.html + +.. _admin_guide_000222: + +Restoring NameNode Data +======================= + +Scenario +-------- + +NameNode data needs to be recovered in the following scenarios: data is modified or deleted unexpectedly and needs to be restored. After an administrator performs critical data adjustment in NameNode, an exception occurs or the operation has not achieved the expected result. All modules are faulty and become unavailable. Data is migrated to a new cluster. + +System administrators can create a recovery task in FusionInsight Manager to recover NameNode data. Only manual restoration tasks are supported. + +.. important:: + + - Data restoration can be performed only when the system version is consistent with that during data backup. + + - To recover data when the service is running properly, you are advised to manually back up the latest management data before recovering data. Otherwise, the NameNode data that is generated after the data backup and before the data recovery will be lost. + + - It is recommended that a data restoration task restore the metadata of only one component to prevent the data restoration of other components from being affected by stopping a service or instance. If data of multiple components is restored at the same time, data restoration may fail. + + HBase metadata cannot be restored at the same time as NameNode metadata. As a result, data restoration fails. + +Impact on the System +-------------------- + +- After the data is restored, the data generated after the data backup and before the data restoration is lost. +- After the data is recovered, the NameNode needs to be restarted and is unavailable during the restart. +- After data is restored, metadata and service data may not be matched, the HDFS enters the security mode, and the HDFS service fails to be started. . + +Prerequisites +------------- + +- If you need to restore data from a remote HDFS, prepare a standby cluster. If the active cluster is deployed in security mode and the active and standby clusters are not managed by the same FusionInsight Manager, mutual trust has been configured. For details, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. If the active cluster is deployed in normal mode, no mutual trust is required. + +- Cross-cluster replication has been configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Replication `. +- Time is consistent between the active and standby clusters and the NTP services on the active and standby clusters use the same time source. +- You have logged in to FusionInsight Manager. For details, see :ref:`Logging In to FusionInsight Manager `. +- On FusionInsight Manager, all the NameNode role instances whose data is to be recovered are stopped. Other HDFS role instances must keep running. After data is recovered, the NameNode role instances need to be restarted. The NameNode role instances cannot be accessed during the restart. 
+- The NameNode backup files are stored **Data path/LocalBackup/** on the active management node. + +Procedure +--------- + +#. On FusionInsight Manager, click **Cluster**, click the name of the desired cluster, and choose **Services** > **HDFS**. On the displayed page, click **Instance** and click **NameNode** to check whether the NameNode instances of the data to be restored are stopped. If the NameNode instances are not stopped, stop them. + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Backup Management**. + +#. In the **Operation** column of a specified task in the task list, choose **More** > **View History** to view historical backup task execution records. + + In the displayed window, locate a specified success record and click **View** in the **Backup Path** column to view the backup path information of the task and find the following information: + + - **Backup Object** specifies the data source of the backup data. + + - **Backup Path** specifies the full path where the backup files are saved. + + Select the correct item, and manually copy the full path of backup files in **Backup Path**. + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration** > **Restoration Management**. + +#. Click **Create**. + +#. Set **Task Name** to the name of the restoration task. + +#. Select the cluster to be operated from **Recovery Object**. + +#. In the **Restoration Configuration** area, select **NameNode**. + +#. Set **Path Type** of **NameNode** to a backup directory type. + + The settings vary according to backup directory types: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node. + + If you select **LocalDir**, set the following parameters: + + - **Source Path**: indicates the full path of the backup file on the local disk, for example, *Backup path/Backup task name_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **RemoteHDFS**: indicates that the backup files are stored in the HDFS directory of the standby cluster. + + If you select **RemoteHDFS**, set the following parameters: + + - **Source NameService Name**: indicates the NameService name of the backup data cluster. You can enter the built-in NameService name of the remote cluster, for example, **haclusterX**, **haclusterX1**, **haclusterX2**, **haclusterX3**, or **haclusterX4**. You can also enter a configured NameService name of the remote cluster. + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Source NameNode IP Address**: indicates the NameNode service plane IP address of the standby cluster, supporting the active node or standby node. + - **Source Path**: indicates the full path of HDFS directory for storing backup data of the standby cluster, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Queue Name**: indicates the name of the Yarn queue used for backup task execution. The name must be the same as the name of the queue that is running properly in the cluster. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. 
+ + - **NFS**: indicates that backup files are stored in the NAS using the NFS protocol. If you select **NFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **CIFS**: indicates that backup files are stored in the NAS using the CIFS protocol. If you select **CIFS**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + - **Server IP Address**: indicates the IP address of the NAS server. + - **Port**: indicates the port number used to connect to the NAS server over the CIFS protocol. The default value is **445**. + - **Username**: indicates the username set when the CIFS protocol is configured. + - **Password**: indicates the password set when the CIFS protocol is configured. + - **Source Path**: indicates the full path of the backup file on the NAS server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **SFTP**: indicates that backup files are stored in the server using the SFTP protocol. + + If you select **SFTP**, set the following parameters: + + - **IP Mode**: indicates the mode of the target IP address. The system automatically selects the IP address mode based on the cluster network type, for example, **IPv4** or **IPv6**. + + - **Server IP Address**: indicates the IP address of the server where the backup data is stored. + - **Port**: indicates the port number used to connect to the backup server over the SFTP protocol. The default value is **22**. + - **Username**: indicates the username for connecting to the server using the SFTP protocol. + - **Password**: indicates the password for connecting to the server using the SFTP protocol. + - **Source Path**: indicates the full path of the backup file on the backup server, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Target NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + - **OBS**: indicates that backup files are stored in OBS. + + If you select **OBS**, set the following parameters: + + - **Source Path**: indicates the full OBS path of a backup file, for example, *Backup path/Backup task name_Data source_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **NameService Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + .. note:: + + Only MRS 3.1.0 or later supports saving backup files in OBS. + +#. Click **OK**. + +#. In the restoration task list, locate a created task and click **Start** in the **Operation** column to execute the restoration task. 
+ + - After the restoration is successful, the progress bar is in green. + - After the restoration is successful, the restoration task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and click **Retry** to execute the task again. + +#. On FusionInsight Manager, click **Cluster**, click the name of the desired cluster, and choose **Services** > **HDFS**. On the displayed page, click **Configurations** and click **All Configurations**. + + On the displayed page, enter the password of the administrator who has logged in for authentication and click **OK**. After the system displays "Operation succeeded", click **Finish**. The service is started successfully. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/viewing_backup_and_restoration_tasks.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/viewing_backup_and_restoration_tasks.rst new file mode 100644 index 0000000..37f627d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/backup_and_recovery_management/viewing_backup_and_restoration_tasks.rst @@ -0,0 +1,50 @@ +:original_name: admin_guide_000231.html + +.. _admin_guide_000231: + +Viewing Backup and Restoration Tasks +==================================== + +Scenario +-------- + +This section describes how to view created backup and recovery tasks and check their running status on FusionInsight Manager. + +Prerequisites +------------- + +You have logged in to FusionInsight Manager. For details, see :ref:`Logging In to FusionInsight Manager `. + +Procedure +--------- + +#. On FusionInsight Manager, choose **O&M** > **Backup and Restoration**. + +#. Click **Backup Management** or **Restoration Management**. + +#. In the task list, obtain the previous execution result in the **Task Status** and **Task Progress** column. Green indicates that the task is executed successfully, and red indicates that the execution fails. + +#. In the **Operation** column of a specified task in the task list, choose **More** > **View History** or click **View History** to view the historical record of backup and restoration task execution. + + In the displayed window, click |image1| before a specified record to display log information about the execution. + +Related Tasks +------------- + +- Starting a backup or restoration task + + In the task list, locate a specified task and choose **More** > **Back Up Now** or click **Start** in the **Operation** column to start a backup or restoration task that is ready or fails to be executed. Executed restoration tasks cannot be repeatedly executed. + +- Stopping a backup or restoration task + + In the task list, locate a specified task and choose **More** > **Stop** or click **Stop** in the **Operation** column to stop a backup or restoration task that is running. After the task is successfully stopped, its **Task Status** changes to **Stopped**. + +- Deleting a backup or restoration task + + In the task list, locate a specified task and choose **More** > **Delete** or click **Delete** in the **Operation** column to delete a backup or restoration task. Backup data will be reserved by default after a task is deleted. + +- Suspending a backup task + + In the task list, locate a specified task and choose **More** > **Suspend** in the **Operation** column to suspend a backup task. Only periodic backup tasks can be suspended. 
Suspended backup tasks are no longer executed automatically. When you suspend a backup task that is being executed, the task execution stops. To resume a task, choose **More** > **Resume**. + +.. |image1| image:: /_static/images/en-us_image_0263899323.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/downloading_the_client.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/downloading_the_client.rst new file mode 100644 index 0000000..37e3aff --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/downloading_the_client.rst @@ -0,0 +1,46 @@ +:original_name: admin_guide_000014.html + +.. _admin_guide_000014: + +Downloading the Client +====================== + +Scenario +-------- + +Use the default client provided by MRS clusters to manage the cluster, run services, and perform secondary development. Before you use this client, you need to download its software package. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* > **Dashboard**. On the page that is displayed, choose **More** > **Download Client**. + + The **Download Cluster Client** dialog box is displayed. + +#. Select a client type for **Select Client Type**. + + - **Complete Client**: the package contains scripts, compilation files, and configuration files. + + - **Configuration Files Only**: the package contains only the client configuration files. + + This type is applicable to application development tasks. For example, after a complete client is downloaded and installed, the cluster administrator modifies the service configuration on FusionInsight Manager, and developers need to update the client configuration files. + + .. note:: + + Set **Select Platform Type** to **x86_64** or **aarch64**. To run the client on x86 nodes, select **x86_64**; to tun the client on TaiShan nodes, select **aarch64**. By default, you should select a client that has the same architecture as your servers. + +#. Determine whether to generate a client software package file on the cluster node. + + - If yes, select **Save to Path** and click **OK** to generate the client file. + + The generated file is stored in the **/tmp/FusionInsight-Client** directory on the active management node by default. You can also store the client file in other directories, and user **omm** has the read, write, and execute permissions on the directory. If the client file already exists in the path, the existing client file will be replaced. + + After the file is generated, copy the obtained package to another directory, for example, **/opt/Bigdata/hadoopclient**, as user **omm** or client installation user. + + - If no, click **OK** to download the client file to the local host. + + The system starts to download the client software package. + + Install the downloaded client by referring to :ref:`Installing a Client `. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/index.rst new file mode 100644 index 0000000..479ce9d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/index.rst @@ -0,0 +1,28 @@ +:original_name: admin_guide_000010.html + +.. 
_admin_guide_000010: + +Cluster Management +================== + +- :ref:`Overview ` +- :ref:`Performing a Rolling Restart of a Cluster ` +- :ref:`Managing Expired Configurations ` +- :ref:`Downloading the Client ` +- :ref:`Modifying Cluster Attributes ` +- :ref:`Managing Cluster Configurations ` +- :ref:`Managing Static Service Pools ` +- :ref:`Managing Clients ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview + performing_a_rolling_restart_of_a_cluster + managing_expired_configurations + downloading_the_client + modifying_cluster_attributes + managing_cluster_configurations + managing_static_service_pools/index + managing_clients/index diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/batch_upgrading_clients.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/batch_upgrading_clients.rst new file mode 100644 index 0000000..6257dd6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/batch_upgrading_clients.rst @@ -0,0 +1,77 @@ +:original_name: admin_guide_000023.html + +.. _admin_guide_000023: + +Batch Upgrading Clients +======================= + +Scenario +-------- + +The client package downloaded from FusionInsight Manager contains the client batch upgrade tool. When multiple clients need to be upgraded after the cluster upgrade or scale-out, you can use this tool to upgrade the clients in batches with a few clicks. In addition, the tool provides the lightweight function of batch updating the **/etc/hosts** file on the nodes where the clients are located. + +Procedure +--------- + +**Prepare for the client upgrade.** + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster**, click the name of the desired cluster, click **More**, and select **Download Client** to download the complete client to the specified directory on the server. + + For details, see :ref:`Downloading the Client `. + + Decompress the downloaded client package and find the **batch_upgrade** directory, for example, **/tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_ClientConfig/batch_upgrade**. + +#. Choose **Cluster**, click the name of the desired cluster, and choose **Client Management**. On the **Client Management** page, click **Export All** to export all client information to the local PC. + +#. Decompress the exported client information and upload the **client-info.cfg** file to the **batch_upgrade** directory. + +#. Supplement the password in the **client-info.cfg** file by referring to :ref:`Reference Information `. + +**Upgrade clients in batches.** + +6. Run the **sh client_batch_upgrade.sh -u -f /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_Client.tar** **-g** **/tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_ClientConfig/batch_upgrade/client-info.cfg** command to perform the upgrade. + + .. important:: + + You are advised to delete the **client-info.cfg** file as soon as possible after the upgrade because the password has been configured. + +7. After the upgrade is complete, verify the upgrade result by running the **sh client_batch_upgrade.sh -c** command. +8. If the client is faulty, run the **sh client_batch_upgrade.sh -s** command to roll back the client. + + .. note:: + + - The client batch upgrade tool moves the original client to the backup directory, and then uses the client package specified by the **-f** parameter to install the client. 
Therefore, if the original client contains customized content, manually save the customized content from the backup directory or move the customized content to the client directory after the upgrade before running the **-c** command. The backup path on the client is *{Original client path}*\ **-backup**. + - The **-u** command is the prerequisite for the **-c** and **-s** commands. You can run the **-c** command to commit the upgrade or the **-s** command to perform a rollback only after the **-u** command is executed to perform an upgrade. + - You can run the **-u** command multiple times to upgrade only the clients that fail to be upgraded. + - The client batch upgrade tool also supports the clients of earlier versions. + - When upgrading a client installed by a non-root user, ensure that the user has the read and write permissions on the directory where the client is located and the parent directory on the target node. Otherwise, the upgrade will fail. + - The client package specified by the **-f** parameter must be a full client package. The client packages of a single component or some components cannot be used as the input. + +.. _admin_guide_000023__section596192114916: + +Reference Information +--------------------- + +Before upgrading clients in batches, you need to manually configure the user password for remotely logging in to the client node. + +Run the **vi client-info.cfg** command to add a user password. + +Example: + +.. code-block:: + + clientIp,clientPath,user,password + 10.10.10.100,/home/omm/client /home/omm/client2,omm,Password + +The fields in the configuration file are as follows: + +- **clientIp**: indicates the IP address of the node where the client is located. +- **clientPath**: indicates the client installation path. Multiple paths are separated by spaces. Note that the path cannot end with a slash (/). +- **user**: indicates the username of the node. +- **password**: indicates the user password of the node. + + .. note:: + + If the execution fails, view the **node.log** file in the **work_space/log**\ \_\ *XXX* directory. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/index.rst new file mode 100644 index 0000000..ad52947 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000021.html + +.. _admin_guide_000021: + +Managing Clients +================ + +- :ref:`Managing a Client ` +- :ref:`Batch Upgrading Clients ` +- :ref:`Updating the hosts File in Batches ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + managing_a_client + batch_upgrading_clients + updating_the_hosts_file_in_batches diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/managing_a_client.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/managing_a_client.rst new file mode 100644 index 0000000..7c6daf4 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/managing_a_client.rst @@ -0,0 +1,53 @@ +:original_name: admin_guide_000022.html + +.. 
_admin_guide_000022: + +Managing a Client +================= + +Scenario +-------- + +FusionInsight Manager supports unified management of cluster client installation information. After a user downloads and installs a client, FusionInsight Manager automatically records information about the installed (registered) client to facilitate query and management. In addition, you can manually add or modify the information about clients that are not automatically registered, for example, clients installed in earlier versions. + +Procedure +--------- + +**View client information.** + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster**, click the name of the desired cluster, and choose **Client Management** to view information about clients installed in the cluster. + + You can view the IP address, installation path, component list, registration time, and installation user of the node where the client is located. + + When a client is downloaded and installed in a cluster of the latest version, the client information is automatically registered. + +**Add client information.** + +3. To manually add information about an installed client, click **Add** and manually add the IP address, installation path, user, platform information, and registration information of the client as prompted. +4. Configure the client information and click **OK**. + +**Modify client information.** + +5. Modify information about a manually registered client. + + On the **Client Management** page, select the target client and click **Modify**. After modifying the information, click **OK**. + +**Delete client information.** + +6. On the **Client Management** page, select the target client and click **Delete**. In the displayed dialog box, click **OK**. + + To delete multiple clients, select all of them and click **Batch Delete**. In the displayed dialog box, click **OK**. + +**Export client information.** + +7. On the **Client Management** page, click **Export All** to export information about all registered clients to the local PC. + + .. note:: + + On the **Client Management** page, only components that have clients are displayed in the component list. Therefore, components that do not have clients and some special components are not displayed. + + The following components are not displayed: + + LdapServer, KrbServer, DBService, Hue, MapReduce, and Flume diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/updating_the_hosts_file_in_batches.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/updating_the_hosts_file_in_batches.rst new file mode 100644 index 0000000..d9e9afb --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_clients/updating_the_hosts_file_in_batches.rst @@ -0,0 +1,34 @@ +:original_name: admin_guide_000024.html + +.. _admin_guide_000024: + +Updating the hosts File in Batches +================================== + +Scenario +-------- + +The client package downloaded from FusionInsight Manager contains the client batch upgrade tool. This tool provides the function of upgrading clients in batches and the lightweight function of batch updating the **/etc/hosts** file on the node where the client is located. + +Prerequisites +------------- + +You have made preparations for the upgrade. For details, see "Prepare for the client upgrade." in :ref:`Batch Upgrading Clients `.
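 + +For reference, a **client-info.cfg** file prepared for this operation might look as follows; this is only a sketch, and the IP address, client installation path, and password are placeholder values. The **user** column must be set to **root** because updating the **/etc/hosts** file requires the **root** user: + +.. code-block:: + +   clientIp,clientPath,user,password +   10.10.10.100,/home/omm/client,root,Password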
+ + +Updating the hosts File in Batches +---------------------------------- + +#. Check whether the user configured for the node where the **/etc/hosts** file needs to be updated is **root**. + + - If yes, go to :ref:`2 `. + - If no, change the user to **root** and go to :ref:`2 `. + +#. .. _admin_guide_000024__li11411382418: + + Run the **sh client_batch_upgrade.sh -r -f /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_Client.tar -g /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_ClientConfig/batch_upgrade/client-info.cfg** command to batch update the **/etc/hosts** file on the nodes where the client resides. + + .. note:: + + - When you batch update the **/etc/hosts** file, the entered client package can be a complete client package or a client package that contains only configuration files (recommended). + - The user configured for the host where the **/etc/hosts** file needs to be updated must be **root**. Otherwise, the update fails. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_cluster_configurations.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_cluster_configurations.rst new file mode 100644 index 0000000..5a58208 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_cluster_configurations.rst @@ -0,0 +1,58 @@ +:original_name: admin_guide_000016.html + +.. _admin_guide_000016: + +Managing Cluster Configurations +=============================== + +Scenario +-------- + +FusionInsight Manager allows you to view the changes of service configuration parameters in a cluster with one click, helping you quickly locate faults and improve configuration management efficiency. + +You can quickly view all non-default values of each service in the cluster, non-uniform values between instances of the same role, historical records of cluster configuration modifications, and expired parameters in the cluster on the configuration page. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > *Name of the desired cluster* > **Configurations**. +#. Select an operation page based on the scenario. + + - To view all non-default values: + + a. Click **All Non-default Values**. The system displays the parameters whose values are different from the default values configured for each service, role, or instance in the current cluster. + + You can click |image1| next to a parameter value to quickly restore the value to the default one. You can click |image2| to view the historical modification records of the parameter. + + If there are a large number of parameters to configure, you can filter the parameters in the filter box in the upper right corner of the page or enter keywords in the search box. + + b. To change the values of the parameters, change the values according to the parameter description and click **Save**. In the dialog box that is displayed, click **OK**. + + - To view all non-uniform values: + + a. Click **All Non-uniform Values**. The system displays parameters with different role, service, instance group, or instance configurations in the current cluster. + + You can click |image3| next to a parameter value and view the differences in the dialog box that is displayed. + + b. To change the value of a parameter, click |image4| to cancel the configuration difference or manually adjust the parameter value, click **OK**, and then click **Save**. 
In the dialog box that is displayed, click **OK**. + + - To check expired configurations: + + a. Click **Expired Configurations**. Expired configuration items in the current cluster are displayed. + b. You can filter services using the service filter box in the upper part of the page to view expired configurations of different services. Alternatively, you can enter keywords in the search box. + c. Expired configuration items have not taken effect yet. Restart the services or instances whose configurations have expired in a timely manner. + + - To view historical configuration records: + + a. Click **Historical Configurations**. The historical configuration change records of the current cluster are displayed. You can view details about parameter value changes, including the service to which the parameter belongs, parameter values before and after the modification, and parameter files. + b. To restore a configuration change, click **Restore Configuration** in the **Operation** column of the target record. In the dialog box that is displayed, click **OK**. + + .. note:: + + Some configuration items take effect only after the corresponding services are restarted. After the configurations are saved, restart the services or instances whose configurations have expired in a timely manner. + +.. |image1| image:: /_static/images/en-us_image_0263899617.png +.. |image2| image:: /_static/images/en-us_image_0279536633.png +.. |image3| image:: /_static/images/en-us_image_0263899556.png +.. |image4| image:: /_static/images/en-us_image_0263899589.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_expired_configurations.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_expired_configurations.rst new file mode 100644 index 0000000..0875d55 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_expired_configurations.rst @@ -0,0 +1,39 @@ +:original_name: admin_guide_000013.html + +.. _admin_guide_000013: + +Managing Expired Configurations +=============================== + +Scenario +-------- + +If a new configuration needs to be delivered to all services in the cluster, or **Configuration Status** of multiple services changes to **Expired** or **Failed** after a configuration is modified, the configuration parameters of these services are not synchronized and do not take effect. In this case, synchronize the configurations and restart related service instances for the cluster so that the new parameters take effect for all services. + +If the configuration of the services in the cluster has been synchronized but has not taken effect, you need to restart the instances whose configuration has expired. + +Impact on the System +-------------------- + +- After synchronizing the cluster configuration, you need to restart the services whose configuration has expired. These services are unavailable during restart. +- The instances whose configuration has expired are unavailable during restart. + +Procedure +--------- + +**Synchronize the configuration.** + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > *Name of the desired cluster* > **Dashboard**. +#. On this page, choose **More** > **Synchronize Configuration**. +#. In the dialog box that is displayed, click **OK**. + +**Restart configuration-expired instances.** + +#. Choose **More** > **Restart Configuration-Expired Instances**. + +#.
In the dialog box that is displayed, enter the password of the current login user and click **OK**. + +#. In the displayed dialog box, click **OK**. + + You can click **View Instance** to open the list of all expired instances and confirm that the instances have been restarted. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/configuring_cluster_static_resources.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/configuring_cluster_static_resources.rst new file mode 100644 index 0000000..649fb00 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/configuring_cluster_static_resources.rst @@ -0,0 +1,93 @@ +:original_name: admin_guide_000019.html + +.. _admin_guide_000019: + +Configuring Cluster Static Resources +==================================== + +Scenario +-------- + +You can adjust the resource base on FusionInsight Manager and customize resource configuration groups if you need to control service resources used on each node in a cluster or the available CPU or I/O quotas on each node at different time segments. + +Impact on the System +-------------------- + +- After a static service pool is configured, the configuration status of affected services is displayed as **Expired**. You need to restart the services. Services are unavailable during restart. + +- After a static service pool is configured, the maximum amount of resources used by each service and role instance cannot exceed the upper limit. + +Procedure +--------- + +**Modify the Resource Adjustment Base** + +#. On FusionInsight Manager, choose **Cluster**, click the name of the target cluster, and choose **Static Service Pool Configurations**. + +#. Click **Configuration** in the upper right corner. The page for configuring resource pools is displayed. + +#. Change the values of **CPU (%)** and **Memory (%)** in the **System Resource Adjustment Base** area. + + Modifying the system resource adjustment base changes the maximum physical CPU and memory usage on nodes by services. If multiple services are deployed on the same node, the maximum physical resource usage of all services cannot exceed the adjusted CPU or memory usage. + +#. Click **Next**. + + To modify parameters again, click **Previous**. + +**Modify the Default Resource Configuration Group** + +5. Click **default**. In the **Configure weight** table, set **CPU LIMIT(%)**, **CPU SHARE(%)**, **I/O(%)**, and **Memory(%)** for each service. + + .. note:: + + - The sum of **CPU LIMIT(%)** and **CPU SHARE(%)** used by all services can exceed 100%. + - The sum of **I/O(%)** used by all services can exceed 100% but cannot be 0. + - The sum of **Memory(%)** used by all services can be greater than, smaller than, or equal to 100%. + - **Memory(%)** cannot take effect dynamically and can only be modified in the default configuration group. + - **CPU LIMIT(%)** is used to configure the ratio of the number of CPU cores that can be used by a service to those that can be allocated to related nodes. + - **CPU SHARE(%)** is used to configure the ratio of the time when a service uses a CPU core to the time when other services use the CPU core. That is, the ratio of time when multiple services compete for the same CPU core. + +6. Click **Generate detailed configurations based on weight configurations**.
FusionInsight Manager generates the actual values of the parameters in the default weight configuration table based on the cluster hardware resources and allocation information. + +7. Click **OK**. + + In the displayed dialog box, click **OK**. + +**Add a Customized Resource Configuration Group** + +8. Determine whether to automatically adjust resource configurations at different time segments. + + - If yes, go to :ref:`9 `. + - If no, use the default configurations, and no further action is required. + +9. .. _admin_guide_000019__li1535819244375: + + Click **Configuration**, change the system resource adjustment base values, and click **Next**. + +10. Click **Add** to add a resource configuration group. + +11. In **Step 1: Scheduling Time**, click **Configuration**. + + The page for configuring the time policy is displayed. + + Modify the following parameters based on service requirements and click **OK**. + + - **Repeat**: If this parameter is selected, the customized resource configuration is applied repeatedly based on the scheduling period. If this parameter is not selected, set the date and time when the configuration of the group of resources can be applied. + - **Repeat Policy**: The available values are **Daily**, **Weekly**, and **Monthly**. This parameter is valid only when **Repeat** is selected. + - **On**: indicates the time period between the start time and end time when the resource configuration is applied. Set a unique time range. If the time range overlaps with that of an existing resource configuration group, the time range cannot be saved. + + .. note:: + + - The default resource configuration group takes effect in all undefined time segments. + - The newly added resource group is a parameter set that takes effect dynamically in a specified time range. + - The newly added resource group can be deleted. A maximum of four resource configuration groups that take effect dynamically can be added. + - Select a repetition policy. If the end time is earlier than the start time, the resource configuration ends on the next day by default. For example, if a validity period ranges from 22:00 to 06:00, the customized resource configuration takes effect from 22:00 on the current day to 06:00 on the next day. + - If the repetition policy types of multiple configuration groups are different, the time ranges can overlap. The policy types are listed as follows by priority from low to high: daily, weekly, and monthly. The following is an example. There are two resource configuration groups using the monthly and daily policies, respectively. Their application time ranges in a day overlap as follows: 04:00 to 07:00 and 06:00 to 08:00. In this case, the configuration of the group that uses the monthly policy prevails. + - If the repetition policy types of multiple resource configuration groups are the same, the time ranges of different dates can overlap. For example, if there are two weekly scheduling groups, you can set the same time range on different days for them, such as 04:00 to 07:00 on Monday and Wednesday, respectively. + +12. Modify the resource configuration of each service in **Step 2: Weight Configuration**. + +13. Click **Generate detailed configuration**. FusionInsight Manager generates the actual values of the parameters in the default weight configuration table based on the cluster hardware resources and allocation information. + +14. Click **OK**. + + In the displayed dialog box, click **OK**.
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/index.rst new file mode 100644 index 0000000..9da8c97 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000017.html + +.. _admin_guide_000017: + +Managing Static Service Pools +============================= + +- :ref:`Static Service Resources ` +- :ref:`Configuring Cluster Static Resources ` +- :ref:`Viewing Cluster Static Resources ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + static_service_resources + configuring_cluster_static_resources + viewing_cluster_static_resources diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/static_service_resources.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/static_service_resources.rst new file mode 100644 index 0000000..5765cc0 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/static_service_resources.rst @@ -0,0 +1,45 @@ +:original_name: admin_guide_000018.html + +.. _admin_guide_000018: + +Static Service Resources +======================== + +Overview +-------- + +A cluster allocates static service resources to the Flume, HBase, HDFS, and YARN services. The total volume of computing resources allocated to each service is fixed; that is, the resources are static. A tenant can exclusively use or share a service to obtain the resources required for running this service. + +Static Service Pool +------------------- + +Static service pools are used to specify service resource configurations. + +Static service pools centrally manage resources that can be used by each service. + +- Limits the total volume of resources that can be used by each service. Specifically, the total volume of CPU, I/O, and memory resources can be configured on the nodes where the Flume, HBase, HDFS, and YARN services are deployed. +- Isolates the resources of services in a cluster from those of other services. In this way, the load of one service has very limited impact on other services. + +Scheduling Mechanism +-------------------- + +The time-based dynamic resource scheduling mechanism enables different volumes of static resources to be configured for services at different times, optimizing service running environments and improving the cluster efficiency. + +In a complex cluster environment, multiple services share resources in the cluster, but the resource service period of each service may be different. + +The following uses a bank customer as an example: + +- The HBase query service is heavy in the daytime. +- The query service is light, but the Hive analysis service is heavy at night. + +If fixed resources are allocated to each service, the following problems may occur: + +- The query service cannot obtain sufficient resources while the resources for the analysis service are idle in the daytime. +- The analysis service cannot obtain sufficient resources while the resources for the query service are idle at night. + +As a result, the cluster resource utilization is low and the service capability is weak.
Resolve the problem in the following ways: + +- Sufficient resources need to be configured for HBase in the daytime. +- Sufficient resources need to be configured for Hive at night. + +The time-based dynamic scheduling mechanism can efficiently utilize resources and run tasks. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/viewing_cluster_static_resources.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/viewing_cluster_static_resources.rst new file mode 100644 index 0000000..0cd828c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/managing_static_service_pools/viewing_cluster_static_resources.rst @@ -0,0 +1,40 @@ +:original_name: admin_guide_000020.html + +.. _admin_guide_000020: + +Viewing Cluster Static Resources +================================ + +Scenario +-------- + +The big data management platform can manage and isolate service resources that are not running on YARN using static service resource pools. The system supports time-based automatic adjustment of static service resource pools. This enables the cluster to automatically adjust the parameter values at different periods to ensure more efficient resource utilization. + +System administrators can view the monitoring indicators of resources used by each service in the static service pool on FusionInsight Manager. The monitoring indicators are as follows: + +- CPU usage of services +- Total disk I/O read rate of services +- Total disk I/O write rate of services +- Total used memory of services + +.. note:: + + After the multi-tenant function is enabled, the CPU, I/O, and memory usage of all HBase instances can be centrally managed. + +Procedure +--------- + +#. On FusionInsight Manager, choose **Cluster**, click the name of the target cluster, and choose **Static Service Pool Configurations**. +#. In the configuration group list, click a configuration group, for example, **default**. +#. Check the system resource adjustment base values. + + - **System Resource Adjustment Base** indicates the maximum volume of resources that can be used by each node in the cluster. If a node has only one service, the service exclusively occupies the available resources on the node. If a node has multiple services, all services share the available resources on the node. + - **CPU** indicates the maximum number of CPUs that can be used by services on a node. + - **Memory** indicates the maximum memory that can be used by services on a node. + +#. In **Chart**, view the metric data of the cluster service resource usage. + + .. note:: + + - You can click **Add Service to Chart** to add static service resource data of specific services (up to 12 services) to the chart. + - For details about how to manage a chart, see :ref:`Managing Monitoring Metric Reports `. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/modifying_cluster_attributes.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/modifying_cluster_attributes.rst new file mode 100644 index 0000000..448ae72 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/modifying_cluster_attributes.rst @@ -0,0 +1,39 @@ +:original_name: admin_guide_000015.html + +.. 
_admin_guide_000015: + +Modifying Cluster Attributes +============================ + +Scenario +-------- + +View basic cluster attributes on FusionInsight Manager. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* > **Cluster Properties**. + + By default, you can view the cluster name, cluster description, product type, cluster ID, authentication mode, creation time, and installed components. + +#. Change the cluster name. + + a. Click |image1| and enter a new name. + + Enter 2 to 199 characters. Only letters, digits, underscores (_), hyphens (-), and spaces are allowed, and the name cannot start with a space. + + b. Click **OK** for the new cluster name to take effect. + +#. Modify the cluster description. + + a. Click |image2| and enter a new description. + + Enter a maximum of 199 characters. Only letters, digits, commas (,), periods (.), underscores (_), spaces, and newline characters (\\n) are allowed. + + b. Click **OK** for the new description to take effect. + +.. |image1| image:: /_static/images/en-us_image_0000001318563432.png +.. |image2| image:: /_static/images/en-us_image_0000001369864353.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/overview.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/overview.rst new file mode 100644 index 0000000..bceec30 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/overview.rst @@ -0,0 +1,47 @@ +:original_name: admin_guide_000011.html + +.. _admin_guide_000011: + +Overview +======== + +Dashboard +--------- + +Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Dashboard** to view the status of the current cluster. + +On the **Dashboard** tab page, you can start, stop, perform a rolling restart of, synchronize configurations to, and perform other basic operations on the current cluster. + +.. _admin_guide_000011__table17943743105914: + +.. table:: **Table 1** Maintenance and management operations + + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | UI Portal | Description | + +================================================================================+============================================================================================================================================================================================================================================================================================================+ + | **Start** | Starts all services in the cluster. | + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Stop** | Stops all services in the cluster. 
| + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Restart** | Restarts all services in the cluster. | + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Rolling-restart Service** | Restarts all services in the cluster one at a time without interrupting workloads. For details about how to perform a rolling restart, see :ref:`Performing a Rolling Restart of a Cluster `. | + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Synchronize Configurations** | Enables new configuration parameters for all services in the cluster. | + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Restart Configuration-Expired Instances** | Restarts expired instances for all services in the cluster. For details, see :ref:`Managing Expired Configurations `. | + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Health Check** | Performs a health check on the OMS nodes, all services, and the rest nodes in the cluster. There are three types of check items: running status, related alarms, and custom monitoring metrics. The health check results are not always the same as the values of **Running Status** displayed on the GUI. | + | | | + | | You can export check results by clicking **Export** in the upper left corner of the checklist. If any issues are detected, you can click **View Help** to find a troubleshooting method. | + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Download Client** | Downloads the default client. 
For details, see :ref:`Downloading the Client `. | + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Export Installation Template** | Batch exports all installation configurations of the cluster, such as the cluster authentication mode, node information, and service configuration. You can use this function when you need to reinstall the cluster in the same environment. | + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Export Configurations** | Batch exports configurations of all services in the cluster. | + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Enter Maintenance Mode** and **More** > **Exit Maintenance Mode** | Enters or exits the cluster maintenance mode. | + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **O&M View** | Allows you to view services or hosts that are in the maintenance mode. | + +--------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/performing_a_rolling_restart_of_a_cluster.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/performing_a_rolling_restart_of_a_cluster.rst new file mode 100644 index 0000000..7101a34 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/cluster_management/performing_a_rolling_restart_of_a_cluster.rst @@ -0,0 +1,82 @@ +:original_name: admin_guide_000012.html + +.. _admin_guide_000012: + +Performing a Rolling Restart of a Cluster +========================================= + +Scenario +-------- + +A rolling restart is batch restarting all services in a cluster after they are modified or upgraded without interrupting workloads. + +You can perform a rolling restart of a cluster as needed. + +.. 
note:: + + - Certain services in a cluster do not support rolling restart. These services are restarted in normal mode during the rolling restart of the cluster. As a result, workloads may be interrupted. So, you need to determine whether to perform this operation as prompted. + - Configurations that must take effect immediately, for example, server port configurations, should be restarted in normal mode. + +Impact on the System +-------------------- + +A rolling restart takes a longer time and may affect service throughput and performance. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the target cluster* > **Dashboard**. On this tab page, choose **More** > **Rolling-restart Service**. + +#. In the dialog box that is displayed, enter the password of the current login user and click **OK**. + +#. Configure the parameters based on site requirements. + + .. _admin_guide_000012__en-us_topic_0118210076_t65f951fcfc8a4a37b6c7f3481125fe35: + + .. table:: **Table 1** Rolling restart parameters + + +-------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================================================+==========================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Restart only instances with expired configurations in the cluster | Whether to restart only the modified instances in a cluster | + +-------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Enable rack strategy | Whether to enable the concurrent rack rolling restart strategy. This parameter takes effect only for roles that meet the rack rolling restart strategy. (The roles support rack awareness, and instances of the roles belong to two or more racks.) | + | | | + | | .. note:: | + | | | + | | This parameter is configurable only when a rolling restart is performed on HDFS or YARN. 
| + +-------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Nodes to Be Batch Restarted | Number of instances that are restarted in each batch when the batch rolling restart strategy is used. The default value is **1**. | + | | | + | | .. note:: | + | | | + | | - This parameter is valid only when the batch rolling restart strategy is used and the instance type is DataNode. | + | | - This parameter is invalid when the rack strategy is enabled. In this case, the cluster uses the maximum number of instances (20 by default) configured in the rack strategy as the maximum number of instances that are concurrently restarted in a rack. | + | | - This parameter is configurable only when a rolling restart is performed on HDFS, HBase, YARN, Kafka, Storm, or Flume. | + | | - This parameter for the RegionServer of HBase cannot be manually configured. Instead, it is automatically adjusted based on the number of RegionServer nodes. Specifically, if the number of RegionServer nodes is less than 30, the parameter value is **1**. If the number is greater than or equal to 30 and less than 300, the parameter value is **2**. If the number is greater than or equal to 300, the parameter value is 1% of the number (rounded-down). | + +-------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Batch Interval | Interval between two batches of instances to be roll-restarted. The default value is **0**. | + +-------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Decommissioning Timeout Interval | Decommissioning interval for role instances during a rolling restart. The default value is **1800s**. | + | | | + | | Some roles (such as HiveServer and JDBCServer) stop providing services before the rolling restart. Stopped instances cannot cannot be connected to new clients. Existing connections will be completed after a period of time. An appropriate timeout interval can ensure service continuity. | + | | | + | | .. note:: | + | | | + | | This parameter is configurable only when a rolling restart is performed on Hive or Spark2x. 
| + +-------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Batch Fault Tolerance Threshold | Tolerance times when the rolling restart of instances fails to be batch executed. The default value is **0**, which indicates that the rolling restart task ends after any batch of instances fails to restart. | + +-------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + Advanced parameters, such as **Data Nodes to Be Batch Restarted**, **Batch Interval**, and **Batch Fault Tolerance Threshold**, should be properly configured based on site requirements. Otherwise, services may be interrupted or cluster performance may be severely affected. + + Example: + + - If **Data Nodes to Be Batch Restarted** is set to an unnecessarily large value, a large number of instances are restarted concurrently. As a result, services are interrupted or cluster performance is severely affected due to too few working instances. + - If **Batch Fault Tolerance Threshold** is too large, services will be interrupted because a next batch of instances will be restarted after a batch of instances fails to restart. + +#. Click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/index.rst new file mode 100644 index 0000000..b2c2447 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000009.html + +.. _admin_guide_000009: + +Cluster +======= + +- :ref:`Cluster Management ` +- :ref:`Managing a Service ` +- :ref:`Instance Management ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + cluster_management/index + managing_a_service/index + instance_management/index diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/decommissioning_and_recommissioning_an_instance.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/decommissioning_and_recommissioning_an_instance.rst new file mode 100644 index 0000000..b16b118 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/decommissioning_and_recommissioning_an_instance.rst @@ -0,0 +1,70 @@ +:original_name: admin_guide_000040.html + +.. 
_admin_guide_000040: + +Decommissioning and Recommissioning an Instance +=============================================== + +Scenario +-------- + +Some role instances provide services for external systems in distributed and parallel mode. Services independently store information about whether each instance can be used. Therefore, you need to use FusionInsight Manager to recommission or decommission these instances to change the instance running status. + +Some instances do not support the recommissioning and decommissioning functions. + +.. note:: + + The following roles support decommissioning and recommissioning: HDFS DataNode, YARN NodeManager, and HBase RegionServer. + + - If the number of DataNodes is less than or equal to the number of HDFS replicas, decommissioning cannot be performed. For example, if the number of HDFS replicas is three and the number of DataNodes in the system is less than four, decommissioning cannot be performed. In this case, an error is reported and FusionInsight Manager is forced to exit the decommissioning 30 minutes after it attempts to perform the decommissioning. + - During MapReduce task execution, files with 10 replicas are generated. Therefore, if the number of DataNode instances is less than 10, decommissioning cannot be performed. + - If the number of DataNode racks (the number of racks is determined by the number of racks configured for each DataNode) is greater than 1 before the decommissioning, and after some DataNodes are decommissioned, that of the remaining DataNodes changes to 1, the decommissioning will fail. Therefore, before decommissioning DataNode instances, you need to evaluate the impact of decommissioning on the number of racks to adjust the DataNodes to be decommissioned. + - If multiple DataNodes are decommissioned at the same time, and each of them stores a large volume of data, the DataNodes may fail to be decommissioned due to timeout. To avoid this problem, it is recommended that one DataNode be decommissioned each time and multiple decommissioning operations be performed. + +Procedure +--------- + +#. Perform the following steps to run a health check for the DataNodes before decommissioning: + + a. Log in to the client installation node as a client user and switch to the client installation directory. + + b. For a security cluster, use user **hdfs** for permission authentication. + + .. code-block:: + + source bigdata_env #Configure client environment variables. + kinit hdfs #Configure kinit authentication. + Password for hdfs@HADOOP.COM: #Enter the login password of user hdfs. + + c. Run the **hdfs fsck / -list-corruptfileblocks** command, and check the returned result. + + - If "has 0 CORRUPT files" is displayed, go to :ref:`2 `. + - If the result does not contain "has 0 CORRUPT files" and the name of the damaged file is returned, go to :ref:`1.d `. + + d. .. _admin_guide_000040__en-us_topic_0046737055_step_1e: + + Run the **hdfs dfs -rm** *Name of the damaged file* command to delete the damaged file. + + .. note:: + + Deleting a file or folder is a high-risk operation. Ensure that the file or folder is no longer required before performing this operation. + +#. .. _admin_guide_000040__en-us_topic_0046737055_step_2: + + Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services**. + +#. Click the specified service name on the service management page. On the displayed page, click the **Instance** tab. + +#. Select the specified role instance to be decommissioned. + +#.
Select **Decommission** or **Recommission** from the **More** drop-down list. + + In the displayed dialog box, enter the password of the current login user and click **OK**. + + Select **I confirm to decommission these instances and accept the consequence of service performance deterioration** and click **OK** to perform the corresponding operation. + + .. note:: + + During the instance decommissioning, if the service corresponding to the instance is restarted in the cluster using another browser, FusionInsight Manager displays a message indicating that the instance decommissioning is stopped, but the operating status of the instance is displayed as **Started**. In this case, the instance has been decommissioned on the background. You need to decommission the instance again to synchronize the operating status. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/index.rst new file mode 100644 index 0000000..bd16b49 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/index.rst @@ -0,0 +1,22 @@ +:original_name: admin_guide_000037.html + +.. _admin_guide_000037: + +Instance Management +=================== + +- :ref:`Overview ` +- :ref:`Decommissioning and Recommissioning an Instance ` +- :ref:`Managing Instance Configurations ` +- :ref:`Viewing the Instance Configuration File ` +- :ref:`Instance Group ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview + decommissioning_and_recommissioning_an_instance + managing_instance_configurations + viewing_the_instance_configuration_file + instance_group/index diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/configuring_instantiation_group_parameters.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/configuring_instantiation_group_parameters.rst new file mode 100644 index 0000000..aeca9b6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/configuring_instantiation_group_parameters.rst @@ -0,0 +1,20 @@ +:original_name: admin_guide_000048.html + +.. _admin_guide_000048: + +Configuring Instantiation Group Parameters +========================================== + +Scenario +-------- + +In a large cluster, users can configure parameters for multiple instances in batches by configuring the related instance groups on FusionInsight Manager, reducing redundant instance configuration items and improving system performance. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > *Name of the desired cluster* > **Services**. +#. Click the specified service name on the service management page. +#. On the displayed page, click the **Instance Groups** tab. +#. In the navigation tree, select the instance group name of a role, and switch to the **Configuration** tab page. Adjust parameters to be modified, and click **Save**. The configuration takes effect for all instances in the instance group. 
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/index.rst new file mode 100644 index 0000000..2756cee --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000045.html + +.. _admin_guide_000045: + +Instance Group +============== + +- :ref:`Managing Instance Groups ` +- :ref:`Viewing Information About an Instance Group ` +- :ref:`Configuring Instantiation Group Parameters ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + managing_instance_groups + viewing_information_about_an_instance_group + configuring_instantiation_group_parameters diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/managing_instance_groups.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/managing_instance_groups.rst new file mode 100644 index 0000000..c564194 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/managing_instance_groups.rst @@ -0,0 +1,89 @@ +:original_name: admin_guide_000046.html + +.. _admin_guide_000046: + +Managing Instance Groups +======================== + +Scenario +-------- + +Instance groups can be managed on FusionInsight Manager. That is, you can group multiple instances in the same role based on a specified principle, such as the nodes with the same hardware configuration. The modification on the configuration parameters of an instance group applies to all instances in the group. + +In a large cluster, instance groups are used to improve the capability of managing instances in batches in the heterogeneous environment. After instances are grouped, the instances can be configured repeatedly to reduce redundant instance configuration items and improve system performance. + +Creating an Instance Group +-------------------------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services**. + +#. Click the specified service name on the service management page. + +#. On the displayed page, click the **Instance Groups** tab. + + Click |image1| and configure parameters as prompted. + + .. table:: **Table 1** Instance group configuration parameters + + +--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +====================+=====================================================================================================================================================================================================================================================================+ + | **The group name** | Indicates the instance group name. The value can contain only letters, digits, underscores (_), hyphens (-), and spaces. It must start with a letter, digit, underscore (_), or hyphen (-) and cannot ends with a space. It can contain a maximum of 99 characters. 
| + +--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Role** | Indicates the role to which an instance group belongs. | + +--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Copy From** | Indicates that the parameter values of a specified instance group are copied to the parameters of a new group. If the value is null, the default values are used for the parameters of the new group. | + +--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Description** | Indicates the instance group description. It can contain only letters, digits, commas (,), periods (.), underscores (_), spaces, and line breaks, and can contain a maximum of 200 characters. | + +--------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + - Each instance must belong to only one instance group. When an instance is installed for the first time, it belongs to the instance group *Role name*\ ``-``\ **DEFAULT** by default. + - You can delete unnecessary or unused instance groups. Before deleting an instance group, migrate all instances in the group to other instance groups, and then delete the instance group by referring to :ref:`Deleting an Instance Group `. The default instance group cannot be deleted. + +#. Click **OK**. + + The instance group is created. + +Modifying Properties of an Instance Group +----------------------------------------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services**. + +#. Click the specified service name on the service management page. + +#. Click the **Instance Groups** tab. On the **Instance Groups** tab page, locate the row that contains the target instance group. + + Click |image2| and modify parameters as prompted. + +#. Click **OK** to save the modifications. + + The default instance group cannot be modified. + +.. _admin_guide_000046__section10369132812451: + +Deleting an Instance Group +-------------------------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services**. + +#. Click the specified service name on the service management page. + +#. Click the **Instance Groups** tab. On the **Instance Groups** tab page, locate the row that contains the target instance group. + +#. Click |image3|. + +#. In the displayed dialog box, click **OK**. + + The default instance group cannot be deleted. + +.. |image1| image:: /_static/images/en-us_image_0263899383.png +.. |image2| image:: /_static/images/en-us_image_0000001318441686.png +.. 
|image3| image:: /_static/images/en-us_image_0000001318122266.png
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/viewing_information_about_an_instance_group.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/viewing_information_about_an_instance_group.rst new file mode 100644 index 0000000..0e6e692 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/instance_group/viewing_information_about_an_instance_group.rst @@ -0,0 +1,34 @@
+:original_name: admin_guide_000047.html
+
+.. _admin_guide_000047:
+
+Viewing Information About an Instance Group
+===========================================
+
+Scenario
+--------
+
+The cluster administrator can view the instance group of a specified service on FusionInsight Manager.
+
+Procedure
+---------
+
+#. Log in to FusionInsight Manager.
+#. Choose **Cluster** > *Name of the desired cluster* > **Services**.
+#. Click the specified service name on the service management page.
+#. On the displayed page, click the **Instance Groups** tab.
+#. In the navigation tree, select a role. On the **Basic** tab page, view all instances in the instance group.
+
+   .. note::
+
+      To move an instance from one instance group to another, perform the following operations:
+
+      a. Select the instance to be moved and click **Move**.
+
+      b. In the displayed dialog box, select the instance group to which the instance is to be moved.
+
+         During the migration, the instance automatically inherits the configuration of the new instance group. If the instance configuration was modified before the migration, the instance configuration prevails.
+
+      c. Click **OK**.
+
+         Restart the expired service or instance for the configuration to take effect.
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/managing_instance_configurations.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/managing_instance_configurations.rst new file mode 100644 index 0000000..a52c1de --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/managing_instance_configurations.rst @@ -0,0 +1,55 @@
+:original_name: admin_guide_000043.html
+
+.. _admin_guide_000043:
+
+Managing Instance Configurations
+================================
+
+Scenario
+--------
+
+Configuration parameters of each role instance can be modified. In the scenario where instances are migrated to a new cluster or the service is redeployed, the cluster administrator can import or export all configuration data of a service on FusionInsight Manager to quickly copy the configuration results.
+
+FusionInsight Manager can manage configuration parameters of a single role instance. Modifying configuration parameters and importing or exporting instance configurations do not affect other instances.
+
+Impact on the System
+--------------------
+
+After modifying the configuration of a role instance, you need to restart the instance if the instance status is **Expired**. The role instance is unavailable during the restart.
+
+Modifying Instance Configuration
+--------------------------------
+
+#. Log in to FusionInsight Manager.
+
+#. Choose **Cluster** > *Name of the desired cluster* > **Services**.
+
+#. Click the specified service name on the service management page. On the displayed page, click the **Instance** tab.
+
+#. 
Click the specified instance and select **Instance Configuration**. + + By default, **Basic Configuration** is displayed. To modify more parameters, click **All Configurations**. All parameter categories supported by the instance are displayed on the **All Configurations** tab page. + +#. In the navigation tree, select the specified parameter category and change the parameter values on the right. + + If you are not sure about the location of a parameter, you can enter the parameter name in search box in the upper right corner. The system searches for the parameter in real time and displays the result. + +#. Click **Save**. In the confirmation dialog box, click **OK**. + + Wait until the message "Operation succeeded." is displayed. Click **Finish**. + + The configuration is modified. + + .. note:: + + After the configuration parameters of a role instance are modified, you need to restart the instance if the instance status is **Expired**. You can select the expired instance on the **Instances** page and choose **More** > **Restart Instance**. + +Exporting/Importing Instance Configuration +------------------------------------------ + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > *Name of the desired cluster* > **Services**. +#. Click the specified service name on the service management page. On the displayed page, click the **Instance** tab. +#. Click the specified instance and select **Instance Configurations**. +#. Click **Export** to export the configuration parameter file to the local host. +#. On the **Instance Configurations** page, click **Import**, select the configuration parameter file of the instance, and import the file. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/overview.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/overview.rst new file mode 100644 index 0000000..171a52b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/overview.rst @@ -0,0 +1,94 @@ +:original_name: admin_guide_000038.html + +.. _admin_guide_000038: + +Overview +======== + + +Overview +-------- + +Log in to FusionInsight Manager, click **Cluster**, click the name of the desired cluster, and choose **Service** > **KrbServer**. On the displayed page, click **Instance**. The displayed instance management page contains the function area and role instance list. + +Functional Area +--------------- + +After selecting the instances to be operated in the function area, you can maintain and manage the role instances, such as starting or stopping the instances. :ref:`Table 1 ` shows the main operations. + +.. _admin_guide_000038__table17943743105914: + +.. 
table:: **Table 1** Instance maintenance and management + + +---------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | UI Portal | Description | + +===============================================================+================================================================================================================================================================================================================================================================================================================+ + | **Start Instance** | Start a specified instance in the cluster. You can start a role instance in the **Not Started**, **Stop Failed**, or **Startup Failed** state to use the role instance. | + +---------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More > Stop Instance** | Stop a specified instance in the cluster. You can stop a role instance that is no longer used or is abnormal. | + +---------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More > Restart Instance** | Restart a specified instance in the cluster. You can restart an abnormal role instance to restore it. | + +---------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More > Instance Rolling Restart** | Restart a specified instance in the cluster without interrupting services. For details about the parameter settings, see :ref:`Performing a Rolling Restart of a Cluster `. | + +---------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More > Decommission/Recommission** | Recommission or decommission a specified instance in the cluster to change the service availability status of the service. For details, see :ref:`Decommissioning and Recommissioning an Instance `. | + | | | + | | .. note:: | + | | | + | | Only the role DataNode in HDFS, role NodeManager in Yarn, and role RegionServer in HBase support the recommissioning and decommissioning functions. 
| + +---------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | *Desired instance* > **More** > **Synchronize Configuration** | If the **Configuration Status** of a role instance is **Expired**, the role instance has not been restarted after the configuration is modified, and the new configuration is saved only on FusionInsight Manager. In this case, use this function to deliver the new configuration to the specified instance. | + | | | + | | .. note:: | + | | | + | | - After synchronizing the role instance configuration, you need to restart the role instance whose configuration has expired. The role instance is unavailable during the restart. | + | | - After the synchronization is complete, restart the instance for the configuration to take effect. | + +---------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | *Desired instance* > **Instance Configurations** | For details, see :ref:`Managing Instance Configurations `. | + +---------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +You can filter instances based on the role they belong to or their running status in the function area. + +.. note:: + + Click **Advanced Search** to search for specified instances by specifying other filter criteria, such as **Host Name**, **Management IP Address**, **Business IP Address**, or **Instance Groups**. + +Role Instance List +------------------ + +The role instance list contains the instances of all roles in the cluster. The list displays the running status, configuration status, hosts, and related IP addresses of each instance. + +.. table:: **Table 2** Instance running status + + +---------------------+----------------------------------------------------------------------------------------------------------+ + | Status | Description | + +=====================+==========================================================================================================+ + | **Normal** | Indicates that the instance is running properly. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Faulty** | Indicates that the instance cannot run properly. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Decommissioned** | Indicates that the instance is out of service. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Not started** | Indicates that the instance is stopped. 
| + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Unknown** | Indicates that the initial status of the instance cannot be detected. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Starting** | Indicates that the instance is being started. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Stopping** | Indicates that the instance is being stopped. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Restoring** | Indicates that an exception may occur in the instance and the instance is being automatically rectified. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Decommissioning** | Indicates that the instance is being decommissioned. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Recommissioning** | Indicates that the instance is being recommissioned. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Failed to start** | Indicates that the service fails to be started. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Failed to stop** | Indicates that the service fails to be stopped. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + +Instance Details +---------------- + +You can click an instance name to go to the instance details page and view the basic information, configuration file, instance logs, and monitoring metric reports of the instance. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/viewing_the_instance_configuration_file.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/viewing_the_instance_configuration_file.rst new file mode 100644 index 0000000..aa3323c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/instance_management/viewing_the_instance_configuration_file.rst @@ -0,0 +1,30 @@ +:original_name: admin_guide_000044.html + +.. _admin_guide_000044: + +Viewing the Instance Configuration File +======================================= + +Scenario +-------- + +FusionInsight Manager allows O&M personnel to view the content configuration files such as environment variables and role configurations of the instance node on the management page. If O&M personnel need to quickly check whether configuration items of the instance are incorrectly configured or when some hidden configuration items need to be viewed, the O&M personnel can directly view the configuration files on FusionInsight Manager. In this case, users quickly analyze configuration problems. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* > **Service**. + +#. Click the specified service name on the service management page. On the displayed page, click the **Instance** tab. + +#. 
Click the name of the target instance. In the **Configuration File** area on the **Instance Status** page, the configuration file list of the instance is displayed.
+
+#. Click the name of the configuration file to be viewed to check the parameter values in the file.
+
+   To obtain the configuration file, you can download it to the local PC.
+
+   .. note::
+
+      If a node in the cluster is faulty, the configuration file cannot be viewed. Rectify the fault before viewing the configuration file again.
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/index.rst new file mode 100644 index 0000000..8c2d3eb --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/index.rst @@ -0,0 +1,18 @@
+:original_name: admin_guide_000026.html
+
+.. _admin_guide_000026:
+
+Managing a Service
+==================
+
+- :ref:`Overview `
+- :ref:`Other Service Management Operations `
+- :ref:`Service Configuration `
+
+.. toctree::
+   :maxdepth: 1
+   :hidden:
+
+   overview
+   other_service_management_operations/index
+   service_configuration/index
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/collecting_stack_information.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/collecting_stack_information.rst new file mode 100644 index 0000000..c5e7889 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/collecting_stack_information.rst @@ -0,0 +1,44 @@
+:original_name: admin_guide_000033.html
+
+.. _admin_guide_000033:
+
+Collecting Stack Information
+============================
+
+Scenario
+--------
+
+To meet actual service requirements, the cluster administrator can collect stack information about a specified role or instance on FusionInsight Manager, save the information to a local directory, and download it. The following information can be collected:
+
+#. jstack information.
+#. jmap -histo information.
+#. jmap -dump information.
+#. The jstack and jmap -histo information can be collected continuously for comparison.
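+
+The collected data corresponds to the output of the standard JDK troubleshooting tools. For reference only, the following minimal sketch shows the equivalent manual commands, assuming that you have logged in to the node where the role instance runs as a user that can access the process; the PID **12345** and the **/tmp** output paths are examples only.
+
+.. code-block::
+
+   jps                                                    #Find the PID of the role instance process (assumed to be 12345 below).
+   jstack 12345 > /tmp/jstack_12345.txt                   #Collect the thread stack (jstack information).
+   jmap -histo 12345 > /tmp/jmap_histo_12345.txt          #Collect the heap object histogram (jmap -histo information).
+   jmap -dump:format=b,file=/tmp/heap_12345.hprof 12345   #Collect a full heap dump (jmap -dump information).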
+
+Procedure
+---------
+
+**Collecting Stack Information**
+
+#. Log in to FusionInsight Manager.
+#. Click **Cluster**, click the name of the desired cluster, click **Services**, and click the target service.
+#. On the displayed page, choose **More** > **Collect Stack Information**.
+
+   .. note::
+
+      - To collect stack information of multiple instances, go to the instance list, select the desired instances, and choose **More** > **Collect Stack Information**.
+      - To collect stack information of a single instance, click the desired instance and choose **More** > **Collect Stack Information**.
+
+#. In the displayed dialog box, select the desired role and content, configure advanced options (retain the default settings if there is no special requirement), and click **OK**.
+#. After the collection is successful, click **Download**.
+
+**Downloading Stack Information**
+
+6. Click **Cluster**, click the name of the desired cluster, click **Services**, and click the target service. Choose **More** > **Download Stack Information** in the upper right corner.
+7. Select the desired role and content and click **Download** to download the stack information to the local PC.
+
+**Clearing Stack Information**
+
+8. Click **Cluster**, click the name of the desired cluster, click **Services**, and click the target service.
+9. Choose **More** > **Clear Stack Information** in the upper right corner.
+10. Select the desired role and content and configure **File Directory**. Click **OK**.
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/index.rst new file mode 100644 index 0000000..99fe7e5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/index.rst @@ -0,0 +1,22 @@
+:original_name: admin_guide_000029.html
+
+.. _admin_guide_000029:
+
+Other Service Management Operations
+===================================
+
+- :ref:`Service Details Page `
+- :ref:`Performing Active/Standby Switchover of a Role Instance `
+- :ref:`Resource Monitoring `
+- :ref:`Collecting Stack Information `
+- :ref:`Switching Ranger Authentication `
+
+.. toctree::
+   :maxdepth: 1
+   :hidden:
+
+   service_details_page
+   performing_active_standby_switchover_of_a_role_instance
+   resource_monitoring
+   collecting_stack_information
+   switching_ranger_authentication
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/performing_active_standby_switchover_of_a_role_instance.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/performing_active_standby_switchover_of_a_role_instance.rst new file mode 100644 index 0000000..05846ba --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/performing_active_standby_switchover_of_a_role_instance.rst @@ -0,0 +1,29 @@
+:original_name: admin_guide_000031.html
+
+.. _admin_guide_000031:
+
+Performing Active/Standby Switchover of a Role Instance
+=======================================================
+
+Scenario
+--------
+
+Some service roles are deployed in active/standby mode. If the active instance needs to be maintained and cannot provide services, or if other maintenance is required, you can manually trigger an active/standby switchover.
+
+Procedure
+---------
+
+#. Log in to FusionInsight Manager.
+#. Choose **Cluster** > *Name of the desired cluster* > **Services**.
+#. Click the specified service name on the service management page.
+#. On the service details page, expand the **More** drop-down list and select **Perform Role Instance Switchover**.
+#. In the displayed dialog box, enter the password of the current login user and click **OK**.
+#. In the displayed dialog box, click **OK** to perform the active/standby switchover for the role instance.
+
+   .. note::
+
+      - The Manager component package supports the active/standby switchover of DBService role instances only.
+      - The HD component package supports the active/standby switchover of role instances of the following services: HDFS, YARN, Storm, HBase, and MapReduce.
+ - When an active/standby switchover is performed for a NameNode on HDFS, a NameService must be set. + - The Porter component package only supports the active/standby switchover of Loader role instances. + - This function cannot be used for other role instances. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/resource_monitoring.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/resource_monitoring.rst new file mode 100644 index 0000000..d364142 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/resource_monitoring.rst @@ -0,0 +1,107 @@ +:original_name: admin_guide_000032.html + +.. _admin_guide_000032: + +Resource Monitoring +=================== + +Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services**, and click **Resource**. The resource monitoring page is displayed. + +Some services in the cluster provide service-level resource monitoring metrics. By default, the monitoring data of the latest 12 hours is displayed. You can click |image1| to customize a time range. Time range options are **12h**, **1d**, **1w**, and **1m**. You can click |image2| to export the corresponding report information. If a monitoring item has no data, the report cannot be exported. :ref:`Table 1 ` lists the services and monitoring items that support resource monitoring. + +.. _admin_guide_000032__tdde6b8099f4945c88264ffdb296e0eb9: + +.. table:: **Table 1** Service resource monitoring + + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Metrics | Description | + +=======================+================================================+==============================================================================================================================================================================================================================================================================+ + | HDFS | Resource Usage (by Tenant) | - Collects statistics on HDFS resource usage by tenant. | + | | | - Views the metrics **Capacity** or **Number of File Objects**. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Resource Usage (by User) | - Collects statistics on HDFS resource usage by user. | + | | | - Views the metrics **Used Capacity** or **Number of File Objects**. 
| + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Resource Usage (by Directory) | - Collects statistics on HDFS resource usage by directory. | + | | | - Views the metrics **Used Capacity** or **Number of File Objects**. | + | | | - You can click |image3| to configure space monitoring. Alternatively, you can specify an HDFS file system directory for monitoring. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Resource Usage (by Replica) | - Collects statistics on HDFS resource usages by replica count. | + | | | - Views the metrics **Used Capacity** or **File Count**. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Resource Usage (by File Size) | - Collects statistics on HDFS resource usages by file size. | + | | | - Views the metrics **Used Capacity** or **File Count**. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Recycle Bin (by User) | - Collects statistics on the usage of the HDFS recycle bin by user. | + | | | - Views the metrics **Recycle Bin Capacity** or **Number of File Objects**. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Operation Count | - Collects the number of operations in HDFS. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Automatic Balancer | - Collects statistics on the execution speed of HDFS automatic balancer and the total capacity of the current balancer migration. 
| + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | NameNode RPC Open Connections (by User) | - Displays the number of connections of each user in the Client RPC requests connected to NameNodes. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Slow DataNodes | Displays DataNode that transmits or processes data slowly in the cluster. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Slow Disks | Displays the disk that processes data slowly on the DataNode in the cluster. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | HBase | Operation Requests in Tables | Displays the number of PUT, DELETE, GET, SCAN, INCREMENT, and APPEND operation requests in all tables on all RegionServers. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Operation Requests on RegionServers | Displays the number of PUT, DELETE, GET, SCAN, INCREMENT, and APPEND operation requests and number of all operation requests in RegionServer. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Operation Requests for Service | Displays the number of PUT, DELETE, GET, SCAN, INCREMENT, and APPEND operation requests in all regions on RegionServers. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | HFiles on RegionServers | Displays the number of HFiles in all RegionServers. 
| + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Hive | HiveServer2-Background-Pool Threads (by IP) | Displays the number of HiveServer2-Background-Pool threads of top users. These threads are measured and displayed in a measurement period. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | HiveServer2-Handler-Pool Threads (by IP) | Displays the number of HiveServer2-Handler-Pools of top users collected and displayed in a period. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Used MetaStore Number (by IP) | Collects statistics on and displays the MetaStore usage of top users in a period. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Number of Hive jobs | Displays the number of user-related jobs collected by Hive in a period. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Number of Files Accessed in the Split Phase | Displays the number of files accessed by the underlying file storage system (HDFS by default) in the Split phase in a period. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Hive Basic Operation Time | Collects time for creating a directory (mkdirTime), creating a file (touchTime), writing a file (writeFileTime), renaming a file (renameTime), moving a file (moveTime), deleting a file (deleteFileTime), and deleting a directory (deleteCatalogTime) in a period of time. 
| + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Table Partitions | Displays the number of partitions in all Hive tables, which is displayed in the following format: *database* # *table name*, *number of table partitions*. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | HQL Map Count | Collects statistics on HQL statements executed in a period and the number of Map statements invoked during the execution. The displayed information includes users, HQL statements, and the number of Map statements. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | HQL Access Statistics | Displays the number of HQL access times in a period. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Kafka | Kafka Disk Usage Distribution | Displays the disk usage distribution statistics of the Kafka cluster. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Spark2x | HQL Access Statistics | Collects HQL access statistics in a period, including the username, HQL statement, and HQL statement execution times. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Yarn | Used resources (by task) | - Displays the number of CPU cores and memory used by a task. | + | | | - Views the metrics **By memory** or **By CPU**. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Resource usage (by tenant) | - Displays the number of CPU cores and memory used by a tenant. | + | | | - Views the metrics **By memory** or **By CPU**. 
| + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Resource usage ratio (by tenant) | - Displays the ratio of the number of CPU cores to the memory used by a tenant. | + | | | - Views the metrics **By memory** or **By CPU**. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Task Duration Ranking | Displays Yarn tasks sorted by time consumption. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ResourceManager RPC Open Connections (by User) | Displays the number of client RPC connections to ResourceManager by user. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Operation Count | Collects statistics on the number and proportion of operations corresponding to each Yarn operation type. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Ranking of Tasks in a Queue by Resource Usage | - Displays the resources consumed by the tasks running in a queue after the queue (tenant) is selected on the GUI. | + | | | - Views the metrics **By memory** or **By CPU**. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Ranking of Users in a Queue by Resource Usage | - Displays the resources consumed by the users who are running tasks in the queue after a queue (tenant) is selected on the GUI. | + | | | - Views the metrics **By memory** or **By CPU**. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ZooKeeper | Used Resources (By Second-Level Znode) | - Displays the ZooKeeper level-2 znode resource status. 
| + | | | - Views the metrics **By Znode quantity** or **By capacity**. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Number of Connections (by Client IP Address) | Displays the ZooKeeper client connection resource status. | + +-----------------------+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001369965777.png +.. |image2| image:: /_static/images/en-us_image_0000001318119582.png +.. |image3| image:: /_static/images/en-us_image_0263899644.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/service_details_page.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/service_details_page.rst new file mode 100644 index 0000000..1aca37c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/service_details_page.rst @@ -0,0 +1,82 @@ +:original_name: admin_guide_000030.html + +.. _admin_guide_000030: + +Service Details Page +==================== + +Overview +-------- + +Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services**. In the service list, click the specified service name to go to the service details page, including the **Dashboard**, **Instance**, **Instance Groups** and **Configurations** tab pages as well as function areas. For some services, the custom management tool page can be displayed.For details about the supported management tools, see :ref:`Table 1 `.. + +.. _admin_guide_000030__table1936171313284: + +.. table:: **Table 1** Customized management tools + + +------------------------------+---------+-------------------------------------------------------------------+ + | Tool | Service | Description | + +==============================+=========+===================================================================+ + | Flume configuration tool | Flume | Configures collection parameters for the Flume server and client. | + +------------------------------+---------+-------------------------------------------------------------------+ + | Flume client management tool | Flume | Views the monitoring information about the Flume client. | + +------------------------------+---------+-------------------------------------------------------------------+ + | Kafka topic monitoring tool | Kafka | Monitors and manages Kafka topics. | + +------------------------------+---------+-------------------------------------------------------------------+ + +The **Dashboard** page is the default page, which contains the basic information, role list, dependency table, and monitoring chart, and more. You can manage services in the upper right corner. 
For details about basic service management, such as starting, stopping, rolling restart, and synchronization configuration, see :ref:`Table 3 `. For details about other service management operations, see :ref:`Table 2 `. + +.. _admin_guide_000030__table15518145510269: + +.. table:: **Table 2** Service management operations + + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Navigation Path | Description | + +============================================+============================================================================================================================================================================================================================================================================+ + | **More** > **Health Check** | Performs a health check for the current service. The health check items include the health status of each check object, related alarms, and user-defined monitoring indicators. The check result is not the same as the values of **Running Status** displayed on the GUI. | + | | | + | | To export the result of the health check, click **Export Report** in the upper left corner of the checklist. If you find any problem, click **View Help**. | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Download Client** | Download the default client that contains only specific services and perform management operations, run services, or perform secondary development on the client. For details, see :ref:`Downloading the Client `. | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Change Service Name** | Changes the name of the current service. | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Perform** *XX* **Switchover** | For details, see :ref:`Performing Active/Standby Switchover of a Role Instance `. | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **More** > **Enter/Exit Maintenance Mode** | Configures a service to enter/exit the maintenance mode. 
| + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Configurations** > **Import/Export** | In the scenario where services are migrated to a new cluster or the same services are deployed again, you can import or export all configuration data of a specific service to quickly copy the configuration results. | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Basic Information Area +---------------------- + +The basic information area on the **Dashboard** tab page contains the basic status data of the service, including the running status, configuration details, version, and key information of the service. If the service supports the open-source web UIs, you can access the open-source web UIs by clicking the links in the basic information area. + +.. note:: + + In the current version, user **admin** does not have the permission to access all the service functions provided on the open source web UI. Create a component service administrator to access the WebUI address. + +Role List +--------- + +The role list on the **Dashboard** tab page contains all roles of the service. The role list displays the running status and the number of instances of each role. + +Dependency +---------- + +The dependency relationship table on the **Dashboard** tab page displays the services on which the current service depends and other services that depend on the service. + +Historical Records of Alarms and Events +--------------------------------------- + +The alarm and event history area displays the key alarms and events reported by the current service. Up to 20 historical records are displayed. + +Chart +----- + +The chart area is displayed on the right of the **Dashboard** tab page and contains the key monitoring indicator report of the service. You can customize the monitoring report that is displayed in the chart area, view the description of the monitoring metrics, or export the monitoring data. For a customized resource contribution chart, you can zoom in on the chart and switch between the trend chart and distribution chart. + +.. note:: + + Some services in the cluster provide service-level resource monitoring items. For details, see :ref:`Resource Monitoring `. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/switching_ranger_authentication.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/switching_ranger_authentication.rst new file mode 100644 index 0000000..6f02d9e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/other_service_management_operations/switching_ranger_authentication.rst @@ -0,0 +1,40 @@ +:original_name: admin_guide_000415.html + +.. 
_admin_guide_000415: + +Switching Ranger Authentication +=============================== + +Scenario +-------- + +By default, the Ranger service is installed and Ranger authentication is enabled for a newly installed cluster in security mode. You can set fine-grained security access policies for accessing component resources through the permission plug-in of the component. If Ranger authentication is not required, the cluster administrator can manually disable Ranger authentication on the service page. After Ranger authentication is disabled, the system continues to perform permission control based on the role model of FusionInsight Manager when accessing component resources. + +In a cluster upgraded from an earlier version, Ranger authentication is not used by default when users access component resources. The cluster administrator can manually enable Ranger authentication after installing the Ranger service. + +.. note:: + + - In a cluster in security mode, the following components support Ranger authentication: HDFS, YARN, Kafka, Hive, HBase, Storm, Impala, and Spark2x. + - In a cluster in non-security mode, Ranger supports permission control on component resources based on OS users. The following components support Ranger authentication: HBase, HDFS, Hive, Spark2x, and YARN. + - After Ranger authentication is enabled, all authentication of the component will be managed by Ranger. The permissions set by the original authentication plug-in will become invalid (the ACL rules of the HDFS and YARN components still take effect). Exercise caution when performing this operation. You are advised to deploy permissions on Ranger in advance. + - After Ranger authentication is disabled, all authentication of the component will be managed by the permission plug-in of the component. The permissions set on Ranger will become invalid. Exercise caution when performing this operation. You are advised to deploy permissions on Manager in advance. + +Enabling Ranger Authentication +------------------------------ + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > **Services**. +#. Click the specified service name on the service management page. +#. On the service details page, expand the **More** drop-down list and select **Enable Ranger**. +#. In the displayed dialog box, enter the password of the current login user and click **OK**. +#. In the service list, restart the service whose configuration has expired. + +Disabling Ranger Authentication +------------------------------- + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > **Services**. +#. Click the specified service name on the service management page. +#. On the service details page, expand the **More** drop-down list and select **Disable Ranger**. +#. In the displayed dialog box, enter the password of the current login user and click **OK**. In the confirmation dialog box, click **OK**. +#. In the service list, restart the service whose configuration has expired. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/overview.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/overview.rst new file mode 100644 index 0000000..659f380 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/overview.rst @@ -0,0 +1,169 @@ +:original_name: admin_guide_000027.html + +.. _admin_guide_000027: + +Overview +======== + +Dashboard +--------- + +Log in to FusionInsight Manager.
Choose **Cluster**, click the name of the desired cluster, and choose **Services**. The service management page is displayed, including the functional area and service list. + +Functional Area +--------------- + +In the functional area of the service management page, you can select a view type and filter and search for services by service type. You can use the advanced search to select required services based on the running status and configuration status. + +Service List +------------ + +The service list on the service management page contains all installed services in the cluster. If the tile view is selected, the services will be displayed in pane style. If you select the list view, the services will be displayed in a table. + +.. note:: + + In this section, the **Tile View** is used by default. + +The service list displays the running status, configuration status, role type, and number of instances of each service. On this page, you can perform some service maintenance tasks, such as starting, stopping, and restarting services. + +.. table:: **Table 1** Service running status + + +-----------------------+----------------------------------------------------------------------+ + | Status | Description | + +=======================+======================================================================+ + | **Normal** | Indicates that the service is running properly. | + +-----------------------+----------------------------------------------------------------------+ + | **Faulty** | Indicates that the service cannot run properly. | + +-----------------------+----------------------------------------------------------------------+ + | **Partially Healthy** | Indicates that some enhanced functions of the service are abnormal. | + +-----------------------+----------------------------------------------------------------------+ + | **Not started** | Indicates that the service is stopped. | + +-----------------------+----------------------------------------------------------------------+ + | **Unknown** | Indicates that the initial status of the service cannot be detected. | + +-----------------------+----------------------------------------------------------------------+ + | **Starting** | Indicates that the service is being started. | + +-----------------------+----------------------------------------------------------------------+ + | **Stopping** | Indicates that the service is being stopped. | + +-----------------------+----------------------------------------------------------------------+ + | **Failed to start** | Indicates that the service fails to be started. | + +-----------------------+----------------------------------------------------------------------+ + | **Failed to stop** | Indicates that the service fails to be stopped. | + +-----------------------+----------------------------------------------------------------------+ + +.. note:: + + - If the running status of a service is **Faulty**, an alarm is generated. Rectify the fault based on the alarm information. + - HBase, Hive, Spark, and Loader may be in the **Subhealthy** state. + + - If Yarn is installed but is abnormal, HBase is in the **Subhealthy** state. If the multi-instance function is enabled, all installed HBase service instances are in the **Subhealthy** state. + - If HBase is installed but is abnormal, Hive, Spark, and Loader are in the **Subhealthy** state. + - If any HBase instance is installed but is abnormal after the multi-instance function is enabled, Loader is in the **Subhealthy** state. 
+ - If an HBase instance is installed but is abnormal after the multi-instance function is enabled, the Hive and Spark instances that map to the HBase instance are in the **Subhealthy** state. That is, if HBase 2 is installed but is abnormal, Hive 2 and Spark2 are in the **Subhealthy** state. + +.. table:: **Table 2** Service configuration status + + +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Description | + +===================+===================================================================================================================================================================================================================================================================================================+ + | **Synchronized** | Indicates that all service parameter settings have taken effect in the cluster. | + +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Expired** | Indicates that the latest configuration is not synchronized and does not take effect after the service parameters are modified. You need to synchronize the configurations and restart the services. You can click |image2| next to **Configuration Status** to view expired configuration items. | + +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Failed** | Indicates that a communication or read/write exception occurs during the parameter configuration synchronization. Use **Synchronize Configuration** to rectify the fault. | + +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Synchronizing** | Indicates that the service parameter configuration is being synchronized. | + +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Unknown** | Indicates that the initial status of the service cannot be detected. | + +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +You can click a service in the service list to perform simple maintenance and management operations on the service, as described in :ref:`Table 3 `. + +.. 
_admin_guide_000027__table17943743105914: + +.. table:: **Table 3** Basic maintenance and management + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Menu Item on the UI | Description | + +===================================+===========================================================================================================================================================================================================================================================================================================================================================================================================+ + | Start Service | Start a specified service in the cluster. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Stop Service | Stop a specified service in the cluster. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Restart Service | Restart a specified service in the cluster. | + | | | + | | .. note:: | + | | | + | | If a service is restarted, other services that depend on this service will be unavailable. Therefore, select **Restart upper-layer services**. Determine whether to perform this operation based on the displayed service list. Services are restarted one by one due to their dependency. :ref:`Table 4 ` describes the restart duration of a single service. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service Rolling Restart | Restart a specified service in the cluster without interrupting services. For details about the parameter settings, see :ref:`Table 1 `. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Synchronize Configuration** | - Enable new configuration parameters for a specified service in the cluster. | + | | - Distribute new configuration parameters for services whose **Configuration Status** is **Expired**. | + | | | + | | .. note:: | + | | | + | | After some services are synchronized, restart the services for the settings to take effect. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _admin_guide_000027__table1143215941919: + +.. table:: **Table 4** Restart duration + + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Restart Duration | Startup Duration | Remarks | + +=================+==================+=============================+=======================================================================================================================================================================================================================================+ + | ClickHouse | 4 min | ClickHouseServer: 2 min | ``-`` | + | | | | | + | | | ClickHouseBalancer: 2 min | | + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | HDFS | 10min+x | NameNode: 4 min + x | *x* indicates the NameNode metadata loading duration. It takes about 2 minutes to load 10,000,000 files. For example, *x* is 10 minutes for 50 million files. The startup duration fluctuates with reporting of DataNode data blocks. | + | | | | | + | | | DataNode: 2 min | | + | | | | | + | | | JournalNode: 2 min | | + | | | | | + | | | Zkfc: 2 min | | + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Yarn | 5 min + x | ResourceManager: 3 min + x | *x* indicates the time required for restoring ResourceManager reserved tasks. It takes about 1 minute to restore 10,000 reserved tasks. 
| + | | | | | + | | | NodeManager: 2 min | | + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | MapReduce | 2 min + x | JobHistoryServer: 2 min + x | *x* indicates the scanning duration of historical tasks. It takes about 2.5 minutes to scan 100,000 tasks. | + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ZooKeeper | 2 min + x | quorumpeer: 2 min + x | *x* indicates the duration for loading znodes. It takes about 1 minute to load 1 million znodes. | + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Hive | 3.5 min | HiveServer: 3 min | ``-`` | + | | | | | + | | | MetaStore: 1 min 30s | | + | | | | | + | | | WebHcat: 1 min | | + | | | | | + | | | Hive service: 3 min | | + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Spark2x | 5 min | JobHistory2x: 5 min | ``-`` | + | | | | | + | | | SparkResource2x: 5 min | | + | | | | | + | | | JDBCServer2x: 5 min | | + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Flink | 4 min | FlinkResource: 1 min | ``-`` | + | | | | | + | | | FlinkServer: 3 min | | + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Kafka | 2 min + x | Broker: 1 min + x | *x* indicates the data restoration duration. It takes about 2 minutes to start 20,000 partitions for a single instance. 
| + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storm | 6 min | Nimbus: 3 min | ``-`` | + | | | | | + | | | UI: 1 min | | + | | | | | + | | | Supervisor: 1 min | | + | | | | | + | | | Logviewer: 1 min | | + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Flume | 3 min | Flume: 2 min | ``-`` | + | | | | | + | | | MonitorServer: 1 min | | + +-----------------+------------------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0263899406.png +.. |image2| image:: /_static/images/en-us_image_0263899406.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/service_configuration/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/service_configuration/index.rst new file mode 100644 index 0000000..1bd8e3d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/service_configuration/index.rst @@ -0,0 +1,16 @@ +:original_name: admin_guide_000034.html + +.. _admin_guide_000034: + +Service Configuration +===================== + +- :ref:`Modifying Service Configuration Parameters ` +- :ref:`Modifying Custom Configuration Parameters of a Service ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + modifying_service_configuration_parameters + modifying_custom_configuration_parameters_of_a_service diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/service_configuration/modifying_custom_configuration_parameters_of_a_service.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/service_configuration/modifying_custom_configuration_parameters_of_a_service.rst new file mode 100644 index 0000000..404e95a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/service_configuration/modifying_custom_configuration_parameters_of_a_service.rst @@ -0,0 +1,64 @@ +:original_name: admin_guide_000036.html + +.. _admin_guide_000036: + +Modifying Custom Configuration Parameters of a Service +====================================================== + +Scenario +-------- + +All open source parameters can be configured for all MRS cluster components. Parameters used in some key application scenarios can be modified on FusionInsight Manager, and some parameters of open source features may not be configured for some component clients. To modify the component parameters that are not directly supported by Manager, cluster administrators can add new parameters for components using the configuration customization function on Manager. Newly added parameters are saved in component configuration files and take effect after restart. 
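+
+For example, the task example later in this section adds the HDFS parameter **ipc.client.rpc.timeout** to the **core-site.xml** file of the Hive service. As a minimal illustrative sketch only (the exact content that Manager writes to the configuration file may differ), such a customized property would typically appear in the component configuration file as follows:
+
+.. code-block::
+
+   <property>
+     <name>ipc.client.rpc.timeout</name>
+     <value>150000</value>
+   </property>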
+ +Impact on the System +-------------------- + +- After configuring properties of a service, you need to restart the service if the service status is **Expired**. The service is unavailable during the restart. +- After the service configuration parameters are modified and then take effect after restart, you need to download and install the client again or download the configuration file to update the client. + +Prerequisites +------------- + +Cluster administrators have fully understood the meanings of the parameters to be added, configuration files to take effect, and the impact on components. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services**. + +#. Click the specified service name on the service management page. + +#. Click **Configuration** and click **All Configurations**. + +#. In the navigation tree on the left, locate a level-1 node and select **Customization**. The system displays the customized parameters of the current component. + + The configuration files that save the newly added custom parameters are displayed in the **Parameter File** column. Different configuration files may have same open source parameters. After the parameters in different files are set to different values, the configuration takes effect depends on the loading sequence of the configuration files by components. You can customize parameters for services and roles as required. Adding customized parameters for a single role instance is not supported. + +#. Locate the row where a specified parameter resides, enter the parameter name supported by the component in the **Name** column and enter the parameter value in the **Value** column. + + You can click **+** or **-** to add or delete a customized parameter. + +#. Click **Save**. In the displayed **Save Configuration** dialog box, confirm the modification and click **OK**. After the system displays "Operation succeeded", click **Finish**. The configuration is saved successfully. + + Restart the expired service or instance for the configuration to take effect. + +Task Example (Configuring Customized Hive Parameters) +----------------------------------------------------- + +Hive depends on HDFS. By default, Hive accesses the HDFS client. The configuration parameters that have taken effect are controlled by HDFS. For example, the HDFS parameter **ipc.client.rpc.timeout** affects the RPC timeout interval for all clients to connect to the HDFS server. Cluster administrators can modify the timeout interval for Hive to connect to HDFS by configuring custom parameters. After this parameter is added to the **core-site.xml** file of Hive, this parameter can be identified by the Hive service and its configuration overwrites the parameter configuration in HDFS. + +#. On FusionInsight Manager, click **Cluster**, click the name of the desired cluster, and click **Services**. + +#. On the displayed page, click **Configuration** and click **All Configurations**. + +#. In the navigation tree on the left, select **Customization** for the Hive service. The system displays the custom service parameters supported by Hive. + +#. In **core-site.xml**, locate the row that contains the **core.site.customized.configs** parameter, enter **ipc.client.rpc.timeout** in the **Name** column, and enter a new value in the **Value** column, for example, 150000. The unit is ms. + +#. Click **Save**. In the displayed **Save Configuration** dialog box, confirm the modification and click **OK**. 
Wait until the message "Operation succeeded" is displayed, and click **Finish**. + + The configuration is saved successfully. + + After the configuration is saved, restart the expired service or instance for the configuration to take effect. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/service_configuration/modifying_service_configuration_parameters.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/service_configuration/modifying_service_configuration_parameters.rst new file mode 100644 index 0000000..1cb6944 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster/managing_a_service/service_configuration/modifying_service_configuration_parameters.rst @@ -0,0 +1,54 @@ +:original_name: admin_guide_000035.html + +.. _admin_guide_000035: + +Modifying Service Configuration Parameters +========================================== + +Scenario +-------- + +To meet actual service requirements, cluster administrators can quickly view and modify default service configurations on FusionInsight Manager. Configure parameters based on the information provided in the configuration description. + +.. note:: + + The parameters of DBService cannot be modified when only one DBService role instance exists in the cluster. + +Impact on the System +-------------------- + +- After configuring properties of a service, you need to restart the service if the service status is **Expired**. The service is unavailable during the restart. +- After the service configuration parameters are modified and then take effect after restart, you need to download and install the client again or download the configuration file to update the client. For example, you can modify configuration parameters of the following services: HBase, HDFS, Hive, Spark, YARN, and MapReduce. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services**. + +#. Click the specified service name on the service management page. + +#. Click **Configuration**. + + The **Basic Configuration** page is displayed by default. To modify more parameters, click the **All Configurations** tab. The navigation tree displays all configuration parameters of the service. The level-1 nodes in the navigation tree are service names or role names. The parameter category is displayed after the level-1 node is expanded. + +#. In the navigation tree, select the specified parameter category and change the parameter values on the right. + + .. note:: + + Select a port parameter value from the value range on the right. Ensure that all parameter values in the same service are within the value range and are unique. Otherwise, the service fails to be started. + + If you are not sure about the location of a parameter, you can enter the parameter name in search box in the upper right corner. The system searches for the parameter in real time and displays the result. + +#. Click **Save**. In the confirmation dialog box, click **OK**. + + Wait until the message "Operation succeeded." is displayed. Click **Finish**. + + The configuration is modified. + + .. note:: + + - To update the queue configuration of the YARN service without restarting service, choose **More** > **Refresh Queue** to update the queue for the configuration to take effect. + - During configuration of the **flume.config.file** parameter, you can upload and download files. 
After a configuration file is uploaded, the old file will be overwritten. If the configuration is not saved and the service is restarted, the configuration does not take effect. Save the configuration in time. + - If you need to restart the service for the configuration to take effect after modifying service configuration parameters, choose **More** > **Restart Service** in the upper right corner of the service page. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/assigning_user_permissions_after_cross-cluster_mutual_trust_is_configured.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/assigning_user_permissions_after_cross-cluster_mutual_trust_is_configured.rst new file mode 100644 index 0000000..ceed687 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/assigning_user_permissions_after_cross-cluster_mutual_trust_is_configured.rst @@ -0,0 +1,40 @@ +:original_name: admin_guide_000178.html + +.. _admin_guide_000178: + +Assigning User Permissions After Cross-Cluster Mutual Trust Is Configured +========================================================================= + +Scenario +-------- + +After cross-Manager cluster mutual trust is configured, assign user access permissions on FusionInsight Managers so that these users can perform service operations in the mutually trusted Managers. + +Prerequisites +------------- + +The mutual trust between the two Managers has been configured. + +Procedure +--------- + +#. Log in to the local FusionInsight Manager. + +#. .. _admin_guide_000178__en-us_topic_0046737084_chk_user: + + Choose **System** > **Permission** > **User** to check whether the target user exists. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +#. .. _admin_guide_000178__en-us_topic_0046737084_mod_user: + + Click |image1| on the left of the target user, and check whether the permissions assigned to the user group of the user and the roles meet service requirements. If not, create a role and bind the role to the user by referring to :ref:`Configuring Permissions `, or modify the user group or role permissions of the user. + +#. .. _admin_guide_000178__en-us_topic_0046737084_add_user: + + Create a user required by the service operations and associate the required user group or role. For details, see :ref:`Creating a User `. + +#. Log in to the other FusionInsight Manager and repeat :ref:`2 ` to :ref:`4 ` to create a user with the same name and set permissions. + +.. |image1| image:: /_static/images/en-us_image_0263899656.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/changing_managers_domain_name.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/changing_managers_domain_name.rst new file mode 100644 index 0000000..fa863f2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/changing_managers_domain_name.rst @@ -0,0 +1,143 @@ +:original_name: admin_guide_000176.html + +.. _admin_guide_000176: + +Changing Manager's Domain Name +============================== + +Scenario +-------- + +The secure usage scope of users in each system is called a domain. Each system must have a unique domain name. 
The domain name of FusionInsight Manager is generated during installation. The system administrator can change the domain name on FusionInsight Manager. + +.. important:: + + - Changing the system domain name is a high-risk operation. Before performing operations in this section, ensure that the OMS data has been backed up by referring to :ref:`Backing Up Manager Data `. + +Impact on the System +-------------------- + +- During the configuration, all of the clusters need to be restarted and are unavailable during restart. + +- After the domain name is changed, the passwords of the Kerberos administrator and OMS Kerberos administrator will be initialized. You need to use the default passwords and then change the passwords. If a component user whose password is generated randomly by the system is used for identity authentication, see :ref:`Exporting an Authentication Credential File ` to download the keytab file again. + +- After the domain name is changed, passwords of the **admin** user, component user, and human-machine user added by the system administrator before the domain name change will be reset to the same one. Change these passwords. The reset password consists of two parts: one part is generated by the system and the other is set by the user. The system generating part is **Admin@123**, which is the default password. For details about the user-defined part, see descriptions of **Password Suffix** in :ref:`Table 2 `. For example, if the system generates **Admin@123** and the user sets **Test#$%@123**, the new password after reset is **Admin@123Test#$%@123**. + +- The new password must meet the password policies. To obtain the new human-machine user password, log in to the active OMS as user **omm** and run the following script: + + **sh ${BIGDATA_HOME}/om-server/om/sbin/get_reset_pwd.sh** *Password suffix* *user_name* + + - *Password suffix* is a parameter set by the user. If it is not specified, the default value **Admin@123** is used. + - *user_name* is optional. The default value is **admin**. + + Example: + + **sh ${BIGDATA_HOME}/om-server/om/sbin/get_reset_pwd.sh Test#$%@123** + + .. code-block:: + + To get the reset password after changing cluster domain name. + pwd_min_len : 8 + pwd_char_types : 4 + The password reset after changing cluster domain name is: "Admin@123Test#$%@123" + + In this example, **pwd_min_len** and **pwd_char_types** indicate the minimum password length and number of password character types respectively defined in the password policies. **Admin@123Test#$%@123** indicates the human-machine user password after the system domain name is changed. + +- After the system domain name is changed, the reset password consists of two parts: one part is generated by the system and the other is set by the user. The reset password must meet the password policies. If the password is not long enough, one or multiple at signs (@) are added between **Admin@123** and the user-defined part. If there are five character types, a space is added after **Admin@123**. + + When the user-defined part is **Test@123** and the default user password policy is used, the new password is **Admin@123Test@123**. The password contains 17 characters of four types. To meet the current password policy, the new password is processed according to :ref:`Table 1 `. + + .. _admin_guide_000176__table172285275013: + + .. 
table:: **Table 1** Password processing + + +-------------------------+---------------------------+----------------------------------------+----------------------+ + | Minimum Password Length | Number of Character Types | Processing Against the Password Policy | New Password | + +=========================+===========================+========================================+======================+ + | 8 to 17 characters | 4 | The user password policy is met. | Admin@123Test@123 | + +-------------------------+---------------------------+----------------------------------------+----------------------+ + | 18 characters | 4 | Add an at sign (@). | Admin@123@Test@123 | + +-------------------------+---------------------------+----------------------------------------+----------------------+ + | 19 characters | 4 | Add two at signs (@). | Admin@123@@Test@123 | + +-------------------------+---------------------------+----------------------------------------+----------------------+ + | 8 to 18 characters | 5 | Add a space. | Admin@123 Test@123 | + +-------------------------+---------------------------+----------------------------------------+----------------------+ + | 19 characters | 5 | Add a space and an at sign (@). | Admin@123 @Test@123 | + +-------------------------+---------------------------+----------------------------------------+----------------------+ + | 20 characters | 5 | Add a space and two at signs (@). | Admin@123 @@Test@123 | + +-------------------------+---------------------------+----------------------------------------+----------------------+ + +- After the system domain name is changed, download the **keytab** file for the machine-machine user added by the system administrator before the domain name is changed. + +- After the system domain name is changed, download and install the client again. + +Prerequisites +------------- + +- The system administrator has clarified service requirements and planned domain names for the systems. + + A domain name can contain only uppercase letters, numbers, periods (.), and underscores (_), and must start with a letter or number. + +- The running status of all components in the Manager clusters is **Normal**. + +- The **acl.compare.shortName** parameter of the ZooKeeper service of all clusters in Manager is set to default value **true**. Otherwise, change the value to **true** and restart the ZooKeeper service. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **System** > **Permission** > **Domain and Mutual Trust**. + +#. Modify required parameters. + + .. _admin_guide_000176__en-us_topic_0046737082_table66281275: + + .. table:: **Table 2** Related parameters + + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=======================================================================================================================================================================================================+ + | Local Domain | Planned domain name of the system. 
| + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Password Suffix | Part of the password set by the user after the password of the human-machine user is reset. This parameter is mandatory. The default value is **Admin@123**. | + | | | + | | .. note:: | + | | | + | | This parameter takes effect only after **Local Domain** is modified. The following conditions must be met: | + | | | + | | - The password ranges from 8 to 16 characters. | + | | - The password must contain at least three types of the following: uppercase letters, lowercase letters, numbers, and special characters (:literal:`\`~!@#$%^&*(`)-_=+|[{}];:',<.>/? and spaces). | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**. Proceed with the subsequent steps only after the modification is complete. + +#. Log in to the active management node as user **omm**. + +#. Run the following command to update the domain configuration: + + **sh ${BIGDATA_HOME}/om-server/om/sbin/restart-RealmConfig.sh** + + The command is executed successfully if the following information is displayed: + + .. code-block:: + + Modify realm successfully. Use the new password to log into FusionInsight again. + + .. note:: + + After the restart, some hosts and services cannot be accessed and an alarm is generated. This problem can be automatically resolved in about 1 minute after **restart-RealmConfig.sh** is run. + +7. Log in to FusionInsight Manager using the new password of user **admin** (for example, **Admin@123Admin@123**). On the dashboard, click |image1| next to the name of the target cluster and select **Restart**. + + In the displayed dialog box, enter the password of the current login user and click **OK**. + + In the displayed dialog box, click **OK**. Wait for a while until a message indicating that the operation is successful is displayed. Click **Finish**. + +8. Log out of FusionInsight Manager and then log in again. If the login is successful, the configuration is successful. + +9. Log in to the active management node as user **omm** and run the following command to update the configurations of the job submission client: + + **sh /opt/executor/bin/refresh-client-config.sh** + +.. |image1| image:: /_static/images/en-us_image_0267694670.jpg diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/configuring_cross-manager_mutual_trust_between_clusters.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/configuring_cross-manager_mutual_trust_between_clusters.rst new file mode 100644 index 0000000..ada3620 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/configuring_cross-manager_mutual_trust_between_clusters.rst @@ -0,0 +1,100 @@ +:original_name: admin_guide_000177.html + +.. 
_admin_guide_000177: + +Configuring Cross-Manager Mutual Trust Between Clusters +======================================================= + +Scenario +-------- + +When two security-mode clusters managed by different FusionInsight Managers need to access each other's resources, the system administrator can configure cross-Manager mutual trust for them. + +The secure usage scope of users in each system is called a domain. Each FusionInsight Manager must have a unique domain name. Cross-Manager access allows users to use resources across domains. + +.. note:: + + A maximum of 500 mutually trusted clusters can be configured. + +Impact on the System +-------------------- + +- After cross-Manager cluster mutual trust is configured, users of an external system can be used in the local system. The system administrator needs to periodically check the user permissions in Manager based on enterprise service and security requirements. +- When cross-Manager cluster mutual trust is configured, all clusters need to be stopped, causing service interruptions. +- After cross-Manager cluster mutual trust is configured, internal Kerberos users **krbtgt/**\ *Local cluster domain name*\ **@**\ *External cluster domain name* and **krbtgt/**\ *External cluster domain name*\ **@**\ *Local cluster domain name* are added to the two mutually trusted clusters. The internal users cannot be deleted. The system administrator needs to change the passwords periodically based on enterprise service and security requirements. The passwords of these four users in the two systems must be the same. For details, see :ref:`Changing the Password for a Component Running User `. When the passwords are changed, the connectivity between cross-cluster service applications may be affected. +- After cross-Manager cluster mutual trust is configured, the clients of each cluster need to be downloaded and installed again. +- After cross-Manager cluster mutual trust is configured, you need to check whether the system works properly and how to access resources of the peer system as a user of the local system. For details, see :ref:`Assigning User Permissions After Cross-Cluster Mutual Trust Is Configured `. + +Prerequisites +------------- + +- The system administrator has clarified service requirements and planned domain names for the systems. A domain name can contain only uppercase letters, numbers, periods (.), and underscores (_), and must start with a letter or number. +- The domain names of the two Managers are different. When an ECS or BMS cluster is created on MRS, a unique system domain name is randomly generated. Generally, you do not need to change the system domain name. +- The two clusters do not have the same host name or the same IP address. +- The system time of the two clusters is consistent, and the NTP services in the two systems use the same clock source. +- The running status of all components in the Manager clusters is **Normal**. +- The **acl.compare.shortName** parameter of the ZooKeeper service of all clusters in Manager is set to default value **true**. Otherwise, change the value to **true** and restart the ZooKeeper service. + +Procedure +--------- + +#. Log in to one FusionInsight Manager. + +#. Stop all clusters on the home page. + + Click |image1| next to the target cluster and select **Stop**. Enter the password of the cluster administrator. In the **Stop Cluster** dialog box that is displayed, click **OK**. Wait until the cluster is stopped. + +#. Choose **System** > **Permission** > **Domain and Mutual Trust**. 
+ +#. Modify **Peer Mutual Trust Domain**. + + .. table:: **Table 1** Related parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===============================================================================================================================================================================================================================================================================================================+ + | realm_name | Enter the domain name of the peer system. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ip_port | Enter the KDC address of the peer system. | + | | | + | | Value format: *IP address of the node accommodating the Kerberos service in the peer system:Port number* | + | | | + | | - In dual-plane networking, enter the service plane IP address. | + | | | + | | - If an IPv6 address is used, the IP address must be enclosed in square brackets ([]). | + | | | + | | - Use commas (,) to separate the KDC addresses if the active and standby Kerberos services are deployed or multiple clusters in the peer system need to establish mutual trust with the local system. | + | | | + | | - You can obtain the port number from the **kdc_ports** parameter of the KrbServer service. The default value is **21732**. To obtain the IP address of the node where the service is deployed, click the **Instance** tab on the KrbServer page and view **Service IP Address** of the KerberosServer role. | + | | | + | | For example, if the Kerberos service is deployed on nodes at **10.0.0.1** and **10.0.0.2** that have established mutual trust with the local system, the parameter value is **10.0.0.1:21732,10.0.0.2:21732**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + If you need to configure mutual trust for multiple Managers, click |image2| to add a new item and set parameters. A maximum of 16 systems can be mutually trusted. Click |image3| to delete unnecessary configurations. + +#. Click **OK**. + +#. Log in to the active management node as user **omm**, and run the following command to update the domain configuration: + + **sh ${BIGDATA_HOME}/om-server/om/sbin/restart-RealmConfig.sh** + + The command is executed successfully if the following information is displayed: + + .. code-block:: + + Modify realm successfully. Use the new password to log into FusionInsight again. + + After the restart, some hosts and services cannot be accessed and an alarm is generated. This problem can be automatically resolved in about 1 minute after **restart-RealmConfig.sh** is run. + +#. 
Log in to FusionInsight Manager and start all clusters. + + Click |image4| next to the name of the target cluster and select **Start**. In the displayed **Start Cluster** dialog box, click **OK**. Wait until the cluster is started. + +#. Log in to the other FusionInsight Manager and repeat the preceding operations. + +.. |image1| image:: /_static/images/en-us_image_0278119935.png +.. |image2| image:: /_static/images/en-us_image_0263899403.png +.. |image3| image:: /_static/images/en-us_image_0263899531.png +.. |image4| image:: /_static/images/en-us_image_0263899495.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/index.rst new file mode 100644 index 0000000..938489f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/index.rst @@ -0,0 +1,20 @@ +:original_name: admin_guide_000174.html + +.. _admin_guide_000174: + +Cluster Mutual Trust Management +=============================== + +- :ref:`Overview of Mutual Trust Between Clusters ` +- :ref:`Changing Manager's Domain Name ` +- :ref:`Configuring Cross-Manager Mutual Trust Between Clusters ` +- :ref:`Assigning User Permissions After Cross-Cluster Mutual Trust Is Configured ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview_of_mutual_trust_between_clusters + changing_managers_domain_name + configuring_cross-manager_mutual_trust_between_clusters + assigning_user_permissions_after_cross-cluster_mutual_trust_is_configured diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/overview_of_mutual_trust_between_clusters.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/overview_of_mutual_trust_between_clusters.rst new file mode 100644 index 0000000..9fc4d16 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/cluster_mutual_trust_management/overview_of_mutual_trust_between_clusters.rst @@ -0,0 +1,30 @@ +:original_name: admin_guide_000175.html + +.. _admin_guide_000175: + +Overview of Mutual Trust Between Clusters +========================================= + +Function Description +-------------------- + +By default, users of a big data cluster in security mode can only access resources in the cluster but cannot perform identity authentication or access resources in other clusters in security mode. + +Feature Description +------------------- + +- **Domain** + + The secure usage scope of users in each system is called a domain. Each FusionInsight Manager must have a unique domain name. Cross-Manager access allows users to use resources across domains. + +- **User Encryption** + + Mutual trust can be configured across FusionInsight Managers. The current Kerberos server supports only the aes256-cts-hmac-sha1-96:normal and aes128-cts-hmac-sha1-96:normal encryption types for encrypting cross-domain users, and the encryption types cannot be changed. + +- **User Authentication** + + After cross-Manager mutual trust is configured, if a user with the same name exists in two systems and the user in the peer system has the permission to access a resource in that system, this user can also access the remote resource. 
+ +- **Direct Mutual Trust** + + When mutual trust is configured between two clusters, each system saves the mutual trust ticket of the peer system and uses this ticket to access the peer system. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/index.rst new file mode 100644 index 0000000..2fd0faf --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000170.html + +.. _admin_guide_000170: + +Configuring Client +================== + +- :ref:`Installing a Client ` +- :ref:`Using a Client ` +- :ref:`Updating the Configuration of an Installed Client ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + installing_a_client + using_a_client + updating_the_configuration_of_an_installed_client diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/installing_a_client.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/installing_a_client.rst new file mode 100644 index 0000000..60f5e1e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/installing_a_client.rst @@ -0,0 +1,231 @@ +:original_name: admin_guide_000171.html + +.. _admin_guide_000171: + +Installing a Client +=================== + +Scenario +-------- + +This section describes how to install the clients of all services, except Flume, in the MRS cluster. MRS provides shell scripts for different services so that maintenance personnel can log in to related maintenance clients and implement maintenance operations. + +.. note:: + + - Reinstall the client after server configuration is modified on FusionInsight Manager or after the system is upgraded. Otherwise, the versions of the client and server will be inconsistent. + +Prerequisites +------------- + +- An installation directory will be automatically created if it does not exist. If the directory exists, it must be empty. The directory path cannot contain spaces. +- If a server outside the cluster is used as the client node, the node must be able to communicate with the cluster service plane. Otherwise, client installation will fail. +- The NTP service must be enabled on the client node, and its time must be synchronized with the NTP server. Otherwise, client installation will fail. +- If clients of all components are downloaded, HDFS and MapReduce are installed in the same directory (*Client directory*\ **/HDFS/**). +- You can install and use the client as any user whose username and password have been obtained from the system administrator. This section uses **user_client** as an example. Ensure that user **user_client** is the owner of the server file directory (for example, **/opt/Bigdata/hadoopclient**) and client installation directory (for example, **/opt/client**). The permission on the two directories must be **755**. +- You have obtained the component service username (a default user or new user) and password from the system administrator. +- When you install the client as a user other than **omm** or **root**, and the **/var/tmp/patch** directory already exists, you have changed the permission for the directory to **777** and changed the permission for the logs in the directory to **666**. + +Procedure +--------- + +#. 
Obtain the required software packages. + + Log in to FusionInsight Manager. Click the wanted cluster from the **Cluster** drop-down list. + + Click **More** and select **Download Client**. The **Download Cluster Client** page is displayed. + + .. note:: + + If only one component client is to be installed, choose **Cluster**, click the name of the target cluster, choose **Services**, click a service name, click **More**, and select **Download Client**. The **Download Client** page is displayed. + +#. Set **Select Client Type** to **Complete Client**. + + **Configuration Files Only** is used to download only the client configuration files in the following scenario: After a complete client is downloaded and installed and the system administrator modifies server configurations on Manager, developers need to update the configuration files during application development. + + The platform type can be set to **x86_64** or **aarch64**. + + - **x86_64**: indicates the client software package that can be deployed on the x86 servers. + - **aarch64**: indicates the client software package that can be deployed on the TaiShan servers. + + .. note:: + + The cluster supports two types of clients: **x86_64** and **aarch64**. The client type must match the architecture of the node for installing the client. Otherwise, client installation will fail. + +#. Determine whether to generate a client file on the cluster node. + + - If yes, select **Save to Path**, and click **OK** to generate the client file. By default, the client file is generated in **/tmp/FusionInsight-Client** on the active management node. You can also store the client file in other directories, and user **omm** must have the read, write, and execute permissions on the directories. Copy the software package to the file directory, for example, **/opt/Bigdata/hadoopclient**, on the server where the client is to be installed as user **omm** or **root**. Then, go to :ref:`5 `. + + .. note:: + + If you cannot obtain the permissions of user **root**, use user **omm**. + + - If no, click **OK** and specify a local save path to download the complete client. Wait until the download is complete and go to :ref:`4 `. + +#. .. _admin_guide_000171__en-us_topic_0193213980_li4528442311580: + + Upload the software package. + + Use WinSCP to upload the obtained software package as the user (such as **user_client**) who prepares for the installation, to the directory (such as **/opt/Bigdata/hadoopclient**) of the server where the client is to be installed. + + The name of the client software package is in the following format: **FusionInsight_Cluster\_\ <**\ *Cluster ID*\ **>\ \_Services_Client.tar**. + + The following steps and sections use **FusionInsight_Cluster_1_Services_Client.tar** as an example. + + .. note:: + + The host where the client is to be installed can be a node inside or outside the cluster. If the node is a server outside the cluster, it must be able to communicate with the cluster, and the NTP service must be enabled to ensure that the time is the same as that on the server. + + For example, you can configure the same NTP clock source for external servers as that of the cluster. After the configuration, you can run the **ntpq -np** command to check whether the time is synchronized. + + - If there is an asterisk (*) before the IP address of the NTP clock source in the command output, the synchronization is normal. For example: + + .. 
code-block:: + + remote refid st t when poll reach delay offset jitter + ============================================================================== + *10.10.10.162 .LOCL. 1 u 1 16 377 0.270 -1.562 0.014 + + - If there is no asterisk (*) before the IP address of the NTP clock source and the value of **refid** is **.INIT.**, or if the command output is abnormal, the synchronization is abnormal. Contact technical support. + + .. code-block:: + + remote refid st t when poll reach delay offset jitter + ============================================================================== + 10.10.10.162 .INIT. 1 u 1 16 377 0.270 -1.562 0.014 + + You can also configure the same chrony clock source for external servers as that for the cluster. After the configuration, run the **chronyc sources** command to check whether the time is synchronized. + + - In the command output, if there is an asterisk (*) before the IP address of the chrony service on the active OMS node, the synchronization is normal. For example: + + .. code-block:: + + MS Name/IP address Stratum Poll Reach LastRx Last sample + =============================================================================== + ^* 10.10.10.162 10 10 377 626 +16us[ +15us] +/- 308us + + - In the command output, if there is no asterisk (*) before the IP address of the NTP service on the active OMS node, and the value of **Reach** is **0**, the synchronization is abnormal. + + .. code-block:: + + MS Name/IP address Stratum Poll Reach LastRx Last sample + =============================================================================== + ^? 10.1.1.1 0 10 0 - +0ns[ +0ns] +/- 0ns + +#. .. _admin_guide_000171__en-us_topic_0193213980_en-us_topic_0046662333_u_login: + + Log in as user **user_client** to the server where the client is to be installed. + +#. Decompress the software package. + + Go to the directory where the installation package is stored, for example, **/opt/Bigdata/hadoopclient**. Run the following command to decompress the installation package to a local directory: + + **tar -xvf** **FusionInsight_Cluster_1_Services_Client.tar** + +#. Verify the software package. + + Run the following command to verify the decompressed file and check whether the command output is consistent with the information in the **sha256** file: + + **sha256sum -c** **FusionInsight_Cluster_1_Services_ClientConfig.tar.sha256** + + .. code-block:: + + FusionInsight_Cluster_1_Services_ClientConfig.tar: OK + +#. Decompress the obtained installation file. + + **tar -xvf** **FusionInsight_Cluster_1_Services_ClientConfig.tar** + +#. Configure network connections for the client. + + a. Ensure that the host where the client is installed can communicate with the hosts listed in the **hosts** file in the decompression directory (for example, **/opt/Bigdata/hadoopclient/FusionInsight_Cluster\_**\ **\ **\_Services_ClientConfig/hosts**). + b. If the host where the client is installed is not a host in the cluster, you need to set the mapping between the host name and the service plane IP address for each cluster node in **/etc/hosts**, as user **root**. Each host name uniquely maps an IP address. You can perform the following steps to import the domain name mapping of the cluster to the **hosts** file: + + #. Switch to user **root** or a user who has the permission to modify the **hosts** file. + + **su - root** + + #. Go to the directory where the client package is decompressed. + + **cd /opt/Bigdata/hadoopclient/FusionInsight\_Cluster_1_Services_ClientConfig** + + #. 
Run the **cat realm.ini >> /etc/hosts** command to import the domain name mapping to the **hosts** file. + + .. note:: + + - If the host where the client is installed is not a node in the cluster, configure network connections for the client to prevent errors when you run commands on the client. + - If Spark tasks are executed in yarn-client mode, add the **spark.driver.host** parameter to the file *Client installation directory*\ **/Spark/spark/conf/spark-defaults.conf** and set the parameter to the client IP address. + - If the yarn-client mode is used, you need to configure the mapping between the IP address and host name of the client in the **hosts** file on the active and standby Yarn nodes (ResourceManager nodes in the cluster) to make sure that the Spark web UI is properly displayed. + +#. Go to the directory where the installation package is stored, and run the following command to install the client to a specified directory (an absolute path), for example, **/opt/client**: + + **cd /opt/Bigdata/hadoopclient/FusionInsight\_Cluster_1_Services_ClientConfig** + + Run the **./install.sh /opt/client** command to install the client. The client is successfully installed if information similar to the following is displayed: + + .. code-block:: + + The component client is installed successfully + + .. note:: + + - If the **/opt/hadoopclient** directory has been used by existing service clients, you need to use another directory in this step when installing other service clients. + - You must delete the client installation directory when uninstalling a client. + - To ensure that an installed client can only be used by the installation user (for example, **user_client**), add parameter **-o** during the installation. That is, run the **./install.sh /opt/hadoopclient -o** command to install the client. + - If the NTP server is to be installed in **chrony** mode, ensure that the parameter **chrony** is added during the installation, that is, run the **./install.sh /opt/client -o** **chrony** command to install the client. + - If an HBase client is installed, it is recommended that the client installation directory contain only uppercase and lowercase letters, digits, and special characters ``(_-?.@+=)`` due to the limitation of the Ruby syntax used by HBase. + - If the client node is a server outside the cluster and cannot communicate with the service plane IP address of the active OMS node or cannot access port 20029 of the active OMS node, the client can be successfully installed but cannot be registered with the cluster or displayed on the UI. + +#. Log in to the client to check whether the client is successfully installed. + + a. Run the **cd /opt/client** command to go to the client installation directory. + + b. Run the **source bigdata_env** command to configure environment variables for the client. + + c. For a cluster in security mode, run the following command to set **kinit** authentication and enter the password for logging in to the client. For a cluster in normal mode, user authentication is not required. + + **kinit admin** + + .. code-block:: + + Password for xxx@HADOOP.COM: #Enter the login password of user admin (same as the user password for logging in to the cluster). + + d. Run the **klist** command to query and confirm authentication details. + + .. code-block:: + + Ticket cache: FILE:/tmp/krb5cc_0 + Default principal: xxx@HADOOP.COM + + Valid starting Expires Service principal + 04/09/2021 18:22:35 04/10/2021 18:22:29 krbtgt/HADOOP.COM@HADOOP.COM + + .. 
note:: + + - When kinit authentication is used, the ticket is stored in the **/tmp/krb5cc\_**\ *uid* file by default. + + *uid* indicates the ID of the user who logs in to the OS. For example, if the UID of user **root** is 0, the ticket generated for kinit authentication after user **root** logs in to the system is stored in the **/tmp/krb5cc_0** file. + + If the current user does not have the read/write permission for the **/tmp** directory, the ticket cache path is changed to **Client installation directory/tmp/krb5cc_uid**. For example, if the client installation directory is **/opt/hadoopclient**, the kinit authentication ticket is stored in **/opt/hadoopclient/tmp/krb5cc_uid**. + + - If the same user is used to log in to the OS for kinit authentication, there is a risk that tickets are overwritten. You can set the **-c** *cache_name* parameter to specify the ticket cache path or set the **KRB5CCNAME** environment variable to avoid this problem. + +#. After the cluster is reinstalled, the previously installed client is no longer available. Perform the following operations to deploy the client again: + + a. Log in to the node where the client is deployed as user **root**. + + b. Run the following command to view the directory where the client is located (in the following example, **/opt/hadoopclient** is the directory where the client is located): + + **ll /opt** + + .. code-block:: + + drwxr-x---. 6 root root 4096 Dec 11 19:00 hadoopclient + drwxr-xr-x. 3 root root 4096 Dec 9 02:04 godi + drwx------. 2 root root 16384 Nov 6 01:03 lost+found + drwxr-xr-x. 2 root root 4096 Nov 7 09:49 rh + + c. Run the following command to move the folder (for example, **/opt/client**) where all client programs are located to a backup directory: + + **mv /opt/client** */tmp/clientbackup* + + d. Reinstall the client. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/updating_the_configuration_of_an_installed_client.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/updating_the_configuration_of_an_installed_client.rst new file mode 100644 index 0000000..f367b1d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/updating_the_configuration_of_an_installed_client.rst @@ -0,0 +1,92 @@ +:original_name: admin_guide_000173.html + +.. _admin_guide_000173: + +Updating the Configuration of an Installed Client +================================================= + +Scenario +-------- + +The cluster provides a client for you to connect to a server, view task results, or manage data. If you modify service configuration parameters on FusionInsight Manager and restart the service, you need to download and install the client again or use the configuration files to update the installed client. + +Prerequisites +------------- + +You have installed a client. + +Procedure +--------- + +**Method 1**: + +#. Log in to FusionInsight Manager. Click the wanted cluster from the **Cluster** drop-down list. + +#. Click **More** and select **Download Client**. In the **Download Cluster Client** dialog box, select **Configuration Files Only**. + + The generated compressed file contains the configuration files of all services. + +#. Determine whether to generate a configuration file on the cluster node. + + - If yes, select **Save to Path**, and click **OK** to generate the client file. 
By default, the client file is generated in **/tmp/FusionInsight-Client** on the active management node. You can also store the client file in other directories, and user **omm** must have the read, write, and execute permissions on the directories. Then, go to :ref:`4 `. + - If no, click **OK** and specify a local save path to download the client configuration file package. Wait until the download is complete, and go to :ref:`4 `. + +#. .. _admin_guide_000173__en-us_topic_0193213946_l6af983f03121493ca3526296f5b650c3: + + Use WinSCP to save the compressed file to the client installation directory (for example, **/opt/hadoopclient**) as the client installation user. + +#. Decompress the software package. + + Run the following commands to go to the directory where the client is installed, and decompress the file to a local directory. For example, the downloaded client file is **FusionInsight_Cluster_1_Services_Client.tar**. + + **cd /opt/hadoopclient** + + **tar -xvf** **FusionInsight_Cluster_1\_Services_Client.tar** + +#. Verify the software package. + + Run the following command to verify the decompressed file and check whether the command output is consistent with the information in the **sha256** file: + + **sha256sum -c** **FusionInsight\_\ Cluster_1\_\ Services_ClientConfig_ConfigFiles.tar.sha256** + + .. code-block:: + + FusionInsight_Cluster_1_Services_ClientConfig_ConfigFiles.tar: OK + +#. Decompress the package to obtain the configuration file. + + **tar -xvf FusionInsight\_\ Cluster_1\_\ Services_ClientConfig_ConfigFiles.tar** + +#. Run the following command in the client installation directory to update the client using the configuration file: + + **sh refreshConfig.sh** *Client installation directory* *Directory where the configuration file is located* + + For example, run the following command: + + **sh refreshConfig.sh** **/opt/hadoopclient /opt/hadoop\ client/FusionInsight\_Cluster_1_Services_ClientConfig\_ConfigFiles** + + If the following information is displayed, the configurations have been updated successfully: + + .. code-block:: + + Succeed to refresh components client config. + +**Method 2**: + +#. Log in to the node where the client is installed as user **root**. + +#. Go to the client installation directory, for example, **/opt/client**, and run the following commands to update the configuration file: + + **cd /opt/client** + + **sh autoRefreshConfig.sh** + +#. Enter the username and password of the FusionInsight Manager administrator and the floating IP address of FusionInsight Manager. + +#. Enter the names of the components whose configurations need to be updated. Use commas (,) to separate the component names. To update the configurations of all components, press **Enter** without entering any component name. + + If the following information is displayed, the configurations have been updated successfully: + + .. code-block:: + + Succeed to refresh components client config. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/using_a_client.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/using_a_client.rst new file mode 100644 index 0000000..9a586a4 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_client/using_a_client.rst @@ -0,0 +1,43 @@ +:original_name: admin_guide_000172.html + +.. 
_admin_guide_000172: + +Using a Client +============== + +Scenario +-------- + +After the client is installed, you can use the shell command on the client in O&M or service scenarios, or use the sample project on the client during application development. + +This section describes how to use the client in O&M scenario or service scenarios. + +Prerequisites +------------- + +- You have installed the client. + + For example, the installation directory is **/opt/client**. + +- Service users of each component have been created by the system administrator based on service requirements. + + A machine-machine user needs to download the **keytab** file and a human-machine user needs to change the password upon the first login. + +Procedure +--------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to switch to the client installation directory: + + **cd /opt/client** + +#. Run the following command to set environment variables: + + **source bigdata_env** + +#. If the cluster is in security mode, authenticate the user. For a normal cluster, user authentication is not required. + + **kinit** *Component service user* + +#. Run the **shell** command as required. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_scheduled_backup_of_alarm_and_audit_information.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_scheduled_backup_of_alarm_and_audit_information.rst new file mode 100644 index 0000000..f9f8b6c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/configuring_scheduled_backup_of_alarm_and_audit_information.rst @@ -0,0 +1,57 @@ +:original_name: admin_guide_000182.html + +.. _admin_guide_000182: + +Configuring Scheduled Backup of Alarm and Audit Information +=========================================================== + +Scenario +-------- + +You can modify the configuration file to periodically back up FusionInsight Manager alarm information, FusionInsight Manager audit information, and audit information of all services to the specified storage location. + +The backup can be performed using FTP or SFTP. FTP does not encrypt data, which may cause security risks. Therefore, SFTP is recommended. + +Procedure +--------- + +#. Log in to the active management node as user **omm**. + + .. note:: + + Perform this operation only on the active management node. Scheduled backup is not supported on the standby management node. + +#. Run the following command to switch the directory: + + **cd ${BIGDATA_HOME}/om-server/om/sbin** + +#. Run the following command to configure scheduled backup of FusionInsight Manager's alarm and audit information or service audit information: + + **./setNorthBound.sh -t** *Information type* **-i** *Remote server IP address* **-p** *SFTP or FTP port used by the server*\ **-u** *Username* **-d** *Save path* **-c** *Interval (minutes)* **-m** *Number of records in each file* **-s** *Whether to enable backup* **-e** *Protocol* + + Example: + + **./setNorthBound.sh -t alarm -i 10.0.0.10 -p 22** **-u sftpuser -d /tmp/ -c 10 -m 100 -s true -e sftp** + + This script modifies the alarm backup configuration file **alarm_collect_upload.properties**. The file save path is **${BIGDATA_HOME}/om-server/tomcat/webapps/web/WEB-INF/classes/config**. 
+ + **./setNorthBound.sh -t audit -i 10.0.0.10 -p 22 -u sftpuser -d /tmp/ -c 10** **-m 100** **-s true -e sftp** + + This script modifies the audit backup configuration file **audit_collect_upload.properties**. The file save path is **${BIGDATA_HOME}/om-server/tomcat/webapps/web/WEB-INF/classes/config**. + + **./setNorthBound.sh -t service_audit -i 10.0.0.10 -p 22 -u sftpuser -d /tmp/ -c 10** **-m 100** **-s true -e sftp** + + This script modifies the service audit backup configuration file **service_audit_collect_upload.properties**. The file save path is **${BIGDATA_HOME}/om-server/tomcat/webapps/web/WEB-INF/classes/config**. + +#. Enter the password as prompted. The password is encrypted and saved in the configuration file. + + .. code-block:: + + Please input sftp/ftp server password: + +#. Check the configuration result. If the following information is displayed, the configuration is successful. The configuration file will be automatically synchronized to the standby management node. + + .. code-block:: + + execute command syncfile successfully. + Config Succeed. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/index.rst new file mode 100644 index 0000000..9f2f336 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/index.rst @@ -0,0 +1,24 @@ +:original_name: admin_guide_000166.html + +.. _admin_guide_000166: + +Cluster Management +================== + +- :ref:`Configuring Client ` +- :ref:`Cluster Mutual Trust Management ` +- :ref:`Configuring Scheduled Backup of Alarm and Audit Information ` +- :ref:`Modifying the FusionInsight Manager Routing Table ` +- :ref:`Switching to the Maintenance Mode ` +- :ref:`Routine Maintenance ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + configuring_client/index + cluster_mutual_trust_management/index + configuring_scheduled_backup_of_alarm_and_audit_information + modifying_the_fusioninsight_manager_routing_table + switching_to_the_maintenance_mode + routine_maintenance diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/modifying_the_fusioninsight_manager_routing_table.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/modifying_the_fusioninsight_manager_routing_table.rst new file mode 100644 index 0000000..e7bc258 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/modifying_the_fusioninsight_manager_routing_table.rst @@ -0,0 +1,143 @@ +:original_name: admin_guide_000183.html + +.. _admin_guide_000183: + +Modifying the FusionInsight Manager Routing Table +================================================= + +Scenario +-------- + +When FusionInsight Manager is installed, two pieces of routing information are automatically created on the active management node. You can run the **ip rule list** command to view the routing information, as shown in the following example: + +.. code-block:: + + 0:from all lookup local + 32764:from all to 10.10.100.100 lookup ntp_rt #NTP routing information created by FusionInsight Manager (this information is unavailable if no external NTP clock source is configured). + 32765:from 192.168.0.117 lookup om_rt #OM routing information created by the FusionInsight Manager. + 32766:from all lookup main + 32767:from all lookup default + +.. 
note:: + + If no external NTP server has been configured, only the OM routing information will be created. + +If the routing information created by FusionInsight Manager conflicts with the routing information configured in the enterprise network planning, the cluster administrator can use **autoroute.sh** to disable or enable the routing information created by FusionInsight Manager. + +Impact on the System +-------------------- + +After the routing information created by FusionInsight Manager is disabled and before the new routing information is set, FusionInsight Manager cannot be accessed but the clusters are running properly. + +Prerequisites +------------- + +FusionInsight Manager has been installed. + +You have obtained routing information about the WS floating IP address. + +Disable the Routing Information Created by the System +----------------------------------------------------- + +#. Log in to the active management node as user **omm**. Run the following commands to disable the routing information created by the system: + + **cd ${BIGDATA_HOME}/om-server/om/sbin** + + **./autoroute.sh disable** + + .. code-block:: + + Deactivating Route. + Route operation (disable) successful. + +#. Run the following command to view the execution result: + + **ip rule list** + + .. code-block:: + + 0:from all lookup local + 32766:from all lookup main + 32767:from all lookup default + +#. Run the following command and enter the password of user **root** to switch to user **root**: + + **su - root** + +#. Run the following commands to manually create the routing information about the WS floating IP address: + + **ip route add** *Network segment of the WS floating IP address/Subnet mask of the WS floating IP address* **scope link src** *WS floating IP address* **dev** *NIC of the WS floating IP address* **table om_rt** + + **ip route add default via** *Gateway of the WS floating IP address* **dev** *NIC of the WS floating IP address* **table om_rt** + + **ip rule add from** *WS floating IP address* **table om_rt** + + Example: + + **ip route add 192.168.0.0/255.255.255.0 scope link src 192.168.0.117 dev eth0:ws table om_rt** + + **ip route add default via 192.168.0.254 dev eth0:ws table om_rt** + + **ip rule add from 192.168.0.117 table om_rt** + + .. note:: + + If IPv6 addresses are used, run the **ip -6 route add** command. + +#. Run the following commands to manually create the NTP service routing information. Skip this step when no external NTP clock source is configured. + + **ip route add default via** *IP gateway of the NTP service* **dev** *NIC of the local IP address* **table ntp_rt** + + **ip rule add to** *ntpIP* **table ntp_rt** + + *NIC of the local IP address* indicates the NIC that can communicate with the network segment where the NTP server is located. + + Example: + + **ip route add default via 10.10.100.254 dev eth0 table ntp_rt** + + **ip rule add to 10.10.100.100 table ntp_rt** + +#. View the execution result. + + In the following example, if the command output contains **om_rt** and **ntp_rt**, the operation is successful. + + **ip rule list** + + .. code-block:: + + 0:from all lookup local + 32764:from all to 10.10.100.100 lookup ntp_rt #This information is not displayed if no external NTP clock source is configured. + 32765:from 192.168.0.117 lookup om_rt + 32766:from all lookup main + 32767:from all lookup default + +Enable the Routing Information Created by the System +---------------------------------------------------- + +#. 
Log in to the active management node as user **omm**. + +#. Run the following commands to enable the routing information created by the system: + + **cd ${BIGDATA_HOME}/om-server/om/sbin** + + **./autoroute.sh enable** + + .. code-block:: + + Activating Route. + Route operation (enable) successful. + +#. View the execution result. + + In the following example, if the command output contains **om_rt** and **ntp_rt**, the operation is successful. + + **ip rule list** + + .. code-block:: + + 0:from all lookup local + 32764:from all to 10.10.100.100 lookup ntp_rt #This information is not displayed if no external NTP clock source is configured. + 32765:from 192.168.0.117 lookup om_rt + 32766:from all lookup main + 32767:from all lookup default diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/routine_maintenance.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/routine_maintenance.rst new file mode 100644 index 0000000..3b41b0e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/routine_maintenance.rst @@ -0,0 +1,62 @@ +:original_name: admin_guide_000191.html + +.. _admin_guide_000191: + +Routine Maintenance +=================== + +To ensure long-term and stable running of the system, administrators or maintenance engineers need to periodically check items listed in :ref:`Table 1 ` and rectify the detected faults based on the check results. It is recommended that administrators or engineers record the result in each task scenario and sign off based on the enterprise management regulations. + +.. _admin_guide_000191__t434f37d7cd504b43a86534eca10e2822: + +.. table:: **Table 1** Routine maintenance check items + + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Routine Maintenance Frequency | Task Scenario | Check Item | + +===============================+======================================+=============================================================================================================================================================================================================================================================================================+ + | Daily | Check the cluster service status. | - Check whether the running status and configuration status of each service are normal and whether the status icons are green. | + | | | - Check whether the running status and configuration status of the role instances in each service are normal and whether the status icons are green. | + | | | - Check whether the active/standby status of role instances in each service can be properly displayed. | + | | | - Check whether the dashboard of the services and role instances can be displayed properly. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Check the cluster host status. 
| - Check whether the running status of each host is normal and whether the status icon is green. | + | | | - Check the current disk usage, memory usage, and CPU usage of each host. Check whether the current memory usage and CPU usage are increasing. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Check the cluster alarm information. | Check whether alarms were generated for unhandled exceptions on the previous day, including alarms that were automatically cleared. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Check the cluster audit information. | Check whether critical and major operations are performed on the previous day and whether the operations are valid. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Check the cluster backup status. | Check whether OMS, DBService, NameNodeOMS, DBServiceOMS, and LDAP have been automatically backed up on the previous day. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | View the health check result. | Perform a health check on FusionInsight Manager and download the health check report to check whether the current cluster is abnormal. You are advised to enable the automatic health check, export the latest cluster health check result, and repair unhealthy items based on the result. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Check the network communication. | Check the cluster network status and check whether the network communication between nodes is delayed. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Check the storage status. | Check whether the total data storage volume of the cluster increases abruptly. 
| + | | | | + | | | - Check whether the disk usage is close to the threshold. If yes, locate the causes. For example, check whether the junk data or cold data left by services needs to be cleared. | + | | | - Check whether disk partitions need to be expanded based on the service growth trend. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Check logs. | - Check whether there are failed or unresponsive MapReduce and Spark tasks. Check the **/tmp/logs/${username}/logs/${application id}** log file in HDFS and rectify faults. | + | | | - Check Yarn task logs, view the logs of failed and unresponsive tasks, and delete duplicate data. | + | | | - Check the worker logs of Storm. | + | | | - Back up logs to the storage server. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Weekly | Manage users. | Check whether the user password is about to expire and notify the user of changing the password. To change the password of a machine-machine user, you need to download the keytab file again. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Analyze alarms. | Export and analyze alarms generated in a specified period. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Scan disks. | Check the disk health status. You are advised to use a dedicated disk check tool. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Collect statistics on storage. | Check in batches whether the disk data of cluster nodes is evenly stored, filter out the disks whose data increases significantly or is insufficient, and check whether the disks are normal. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Record changes. 
| Arrange and record the operations on cluster configuration parameters and files to provide reference for fault analysis and handling. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Monthly | Analyze logs. | - Collect and analyze hardware logs of cluster node servers, such as BMC system logs. | + | | | - Collect and analyze the OS logs of the cluster node servers. | + | | | - Collect and analyze cluster logs. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Diagnose the network. | Analyze the network health status of the cluster. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Manage hardware. | Check the equipment room environment and clean the devices. | + +-------------------------------+--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/switching_to_the_maintenance_mode.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/switching_to_the_maintenance_mode.rst new file mode 100644 index 0000000..8a4b165 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/cluster_management/switching_to_the_maintenance_mode.rst @@ -0,0 +1,94 @@ +:original_name: admin_guide_000189.html + +.. _admin_guide_000189: + +Switching to the Maintenance Mode +================================= + +Scenario +-------- + +FusionInsight Manager allows you to set clusters, services, hosts, or OMSs to the maintenance mode. Objects in maintenance mode do not report alarms. This prevents the system from generating a large number of unnecessary alarms during maintenance changes, such as upgrade, because these alarms may influence O&M personnel's judgment on the cluster status. + +- Cluster maintenance mode + + If a cluster is not brought online or has been brought offline due to O&M operations (for example, non-rolling upgrade), you can set the entire cluster to the maintenance mode. + +- Service maintenance mode + + When performing maintenance operations on a specific service (for example, performing service-affecting commissioning operations like batch restart of service instances, directly powering on or off nodes of the service, or repairing the service), you can set only this service to the maintenance mode. 
+ +- Host maintenance mode + + When performing maintenance operations on a host (such as powering on or off, isolating, or reinstalling the host, upgrading its OS, or replacing the host), you can set only this host to the maintenance mode. + +- OMS maintenance mode + + When restarting, replacing, or repairing an OMS node, you can set the OMS node to the maintenance mode. + +Impact on the System +-------------------- + +After the maintenance mode is set, alarms caused by non-maintenance operations are suppressed and cannot be reported. Alarms can be reported only when faults persist after the system exits the maintenance mode. Therefore, exercise caution when setting the maintenance mode. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Set the maintenance mode. + + Determine the object to set the maintenance mode based on the service scenario. For details, see :ref:`Table 1 `. + + .. _admin_guide_000189__table8578183123419: + + .. table:: **Table 1** Setting to the maintenance mode + + +----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scenario | Operation | + +====================================================+==============================================================================================================================================================================================================================+ + | Configure a cluster to enter the maintenance mode. | a. On FusionInsight Manager, click |image1| next to the target cluster name and select **Enter Maintenance Mode**. | + | | | + | | b. In the displayed dialog box, click **OK**. | + | | | + | | After the cluster enters the maintenance state, the status of the cluster becomes |image2|. After maintenance is complete, click **Exit Maintenance Mode**. The cluster then exits the maintenance mode. | + +----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configure a service to enter the maintenance mode. | a. On FusionInsight Manager, choose **Cluster**, click the name of the desired cluster, choose **Services**, and click the service name. | + | | | + | | b. On the service details page, click **More** and select **Enter Maintenance Mode**. | + | | | + | | c. In the displayed dialog box, click **OK**. | + | | | + | | After a service enters the maintenance mode, the status of the service becomes |image3| in the service list. After maintenance is complete, click **Exit Maintenance Mode**. The service then exits the maintenance mode. | + | | | + | | .. note:: | + | | | + | | When configuring a service to enter the maintenance mode, you are advised to set the upper-layer services that depend on this service to the maintenance mode as well. | + +----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configure a host to the maintenance mode. | a. On FusionInsight Manager, choose **Hosts**. | + | | | + | | b. 
On the **Hosts** page, select the target host, click **More**, and select **Enter Maintenance Mode**. | + | | | + | | c. In the displayed dialog box, click **OK**. | + | | | + | | After the host enters the maintenance mode, the status of the host becomes |image4| in the host list. After maintenance is complete, click **Exit Maintenance Mode**. The host then exits the maintenance mode. | + +----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configure the OMS to enter the maintenance mode. | a. On FusionInsight Manager, choose **System** > **OMS** > **Enter Maintenance Mode**. | + | | | + | | b. In the displayed dialog box, click **OK**. | + | | | + | | After the OMS enters the maintenance state, the OMS status becomes |image5|. After maintenance is complete, click **Exit Maintenance Mode**. The OMS then exits the maintenance mode. | + +----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Check the cluster maintenance view. + + On FusionInsight Manager, click |image6| next to the cluster name and select **Maintenance Mode View**. In the displayed window, you can view the services and hosts in maintenance mode in the cluster. + + After maintenance is complete, you can select services and hosts in batches in the maintenance mode view and click **Exit Maintenance Mode** to make them exit the maintenance mode. + +.. |image1| image:: /_static/images/en-us_image_0263899304.png +.. |image2| image:: /_static/images/en-us_image_0263899339.png +.. |image3| image:: /_static/images/en-us_image_0263899293.png +.. |image4| image:: /_static/images/en-us_image_0263899476.png +.. |image5| image:: /_static/images/en-us_image_0263899363.png +.. |image6| image:: /_static/images/en-us_image_0263899235.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/fusioninsight_manager_introduction.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/fusioninsight_manager_introduction.rst new file mode 100644 index 0000000..7dffcdd --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/fusioninsight_manager_introduction.rst @@ -0,0 +1,52 @@ +:original_name: admin_guide_000002.html + +.. _admin_guide_000002: + +FusionInsight Manager Introduction +================================== + +Overview +-------- + +MRS allows you to manage and analyze massive amounts of structured and unstructured data for rapid data mining. Open source components have complex structures and therefore they are difficult to install, configure, and manage. FusionInsight Manager is a unified enterprise-level cluster management platform that provides: + +- **Cluster monitoring**: enables you to quickly learn the running status of hosts and services. +- **Graphical metric monitoring and customization**: enable you to obtain key system information in a timely manner. +- **Service property configuration**: allows you to configure service properties based on the performance requirements of your services. 
+- **Cluster, service, and role instance operations**: allow you to start or stop services and clusters with just a few clicks. +- **Rights management and audit**: allow you to configure the access control and manage operation logs. + +Introduction to the FusionInsight Manager GUI +--------------------------------------------- + +FusionInsight Manager provides a unified cluster management platform, facilitating rapid and easy O&M for clusters. + +|image1| + +The upper part of the page is the operation bar, the middle part is the display area, and the bottom part is the taskbar. + +:ref:`Table 1 ` describes the functions of each portal on the operation bar. + +.. _admin_guide_000002__t2e8dff27d0214c0885ce9fa207af6953: + +.. table:: **Table 1** Functions of each portal on the operation bar + + +------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Portal | Function Description | + +==================+================================================================================================================================================================================================================================================================================================================================+ + | Home Page | Displays key monitoring metrics of clusters and host statuses in column charts, line charts, and tables. You can customize a dashboard for key monitoring metrics and drag them onto any positions on the UI. The **Summary** tab page supports automatic data update. For details, see :ref:`Home Page `. | + +------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster | Provides guidance on how to monitor, operate, and configure services in a cluster, helping you manage services in a unified manner. For details, see :ref:`Cluster `. | + +------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Host | Provides guidance on how to monitor and operate hosts, helping you manage hosts in a unified manner. For details, see :ref:`Hosts `. | + +------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | Provides guidance on how to query and handle alarms, helping you identify and rectify product faults and potential risks in a timely manner to ensure smooth system running. For details, see :ref:`O&M `. 
| + +------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Audit | Allows you to query and export audit logs, and view all user activities and operations. For details, see :ref:`Audit `. | + +------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant Resources | Provides a unified tenant management platform. For details, see :ref:`Tenant Resources `. | + +------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | System | Provides system management settings of FusionInsight Manager, such as user permission settings. For details, see :ref:`System Configuration `. | + +------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001388681282.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/index.rst new file mode 100644 index 0000000..291130a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/index.rst @@ -0,0 +1,20 @@ +:original_name: admin_guide_000001.html + +.. _admin_guide_000001: + +Getting Started +=============== + +- :ref:`FusionInsight Manager Introduction ` +- :ref:`Querying the FusionInsight Manager Version ` +- :ref:`Logging In to FusionInsight Manager ` +- :ref:`Logging In to the Management Node ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + fusioninsight_manager_introduction + querying_the_fusioninsight_manager_version + logging_in_to_fusioninsight_manager + logging_in_to_the_management_node diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/logging_in_to_fusioninsight_manager.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/logging_in_to_fusioninsight_manager.rst new file mode 100644 index 0000000..1dfce85 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/logging_in_to_fusioninsight_manager.rst @@ -0,0 +1,31 @@ +:original_name: admin_guide_000004.html + +.. _admin_guide_000004: + +Logging In to FusionInsight Manager +=================================== + +Scenario +-------- + +Log in to FusionInsight Manager using an account. + +Procedure +--------- + +#. Obtain the URL for logging in to FusionInsight Manager. + +#. On login page, enter the username and password. 
+
+#. Change the password upon your first login.
+
+   The password must:
+
+   - Contain 8 to 64 characters.
+   - Contain at least four types of the following characters: uppercase letters, lowercase letters, digits, spaces, and special characters (:literal:`\`~!@#$%^&*()-_=+|[{}];',<.>/\\?`).
+   - Be different from the username or the username spelled backwards.
+   - Be different from the current password.
+
+#. Move the cursor over |image1| in the upper right corner of the home page and choose **Logout** from the drop-down list. In the dialog box that is displayed, click **OK** to log out the current user.
+
+.. |image1| image:: /_static/images/en-us_image_0000001388996278.png
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/logging_in_to_the_management_node.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/logging_in_to_the_management_node.rst
new file mode 100644
index 0000000..3d6c752
--- /dev/null
+++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/logging_in_to_the_management_node.rst
@@ -0,0 +1,59 @@
+:original_name: admin_guide_000005.html
+
+.. _admin_guide_000005:
+
+Logging In to the Management Node
+=================================
+
+Scenario
+--------
+
+Some O&M operation scripts and commands need to be run or can be run only on the active management node. You can identify and log in to the active or standby management node based on the following operations.
+
+Checking and Logging In to the Active and Standby Management Nodes
+-------------------------------------------------------------------
+
+#. Log in to FusionInsight Manager.
+
+#. Choose **System** > **OMS**.
+
+   In the **Basic Information** area, **Current Active** indicates the host name of the active management node, and **Current Standby** indicates the host name of the standby management node.
+
+   Click a host name to go to the host details page. On the host details page, record the IP address of the host.
+
+#. Log in to the active or standby management node as user **root**.
+
+Identifying the Active and Standby Management Nodes by Running Scripts and Logging In to Them
+----------------------------------------------------------------------------------------------
+
+#. Log in to any node where FusionInsight Manager is installed as user **root**.
+
+#. Run the following commands to identify the active and standby management nodes:
+
+   **su - omm**
+
+   **sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh**
+
+   In the command output, the node whose **HAActive** is **active** is the active management node (Master1), and the node whose **HAActive** is **standby** is the standby management node (Master2).
+
+   .. code-block::
+
+      HAMode
+      double
+      NodeName      HostName    HAVersion     StartTime             HAActive    HAAllResOK    HARunPhase
+      192-168-0-30  Master1     V100R001C01   2021-09-01 07:12:05   active      normal        Actived
+      192-168-0-24  Master2     V100R001C01   2021-09-01 07:14:02   standby     normal        Deactived
+
+#. Run the following command to obtain the IP addresses of the active and standby management nodes:
+
+   **cat /etc/hosts**
+
+   Example IP addresses of the active and standby management nodes:
+
+   .. code-block::
+
+      127.0.0.1 localhost
+      192.168.0.30 Master1
+      192.168.0.24 Master2
+
+#. Log in to the active or standby management node as user **root**.
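+
+As a supplement to the procedure above, the two checks can be combined into one script. The following is a minimal, non-authoritative sketch: it assumes the **status-oms.sh** path shown above and the column layout of the sample output (host name in column 2, **HAActive** in column 6), both of which should be verified on your own cluster before you rely on the script.
+
+.. code-block::
+
+   #!/bin/bash
+   # Hedged sketch: print the active OMS management node and its IP address.
+   # Run as user root on any node where FusionInsight Manager is installed.
+   # Field numbers ($2 = HostName, $6 = HAActive) are assumed from the sample
+   # status-oms.sh output above; adjust them if your output differs.
+   active_host=$(su - omm -c 'sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh' \
+     | awk '$6 == "active" {print $2}')
+   echo "Active management node: ${active_host}"
+   # Resolve the IP address from /etc/hosts, as in the step above.
+   awk -v host="${active_host}" '$2 == host {print $1}' /etc/hosts
+
+If the script prints no host name, check the **status-oms.sh** output manually, because the column positions or the **HAActive** values may differ in your environment.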
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/querying_the_fusioninsight_manager_version.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/querying_the_fusioninsight_manager_version.rst new file mode 100644 index 0000000..3918e7f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/getting_started/querying_the_fusioninsight_manager_version.rst @@ -0,0 +1,37 @@ +:original_name: admin_guide_000003.html + +.. _admin_guide_000003: + +Querying the FusionInsight Manager Version +========================================== + +By viewing the FusionInsight Manager version, you can prepare for system upgrade and routine maintenance. + +- Using the GUI: + + Log in to FusionInsight Manager. On the home page, click |image1| in the upper right corner and choose **About** from the drop-down list. In the dialog box that is displayed, view the FusionInsight Manager version. + +- Using the CLI + + #. Log in to the FusionInsight Manager active management node as user **root**. + + #. Run the following commands to check the version and platform information of FusionInsight Manager: + + **su - omm** + + **cd ${BIGDATA_HOME}/om-server/om/sbin/pack** + + **./queryManager.sh** + + The following information is displayed: + + .. code-block:: + + Version Package Cputype + *** FusionInsight_Manager_*** x86_64 + + .. note:: + + **\**\*** indicates the version number. Replace it with the actual version number. + +.. |image1| image:: /_static/images/en-us_image_0000001438954277.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/home_page/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/home_page/index.rst new file mode 100644 index 0000000..1338b73 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/home_page/index.rst @@ -0,0 +1,16 @@ +:original_name: admin_guide_000006.html + +.. _admin_guide_000006: + +Home Page +========= + +- :ref:`Overview ` +- :ref:`Managing Monitoring Metric Reports ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview + managing_monitoring_metric_reports diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/home_page/managing_monitoring_metric_reports.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/home_page/managing_monitoring_metric_reports.rst new file mode 100644 index 0000000..f943499 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/home_page/managing_monitoring_metric_reports.rst @@ -0,0 +1,72 @@ +:original_name: admin_guide_000008.html + +.. _admin_guide_000008: + +Managing Monitoring Metric Reports +================================== + +Scenario +-------- + +On FusionInsight Manager, you can customize monitoring items to display on the homepage and export monitoring data. + +.. note:: + + The interval on the horizontal axis of the chart varies depending on the time period you specify. Data monitoring rules are as follows: + + - **0 to 25 hours**: The interval is 5 minutes. The cluster must have been installed for at least 10 minutes, and monitoring data of a maximum of 15 days is saved. + - **25 to 150 hours**: The interval is 30 minutes. The cluster must have been installed for at least 30 minutes, and monitoring data of a maximum of 3 months is saved. + - **150 to 300 hours**: The interval is 1 hour. 
The cluster must have been installed for at least 1 hour, and monitoring data of a maximum of 3 months is saved. + - **300 hours to 300 days**: The interval is 1 day. The cluster must have been installed for at least 1 day, and monitoring data of a maximum of 6 months is saved. + - **Over 300 days**: The interval is 7 days. The cluster must have been installed for more than 7 days, and monitoring data of a maximum of 1 year is saved. + - If the disk usage of the partition where GaussDB resides exceeds 80%, real-time monitoring data and monitoring data whose interval is 5 minutes will be deleted. + - **Storage resources (HDFS) in Tenant Resources (0 to 300 hours)**: The interval is 1 hour. The cluster must have been installed for at least 1 hour, and monitoring data of a maximum of 3 months is saved. + +Customizing a Monitoring Metric Report +-------------------------------------- + +#. Log in to FusionInsight Manager. +#. Choose **Homepage**. +#. In the upper right corner of the chart area, click |image1| and choose **Customize** from the displayed menu. + + .. note:: + + Monitoring data of the past 1 hour is displayed at an interval of 5 minutes. After you enter the **Real-time Monitoring** page, you can view that real-time monitoring data is displayed on the right of the monitoring chart at an interval of 5 minutes. + +#. In the left pane of the **Customize Statistics** dialog box, select a resource to monitor. +#. Select one or multiple monitoring metrics in the right pane. +#. Click **OK**. + +Exporting All Monitoring Data +----------------------------- + +#. Log in to FusionInsight Manager. + +#. Choose **Homepage**. + +#. In the upper right corner of the chart area, select a time range to obtain monitoring data, for example, **1w**. + + Real-time data is displayed by default, which cannot be exported. You can click |image2| to customize a time range. + +#. In the upper right corner of the chart area, click |image3| and choose **Export** from the displayed menu. + +Exporting Monitoring Data of a Specified Monitoring Item +-------------------------------------------------------- + +#. Log in to FusionInsight Manager. + +#. Choose **Homepage**. + +#. Click |image4| in the upper right corner of any monitoring report pane in the chart area of the target cluster. + +#. Select a time range to obtain monitoring data, for example, **1w**. + + Real-time data is displayed by default, which cannot be exported. You can click |image5| to customize a time range. + +#. Click **Export**. + +.. |image1| image:: /_static/images/en-us_image_0263899329.png +.. |image2| image:: /_static/images/en-us_image_0263899610.png +.. |image3| image:: /_static/images/en-us_image_0263899528.png +.. |image4| image:: /_static/images/en-us_image_0263899289.png +.. |image5| image:: /_static/images/en-us_image_0263899471.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/home_page/overview.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/home_page/overview.rst new file mode 100644 index 0000000..a1a4fd4 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/home_page/overview.rst @@ -0,0 +1,64 @@ +:original_name: admin_guide_000007.html + +.. _admin_guide_000007: + +Overview +======== + +After you log in to FusionInsight Manager, **Homepage** is displayed by default. 
On this page, the **Summary** tab displays the service statuses and monitoring status reports of each cluster, and the **Alarm Analysis** tab displays the statistics and analysis of top alarms. + +- On the right of the home page, you can view the number of alarms of different severities, number of running tasks, current user, and help information. + + - Click |image1| to view the task name, cluster, status, progress, start time, and end time of the last 100 operation tasks in **Task Management Center**. + + .. note:: + + For a start, stop, restart, or rolling restart task, you can abort it by clicking the task name in the task list, clicking **Abort**, and then entering the system administrator password in the dialog box that is displayed. An aborted task is no longer executed. + + - Click |image2| to obtain help information. + + .. table:: **Table 1** Help information + + ===== ======================================================= + Item Description + ===== ======================================================= + About Provides the FusionInsight Manager version information. + ===== ======================================================= + +- The taskbar at the bottom of the home page displays the language options of FusionInsight Manager and the current cluster time and time zone information. You can switch the system language as needed. + +Service Status Preview Area +--------------------------- + +The number of hosts available in and the number of services installed in each cluster are displayed on the left of the homepage. You can click |image3| to expand all service information of the cluster and view the status and alarms of each service. + +Click |image4| to perform basic O&M management operations on the current cluster. For details, see :ref:`Table 1 `. + +The |image5| icon on the left of each service name indicates that the service is running properly; the |image6| icon indicates that the current service fails to start; and the |image7| icon indicates that the current service is not started. + +You can also check whether alarms have been generated for the service on the right of the service name. If alarms have been generated, the alarm severities and the number of alarms are displayed. + +For components that support multiple services, if multiple services have been installed in the same cluster, the number of installed services is displayed on the right of each component. + +The |image8| icon displayed on the right of the service name indicates that the service configuration has expired. + +Monitoring Status Report Area +----------------------------- + +The chart area is on the right of the homepage, which displays key monitoring metric reports, such as the status of all hosts in the cluster, host CPU usage, and host memory usage. You can customize monitoring reports to display in this area. For details about how to manage monitoring metrics, see :ref:`Managing Monitoring Metric Reports `. + +You can view the data source of a monitoring chart in the lower left corner of the chart. You can zoom in on a monitoring report to view chart values more clearly or close the monitoring report. + +Alarm Analysis +-------------- + +On the **Alarm Analysis** tab page, you can view the **Top 20 Alarms** table and **Analysis on Top 3 Alarms** chart. You can click an alarm name in the **Top 20 Alarms** table to view analysis information of this alarm only. Alarm analysis allows you to view top alarms and their occurrence time so you can handle alarms accordingly, improving system stability. 
+ +.. |image1| image:: /_static/images/en-us_image_0000001388835338.png +.. |image2| image:: /_static/images/en-us_image_0000001438841461.png +.. |image3| image:: /_static/images/en-us_image_0263899453.png +.. |image4| image:: /_static/images/en-us_image_0263899217.png +.. |image5| image:: /_static/images/en-us_image_0263899616.png +.. |image6| image:: /_static/images/en-us_image_0263899555.png +.. |image7| image:: /_static/images/en-us_image_0263899343.png +.. |image8| image:: /_static/images/en-us_image_0263899493.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/configuring_racks_for_hosts.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/configuring_racks_for_hosts.rst new file mode 100644 index 0000000..3436a0c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/configuring_racks_for_hosts.rst @@ -0,0 +1,87 @@ +:original_name: admin_guide_000058.html + +.. _admin_guide_000058: + +Configuring Racks for Hosts +=========================== + +Scenario +-------- + +All hosts in a large cluster are usually deployed on multiple racks. Hosts on different racks communicate with each other through switches. The network bandwidth between different hosts on the same rack is much greater than that on different racks. In this case, plan the network topology based on the following requirements: + +- To improve the communication speed, it is recommended that data be exchanged between hosts on the same rack. +- To improve the fault tolerance capability, distribute processes or data of distributed services on different hosts of multiple racks as dispersedly as possible. + +Hadoop uses a file directory structure to represent hosts. + +The HDFS cannot automatically determine the network topology of each DataNode in the cluster. You need to set the rack name to identify the rack where the host is located so that the NameNode can draw the network topology of the required DataNodes and back up data of the DataNodes to different racks. Similarly, YARN needs to obtain rack information and allocate tasks to different NodeManagers as required. + +If the cluster network topology changes, you need to reallocate racks for hosts on FusionInsight Manager so that related services can be automatically adjusted. + +Impact on the System +-------------------- + +If the name of the host rack is changed, storage policy for HDFS replicas, YARN task assignment, and storage location of Kafka partitions will be affected. After the modification, you need to restart the HDFS, YARN, and Kafka for the configuration to take effect. + +Improper rack configuration will unbalance loads (including CPU, memory, disk, and network) among nodes in the cluster, which decreases the cluster reliability and stability. Therefore, before allocating racks, take all aspects into consideration and properly set racks. + +Rack Allocation Policies +------------------------ + +.. note:: + + Physical rack: indicates the real rack where the host resides. + + Logical rack: indicates the rack name of the host on FusionInsight Manager. + +Policy 1: Each logical rack has nearly the same number of hosts. + +Policy 2: The name of the logical rack of the host must comply with that of the physical rack to which the host belongs. 
+ +Policy 3: If there are only few hosts on a physical rack, combine this physical rack and other physical racks with few hosts into a logical rack, which complies with policy 1. Hosts in two equipment rooms cannot be placed in one logical rack. Otherwise, performance problems may be caused. + +Policy 4: If there are lots of hosts on a physical rack, divide these hosts into multiple logical racks, which complies with policy 1. Hosts with great differences should not be placed in the same logical rack. Otherwise, the cluster reliability will be decreased. + +Policy 5: You are advised to set **default** or other values for logical racks on the first layer, and the values in the same cluster must be consistent. + +Policy 6: The number of hosts in each rack cannot be less than 3. + +Policy 7: A cluster can contain at most 50 logical racks. If there are too many logical racks in a cluster, the maintenance is difficult. + +Best Practices +-------------- + +For example, in a cluster, 100 hosts are located in two equipment rooms A and B. A has 40 hosts and B has 60 hosts. In room A, there are 11 hosts on physical rack Ra1 and 29 hosts on physical rack Ra2. In room B, there are six hosts on physical rack Rb1, 33 hosts on physical rack Rb2, 18 hosts on physical rack Rb3, and three hosts on physical rack Rb4. + +According to the rack allocation policy, each logical rack contains nearly the same number (for example, 20) of hosts. The allocation details are as follows: + +- Logical rack /default/racka1: 11 hosts on physical rack Ra1 and nine hosts on physical rack Ra2 +- Logical rack /default/racka2: the remaining 20 hosts (except the nine hosts of logical rack /default/racka1) on physical rack Ra2 +- Logical rack /default/rackb1: six hosts on physical rack Rb1 and 13 hosts on physical rack Rb2 +- Logical rack /default/rackb2: the remaining 20 hosts on physical rack Rb2 +- Logical rack /default/rackb3: 18 hosts on physical rack Rb3 and three hosts on physical rack Rb4 + +Rack allocation example: + +|image1| + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Click **Hosts**. +#. Select the check box of the target host. +#. Select **Set Rack** from the **More** drop-down list. + + - Set rack names in hierarchy based on the actual network topology. Separate racks from different layers using slashes (/). + + - Rack naming rules are as follows: */level1/level2/...* The number of levels must be at least 1, and the name cannot be empty. A rack can contain letters, digits, and underscores (_) and cannot exceed 200 characters. + + For example, /default/rack0. + + - If the hosts in the rack to be modified contain DataNode instances, ensure that the rack name levels of the hosts where all DataNode instances reside are the same. Otherwise, the configuration fails to be delivered. + +#. Click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0263899649.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/exporting_host_information.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/exporting_host_information.rst new file mode 100644 index 0000000..e8e9406 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/exporting_host_information.rst @@ -0,0 +1,19 @@ +:original_name: admin_guide_000062.html + +.. 
_admin_guide_000062: + +Exporting Host Information +========================== + +Scenario +-------- + +Administrators can export information about all hosts on FusionInsight Manager. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Click **Hosts**. +#. Specify the status of required hosts in the drop-down list box on the upper right corner, or click **Advanced Search** to specify hosts. +#. Click **Export All**, select **TXT** or **CSV** for **Save As**, and click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/index.rst new file mode 100644 index 0000000..077a494 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/index.rst @@ -0,0 +1,22 @@ +:original_name: admin_guide_000054.html + +.. _admin_guide_000054: + +Host Maintenance Operations +=========================== + +- :ref:`Starting and Stopping All Instances on a Host ` +- :ref:`Performing a Host Health Check ` +- :ref:`Configuring Racks for Hosts ` +- :ref:`Isolating a Host ` +- :ref:`Exporting Host Information ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + starting_and_stopping_all_instances_on_a_host + performing_a_host_health_check + configuring_racks_for_hosts + isolating_a_host + exporting_host_information diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/isolating_a_host.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/isolating_a_host.rst new file mode 100644 index 0000000..82417c0 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/isolating_a_host.rst @@ -0,0 +1,52 @@ +:original_name: admin_guide_000059.html + +.. _admin_guide_000059: + +Isolating a Host +================ + +Scenario +-------- + +If a host is abnormal or faulty and cannot provide services or affects the cluster performance, you can remove the host from the available node in the cluster temporarily so that the client can access other available nodes. + +.. note:: + + Only non-management nodes can be isolated. + +Impact on the System +-------------------- + +- After a host is isolated, all role instances on the host will be stopped, and you cannot start, stop, or configure the host and all instances on the host. +- For some services, after a host is isolated, some instances on other nodes do not work, and the service configuration status may expire. +- After a host is isolated, statistics about the monitoring status and indicator data of the host hardware and instances on the host cannot be collected or displayed. +- Retain the default SSH port (22) of the target node. Otherwise, the task described in this section will fail. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Click **Hosts**. + +#. Select the check box of the host to be isolated. + +#. Select **Isolate** from the **More** drop-down list. + + In the displayed dialog box, enter the password of the current login user and click **OK**. + +#. In the displayed confirmation dialog box, select "I confirm to isolate the selected hosts and accept possible consequences of service faults." Click **OK**. + + Wait until the message "Operation succeeded" is displayed, and click **Finish**. 
+ + The host is successfully isolated and **Running Status** is **Isolated**. + +#. Log in to the isolated host as user **root** and run the **pkill -9 -u omm** command to stop the processes of user **omm** on the node. Then run the **ps -ef \| grep 'container' \| grep '${BIGDATA_HOME}' \| awk '{print $2}' \| xargs -I '{}' kill -9 '{}'** command to find and stop the container process. + +#. Cancel the isolation status of the host before using the host if you have rectified the host exception or fault. + + On the **Hosts** page, select the isolated host and choose **More** > **Cancel Isolation**. + + .. note:: + + After the isolation is canceled, all role instances on the host are not started by default. To start role instances on the host, select the target host on the Hosts page and choose **More** > **Start All Instances**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/performing_a_host_health_check.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/performing_a_host_health_check.rst new file mode 100644 index 0000000..57d6217 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/performing_a_host_health_check.rst @@ -0,0 +1,24 @@ +:original_name: admin_guide_000057.html + +.. _admin_guide_000057: + +Performing a Host Health Check +============================== + +Scenario +-------- + +If the running status of a host is not **Normal**, you can perform health checks on the host to check whether some basic functions are abnormal. During routine O&M, you can perform host health checks to ensure that the configuration parameters and monitoring of each role instance on the host are normal and can run stably for a long time. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Click **Hosts**. + +#. Select the check box of the target host. + +#. Select **Health Check** from the **More** drop-down list to start the health check. + + To export the result of the health check, click **Export Report** in the upper left corner. If any problem is detected, click **Help**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/starting_and_stopping_all_instances_on_a_host.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/starting_and_stopping_all_instances_on_a_host.rst new file mode 100644 index 0000000..ae573e5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_maintenance_operations/starting_and_stopping_all_instances_on_a_host.rst @@ -0,0 +1,19 @@ +:original_name: admin_guide_000056.html + +.. _admin_guide_000056: + +Starting and Stopping All Instances on a Host +============================================= + +Scenario +-------- + +If a host is faulty, you may need to stop all the roles on the host and perform maintenance check on the host. After the host fault is rectified, start all roles running on the host to recover host services. You can start or stop all instances on a host on the host management page or host details page on FusionInsight Manager. The following describes how to perform such operations on the host management page. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Click **Hosts**. +#. Select the check box of the target host. +#. 
Select **Start All Instances** or **Stop All Instances** from the **More** drop-down list to start or stop all role instances. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/checking_host_processes_and_resources.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/checking_host_processes_and_resources.rst new file mode 100644 index 0000000..bdf4c46 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/checking_host_processes_and_resources.rst @@ -0,0 +1,21 @@ +:original_name: admin_guide_000053.html + +.. _admin_guide_000053: + +Checking Host Processes and Resources +===================================== + +Overview +-------- + +Log in to FusionInsight Manager, click **Hosts**, and click the specified host name in the host list. On the host details page, click the **Process** and **Resource** tabs. + +Host Process +------------ + +On the **Process** tab page, the information about the role processes of the deployed service instances on the current host is displayed, including the process status, PID, and process running time. You can directly view the log files of each process online. + +Host Resource +------------- + +On the **Resource** tab page, the detailed resource usage of deployed service instances on the current host is displayed, including the CPU, memory, disk, and port usage. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/index.rst new file mode 100644 index 0000000..5f3628c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000050.html + +.. _admin_guide_000050: + +Host Management Page +==================== + +- :ref:`Viewing the Host List ` +- :ref:`Viewing the Host Dashboard ` +- :ref:`Checking Host Processes and Resources ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + viewing_the_host_list + viewing_the_host_dashboard + checking_host_processes_and_resources diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/viewing_the_host_dashboard.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/viewing_the_host_dashboard.rst new file mode 100644 index 0000000..7950038 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/viewing_the_host_dashboard.rst @@ -0,0 +1,51 @@ +:original_name: admin_guide_000052.html + +.. _admin_guide_000052: + +Viewing the Host Dashboard +========================== + +Overview +-------- + +Log in to FusionInsight Manager, click **Hosts**, and click a host name in the host list. The host details page contains the basic information area, disk status area, role list area, and monitoring chart. + +Basic Information Area +---------------------- + +The basic information area contains the key information about the host, such as the management IP address, service IP address, host type, rack, firewall, number of CPU cores, and OS. + +Disk Status Area +---------------- + +The disk status area contains all disk partitions configured for the cluster on the host and the usage of each disk partition. 
+ +Instance List Area +------------------ + +The instance list area displays all role instances installed on the host and the status of each role instance. You can click the log file next to a role instance name to view the log file content of the instance online. + +Alarm and Event History +----------------------- + +The alarm and event history area displays the key alarms and events reported by the current host. The system can display a maximum of 20 historical records. + +Chart +----- + +The monitoring chart area is displayed on the right of the host details page, and contains the key monitoring metrics of the host. + +You can choose |image1| > **Customize** in the upper right corner to customize the monitoring reports to be displayed in the chart area. Select a time range and choose |image2| > **Export** to export detailed monitoring metric data within the specified time range. + +You can click |image3| next to the title of a monitoring indicator to open the description of the monitoring indicator. + +Click the **Chart** tab of the host to view the full monitoring chart information about the host. + +GPU Card Status Area +-------------------- + +If the host is configured with GPU cards, the GPU card status area displays the model, location, and status of the GPU card installed on the host. + +.. |image1| image:: /_static/images/en-us_image_0263899316.png +.. |image2| image:: /_static/images/en-us_image_0263899637.png +.. |image3| image:: /_static/images/en-us_image_0263899593.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/viewing_the_host_list.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/viewing_the_host_list.rst new file mode 100644 index 0000000..5bd7b2c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/host_management_page/viewing_the_host_list.rst @@ -0,0 +1,47 @@ +:original_name: admin_guide_000051.html + +.. _admin_guide_000051: + +Viewing the Host List +===================== + +Overview +-------- + +Log in to FusionInsight Manager, click **Hosts**, and the host list is displayed on the host management page. You can view the host list and basic information of each host. + +You can switch view types and set search criteria to filter and search for hosts. + +Host View +--------- + +You can click **Role View** to view the roles deployed on each host. If the role supports the active/standby mode, the role name is displayed in bold. + +Host List +--------- + +The host list on the host management page contains all hosts in the cluster, and O&M operations can be performed on these hosts. + +On the host management page, you can filter hosts by node type or cluster. The rules for filtering host types are as follows: + +- A management node is the node where OMS is deployed. Additionally, control roles and data roles may also be deployed on management nodes. +- A control node is the node where control roles are deployed. Additionally, data roles may also be deployed on control nodes. +- A Data Node is the node where only data roles are deployed. + +If you select the **Host View**, the IP address, rack planning, AZ name, running status, cluster name, and hardware resource usage of each host are displayed. + +.. 
table:: **Table 1** Host running status + + +---------------+-------------------------------------------------------------------+ + | Status | Description | + +===============+===================================================================+ + | **Normal** | Indicates that the host is in the normal state. | + +---------------+-------------------------------------------------------------------+ + | **Faulty** | Indicates that the host is abnormal. | + +---------------+-------------------------------------------------------------------+ + | **Unknown** | Indicates that the initial status of the host cannot be detected. | + +---------------+-------------------------------------------------------------------+ + | **Isolated** | Indicates that the host is isolated. | + +---------------+-------------------------------------------------------------------+ + | **Suspended** | Indicates that the host is stopped. | + +---------------+-------------------------------------------------------------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/index.rst new file mode 100644 index 0000000..0cb2db2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000049.html + +.. _admin_guide_000049: + +Hosts +===== + +- :ref:`Host Management Page ` +- :ref:`Host Maintenance Operations ` +- :ref:`Resource Overview ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + host_management_page/index + host_maintenance_operations/index + resource_overview/index diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/cluster.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/cluster.rst new file mode 100644 index 0000000..6e8c860 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/cluster.rst @@ -0,0 +1,23 @@ +:original_name: admin_guide_000066.html + +.. _admin_guide_000066: + +Cluster +======= + +Log in to FusionInsight Manager and choose **Hosts** > **Resource Overview**. On the **Resource Overview** page that is displayed, click the **Cluster** tab to view resource monitoring of all clusters. + +By default, the monitoring data of the past one hour (**1h**) is displayed. You can click |image1| to customize a time range. Time range options are **1h**, **2h**, **6h**, **12h**, **1d**, **1w**, and **1m**. + + +.. figure:: /_static/images/en-us_image_0000001369944573.png + :alt: **Figure 1** Cluster tab + + **Figure 1** Cluster tab + +- You can click **Specify Cluster** to customize a cluster to display. +- You can choose |image2| > **Customize** to customize the metrics to display on the tab page. For details about the metrics, see :ref:`Table 1 ` in :ref:`Distribution `. +- You can click **Export Data** to export the metric values of each cluster within the time range you have specified. + +.. |image1| image:: /_static/images/en-us_image_0000001318157588.png +.. 
|image2| image:: /_static/images/en-us_image_0263899311.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/distribution.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/distribution.rst new file mode 100644 index 0000000..a354164 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/distribution.rst @@ -0,0 +1,89 @@ +:original_name: admin_guide_000064.html + +.. _admin_guide_000064: + +Distribution +============ + +Log in to FusionInsight Manager and choose **Hosts** > **Resource Overview**. On the **Resource Overview** page that is displayed, click the **Distribution** tab to view resource distribution of each cluster. By default, the monitoring data of the past one hour (**1h**) is displayed. You can click |image1| to customize a time range. Time range options are **1h**, **2h**, **6h**, **12h**, **1d**, **1w**, and **1m**. + +.. _admin_guide_000064__fig10343181024812: + +.. figure:: /_static/images/en-us_image_0000001087459671.png + :alt: **Figure 1** Distribution tab + + **Figure 1** Distribution tab + +- You can click **Select Metric** to customize the metric to monitor. :ref:`Table 1 ` describes all the metrics that you can select. After you select a metric, the host distribution in each range of the metric is displayed. +- When you hover your cursor over a color column, the number of hosts in the current metric range is displayed. See :ref:`Figure 1 `. You can click a color column to view the list of hosts in the metric range. + + - You can click a host name in the **Host Name** column to access the host details page. + - You can click **View Trends** in the **Operation** column of a host to view the maximum, minimum, and average values of the current metric in the cluster as well as the value of the current host. In the current cluster, if you have selected **Host CPU-Memory-Disk Usage**, **View Trends** is unavailable. + +- You can click **Export Data** to export the maximum, minimum, and average values of the current metric of all nodes in the cluster within the time range you have specified. + +.. _admin_guide_000064__table1190415121488: + +.. 
table:: **Table 1** Metrics + + +-----------------------------------+--------------------------------------------------------------+ + | Category | Metric | + +===================================+==============================================================+ + | Process | - Number of Running Processes | + | | - Total Number of Processes | + | | - Total Number of omm Processes | + | | - Uninterruptible Sleep Process | + +-----------------------------------+--------------------------------------------------------------+ + | Network Status | - Host Network Packet Collisions | + | | - Number of LAST_ACK States | + | | - Number of CLOSING States | + | | - Number of LISTENING States | + | | - Number of CLOSED States | + | | - Number of ESTABLISHED States | + | | - Number of SYN_RECV States | + | | - Number of TIME_WAITING States | + | | - Number of FIN_WAIT2 States | + | | - Number of FIN_WAIT1 States | + | | - Number of CLOSE_WAIT States | + | | - DNS Name Resolution Duration | + | | - TCP Ephemeral Port Usage | + | | - Host Network Packet Frame Errors | + +-----------------------------------+--------------------------------------------------------------+ + | Network Reading | - Host Network Read Packets | + | | - Host Network Read Dropped Packets | + | | - Host Network Read Error Packets | + | | - Host Network Rx Speed | + +-----------------------------------+--------------------------------------------------------------+ + | Disk | - Host Disk Write Speed | + | | - Host Used Disk | + | | - Host Free Disk | + | | - Host Disk Read Speed | + | | - Host Disk Usage | + +-----------------------------------+--------------------------------------------------------------+ + | Memory | - Free Memory | + | | - Cache Memory Size | + | | - Total Kernel Cache Memory Size | + | | - Shared Memory Size | + | | - Host Memory Usage | + | | - Used Memory | + +-----------------------------------+--------------------------------------------------------------+ + | Network Writing | - Host Network Write Packets | + | | - Host Network Write Error Packets | + | | - Host Network Tx Speed | + | | - Host Network Write Dropped Packets | + +-----------------------------------+--------------------------------------------------------------+ + | CPU | - CPU Usage of Processes Whose Priorities Have Been Changed | + | | - CPU Usage of User Space Processes | + | | - CPU Usage of Kernel Space Processes | + | | - Host CPU Usage | + | | - CPU Total Time | + | | - CPU Idle Time | + +-----------------------------------+--------------------------------------------------------------+ + | Host Status | - Host File Handle Usage | + | | - Average OS Load in 1 Minute | + | | - Average OS Load in 5 Minutes | + | | - Average OS Load in 15 Minutes | + | | - Host PID Usage | + +-----------------------------------+--------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001318123498.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/host.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/host.rst new file mode 100644 index 0000000..b5799b1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/host.rst @@ -0,0 +1,32 @@ +:original_name: admin_guide_000067.html + +.. _admin_guide_000067: + +Host +==== + +Log in to FusionInsight Manager and choose **Hosts** > **Resource Overview**. 
On the **Resource Overview** page that is displayed, click the **Host** tab to view host resource overview, including basic configurations (CPU/memory) and disk configurations. + +You can click **Export Data** to export the configuration list of all hosts in the cluster, including the host name, management IP address, host type, number of cores, CPU architecture, memory capacity, and disk size. + + +.. figure:: /_static/images/en-us_image_0000001369746545.png + :alt: **Figure 1** Host tab + + **Figure 1** Host tab + +Basic Configurations (CPU/Memory) +--------------------------------- + +You can hover your cursor over the pie chart to view the number of hosts of each hardware configuration in the cluster. The information is displayed in the format of *Number of cores (CPU architecture) Memory size*. + +You can click a slice on the pie chart to view the list of hosts. + +Disk Configurations +------------------- + +The horizontal axis indicates the total disk capacity (including the OS disk) of a node, and the vertical axis indicates the number of logical disks (including the OS disk). + +You can hover your cursor over a dot to view information about disks of the current configuration, including the quantity of disks, total capacity, and number of hosts. + +You can click a dot on the chart to view the list of hosts. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/index.rst new file mode 100644 index 0000000..97ee2b1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/index.rst @@ -0,0 +1,20 @@ +:original_name: admin_guide_000063.html + +.. _admin_guide_000063: + +Resource Overview +================= + +- :ref:`Distribution ` +- :ref:`Trend ` +- :ref:`Cluster ` +- :ref:`Host ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + distribution + trend + cluster + host diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/trend.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/trend.rst new file mode 100644 index 0000000..734d51e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/hosts/resource_overview/trend.rst @@ -0,0 +1,21 @@ +:original_name: admin_guide_000065.html + +.. _admin_guide_000065: + +Trend +===== + +Log in to FusionInsight and choose **Hosts** > **Resource Overview**. On the **Resource Overview** page that is displayed, click the **Trend** tab to view resource trends of all clusters or a single cluster. By default, the monitoring data of the past one hour (**1h**) is displayed. You can click |image1| to customize a time range. Time range options are **1h**, **2h**, **6h**, **12h**, **1d**, **1w**, and **1m**. By default, the trend chart of each metric displays the maximum, minimum, and average values of the entire cluster. + + +.. figure:: /_static/images/en-us_image_0000001370061921.png + :alt: **Figure 1** Trend tab + + **Figure 1** Trend tab + +- You can click **Add Host to Chart** to add trend lines of up to 12 hosts to the trend charts. +- You can choose |image2| > **Customize** to customize the metrics to display on the tab page. For details about the metrics, see :ref:`Table 1 ` in :ref:`Distribution `. 
+- You can click **Export Data** to export the maximum, minimum, and average values of all nodes in the cluster for all selected metrics within the time range you have specified. + +.. |image1| image:: /_static/images/en-us_image_0000001369965785.png +.. |image2| image:: /_static/images/en-us_image_0263899424.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/index.rst new file mode 100644 index 0000000..3f9605a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/index.rst @@ -0,0 +1,38 @@ +:original_name: mrs_01_0606.html + +.. _mrs_01_0606: + +FusionInsight Manager Operation Guide (Applicable to 3.x) +========================================================= + +- :ref:`Getting Started ` +- :ref:`Home Page ` +- :ref:`Cluster ` +- :ref:`Hosts ` +- :ref:`O&M ` +- :ref:`Audit ` +- :ref:`Tenant Resources ` +- :ref:`System Configuration ` +- :ref:`Cluster Management ` +- :ref:`Log Management ` +- :ref:`Backup and Recovery Management ` +- :ref:`Security Management ` +- :ref:`Alarm Reference (Applicable to MRS 3.x) ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + getting_started/index + home_page/index + cluster/index + hosts/index + o&m/index + audit/index + tenant_resources/index + system_configuration/index + cluster_management/index + log_management/index + backup_and_recovery_management/index + security_management/index + alarm_reference_applicable_to_mrs_3.x/index diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/about_logs.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/about_logs.rst new file mode 100644 index 0000000..5002e45 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/about_logs.rst @@ -0,0 +1,886 @@ +:original_name: admin_guide_000193.html + +.. _admin_guide_000193: + +About Logs +========== + +Log Description +--------------- + +MRS cluster logs are stored in the **/var/log/Bigdata** directory. The following table lists the log types. + +.. table:: **Table 1** Log types + + +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Type | Description | + +===================+===================================================================================================================================================================================================+ + | Installation logs | Installation logs record information about FusionInsight Manager, cluster, and service installation to help users locate installation errors. | + +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Run logs | Run logs record the running track information, debugging information, status changes, potential problems, and error information generated during the running of services. 
| + +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Audit logs | Audit logs record information about users' activities and operation instructions, which can be used to locate fault causes in security events and determine who are responsible for these faults. | + +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +The following table lists the MRS log directories. + +.. table:: **Table 2** Log directories + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Directory | Log | + +===================================+======================================================================================================================================================================+ + | /var/log/Bigdata/audit | Component audit log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/controller | Log collecting script log. | + | | | + | | Controller process log. | + | | | + | | Controller monitoring log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/dbservice | DBService log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/flume | Flume log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/hbase | HBase log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/hdfs | HDFS log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/hive | Hive log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/httpd | HTTPd log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/hue | Hue log. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/kerberos | Kerberos log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/ldapclient | LDAP client log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/ldapserver | LDAP server log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/loader | Loader log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/logman | Logman script log management log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/mapreduce | MapReduce log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/nodeagent | NodeAgent log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/okerberos | OMS Kerberos log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/oldapserver | OMS LDAP log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/metric_agent | Run log file of MetricAgent. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/omm | **oms**: complex event processing log, alarm service log, HA log, authentication and authorization management log, and monitoring service run log of the OMM server. | + | | | + | | **oma**: installation log and run log of the OMM agent. | + | | | + | | **core**: dump log generated when the OMM agent and the HA process are suspended. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/spark2x | Spark2x log. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/sudo | Log generated when the **sudo** command is executed by user **omm**. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/timestamp | Time synchronization management log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/tomcat | Tomcat log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/watchdog | Watchdog log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/yarn | Yarn log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/zookeeper | ZooKeeper log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/oozie | Oozie log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/kafka | Kafka log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/storm | Storm log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/upgrade | OMS upgrade log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/update-service | Upgrade service log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. note:: + + After the multi-instance function is enabled, if the system administrator adds multiple HBase, Hive, and Spark service instances, the log description, log level, and log format of the newly added service instances are the same as those of the original service logs. Service instance logs are stored separately in the **/var/log/Bigdata/**\ *servicenameN* directory. 
The audit logs of the HBase and Hive service instances are stored in the **/var/log/Bigdata/audit/**\ *servicenameN* directory. For example, the logs of HBase1 are stored in the **/var/log/Bigdata/hbase1** and **/var/log/Bigdata/audit/hbase1** directories. + +Installation Logs +----------------- + +.. table:: **Table 3** Installation logs + + +----------------------------------------+------------------------------------------------------------------------------+ + | Installation Log | Description | + +========================================+==============================================================================+ + | Configuration log | Records information about the configuration process before the installation. | + +----------------------------------------+------------------------------------------------------------------------------+ + | FusionInsight Manager installation log | Records information about the two-node FusionInsight Manager installation. | + +----------------------------------------+------------------------------------------------------------------------------+ + | Cluster installation log | Records information about the cluster installation. | + +----------------------------------------+------------------------------------------------------------------------------+ + +Run Logs +-------- + +:ref:`Table 4 ` describes the running information recorded in run logs. + +.. _admin_guide_000193__t6d0bc48e23fc402ba1643b1dcd9f77c4: + +.. table:: **Table 4** Running information + + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Run Log | Description | + +=================================+===========================================================================================================================================================================+ + | Installation preparation log | Records information about preparations for the installation, such as the detection, configuration, and feedback operation information. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Process startup log | Records information about the commands executed during the process startup. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Process startup exception log | Records information about exceptions during process startup, such as dependent service errors and insufficient resources. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Process run log | Records information about the process running track information and debugging information, such as function entries and exits as well as cross-module interface messages. 
| + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Process running exception log | Records errors that cause process running errors, for example, the empty input objects or encoding or decoding failure. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Process running environment log | Records information about the process running environment, such as resource status and environment variables. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Script log | Records information about the script execution process. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Resource reclamation log | Records information about the resource reclaiming process. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Uninstallation clearing logs | Records information about operations performed during service uninstallation, such as directory and execution time deletion. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _admin_guide_000193__s481f1c14aca34ee788baed345970a5c0: + +Audit Logs +---------- + +Audit information recorded in audit logs includes FusionInsight Manager audit information and component audit information. + +.. table:: **Table 5** Audit information of FusionInsight Manager + + +-----------------------------------+--------------------------------------------------------------------+ + | Operation Type | Operation | + +===================================+====================================================================+ + | User management | Creating a user. | + | | | + | | Modifying a user. | + | | | + | | Deleting a user. | + | | | + | | Creating a user group. | + | | | + | | Modifying a user group. | + | | | + | | Deleting a group. | + | | | + | | Adding a role. | + | | | + | | Changing the user's roles. | + | | | + | | Deleting a role. | + | | | + | | Changing a password policy. | + | | | + | | Changing a password. | + | | | + | | Resetting a password. | + | | | + | | Logging in. | + | | | + | | Logging out. | + | | | + | | Unlocking the screen. | + | | | + | | Downloading the authentication credential. | + | | | + | | Unauthorized operation. | + | | | + | | Unlocking a user account. | + | | | + | | Locking a user account. | + | | | + | | Locking the screen. | + | | | + | | Exporting a user. | + | | | + | | Exporting a user group. | + | | | + | | Exporting a role. | + +-----------------------------------+--------------------------------------------------------------------+ + | Cluster management | Starting a cluster. | + | | | + | | Stopping a cluster. | + | | | + | | Restarting a cluster. 
| + | | | + | | Performing a rolling restart of a cluster. | + | | | + | | Restarting all expired instances. | + | | | + | | Saving configurations. | + | | | + | | Synchronizing cluster configurations. | + | | | + | | Customizing cluster monitoring metrics. | + | | | + | | Configuring monitoring dumping. | + | | | + | | Saving monitoring thresholds. | + | | | + | | Downloading a client configuration file. | + | | | + | | Configuring the northbound Syslog interface. | + | | | + | | Configuring the northbound SNMP interface. | + | | | + | | Clearing alarms using SNMP. | + | | | + | | Adding a trap target using SNMP. | + | | | + | | Deleting a trap target using SNMP. | + | | | + | | Checking alarms using SNMP. | + | | | + | | Synchronizing alarms using SNMP. | + | | | + | | Creating a threshold template. | + | | | + | | Deleting a threshold template. | + | | | + | | Applying a threshold template. | + | | | + | | Saving cluster monitoring configurations. | + | | | + | | Exporting configurations. | + | | | + | | Importing cluster configurations. | + | | | + | | Exporting an installation template. | + | | | + | | Modifying a threshold template. | + | | | + | | Canceling the application of a threshold template. | + | | | + | | Masking an alarm. | + | | | + | | Sending an alarm. | + | | | + | | Changing the OMS database password. | + | | | + | | Resetting the component database password. | + | | | + | | Restarting OMM and Controller. | + | | | + | | Starting the health check of a cluster. | + | | | + | | Importing a certificate file. | + | | | + | | Configuring SSO information. | + | | | + | | Deleting historical health check reports. | + | | | + | | Modifying cluster properties. | + | | | + | | Running maintenance commands in synchronous mode. | + | | | + | | Running maintenance commands in asynchronous mode. | + | | | + | | Customizing report monitoring metrics. | + | | | + | | Exporting report monitoring data. | + | | | + | | Runing a command in asynchronous mode using SNMP. | + | | | + | | Restarting the Web service. | + | | | + | | Customizing monitoring metrics for static resource pools. | + | | | + | | Exporting monitoring data of a static resource pool. | + | | | + | | Customizing dashboard monitoring metrics. | + | | | + | | Stopping a task. | + | | | + | | Restoring configurations. | + | | | + | | Modifying domain and mutual trust configurations. | + | | | + | | Modifying system parameters. | + | | | + | | Making a cluster enter the maintenance mode. | + | | | + | | Making a cluster exit the maintenance mode. | + | | | + | | Making OMS enter the maintenance mode. | + | | | + | | Making OMS exit the maintenance mode. | + | | | + | | Making services in a cluster exit the maintenance mode in batches. | + | | | + | | Modifying OMS configurations. | + | | | + | | Enabling threshold alarms. | + | | | + | | Synchronizing all cluster configurations. | + +-----------------------------------+--------------------------------------------------------------------+ + | Service management | Starting a service. | + | | | + | | Stopping a service. | + | | | + | | Synchronizing service configurations. | + | | | + | | Refreshing a service queue. | + | | | + | | Customizing service monitoring metrics. | + | | | + | | Restarting a service. | + | | | + | | Performing a rolling service restart. | + | | | + | | Exporting service monitoring data. | + | | | + | | Importing service configuration data. | + | | | + | | Starting the health check of a service. | + | | | + | | Configuring a service. 
| + | | | + | | Uploading a configuration file. | + | | | + | | Downloading a configuration file. | + | | | + | | Synchronizing instance configurations. | + | | | + | | Commissioning an instance. | + | | | + | | Decommissioning an instance. | + | | | + | | Starting an instance. | + | | | + | | Stopping an instance. | + | | | + | | Customizing instance monitoring metrics. | + | | | + | | Restarting an instance. | + | | | + | | Performing a rolling restart of an instance. | + | | | + | | Exporting instance monitoring data. | + | | | + | | Importing instance configuration data. | + | | | + | | Creating an instance group. | + | | | + | | Modifying an instance group. | + | | | + | | Deleting an instance group. | + | | | + | | Moving an instance to another instance group. | + | | | + | | Making a service enter the maintenance mode. | + | | | + | | Making a service exit the maintenance mode. | + | | | + | | Changing the name of a service. | + | | | + | | Modifying service association. | + | | | + | | Downloading monitoring data. | + | | | + | | Masking alarms. | + | | | + | | Unmasking alarms. | + | | | + | | Exporting report data of a service. | + | | | + | | Adding custom parameters for a report. | + | | | + | | Modifying custom parameters of a report. | + | | | + | | Deleting custom parameters of a report. | + | | | + | | Switching over control nodes. | + | | | + | | Adding a mount table. | + | | | + | | Modifying a mount table. | + +-----------------------------------+--------------------------------------------------------------------+ + | Host management | Setting a node rack. | + | | | + | | Starting all roles. | + | | | + | | Stopping all roles. | + | | | + | | Isolating a host. | + | | | + | | Canceling isolation of a host. | + | | | + | | Customizing host monitoring metrics. | + | | | + | | Exporting host monitoring data. | + | | | + | | Making a host enter the maintenance mode. | + | | | + | | Making a host exit the maintenance mode. | + | | | + | | Exporting basic host information. | + | | | + | | Exporting host distribution report data. | + | | | + | | Exporting host trend report data. | + | | | + | | Exporting host cluster report data. | + | | | + | | Exporting report data of a service. | + | | | + | | Customizing host cluster monitoring metrics. | + | | | + | | Customizing host cluster trend monitoring metrics. | + +-----------------------------------+--------------------------------------------------------------------+ + | Alarm management | Exporting alarms. | + | | | + | | Clearing alarms. | + | | | + | | Exporting events. | + | | | + | | Clearing alarms in batches. | + +-----------------------------------+--------------------------------------------------------------------+ + | Log collection | Collecting log files. | + | | | + | | Downloading log files. | + | | | + | | Collecting service stack information. | + | | | + | | Collecting instance stack information. | + | | | + | | Preparing service stack information. | + | | | + | | Preparing instance stack information. | + | | | + | | Clearing service stack information. | + | | | + | | Clearing instance stack information. | + +-----------------------------------+--------------------------------------------------------------------+ + | Audit log management | Modifying audit dumping configurations. | + | | | + | | Exporting audit logs. | + +-----------------------------------+--------------------------------------------------------------------+ + | Data backup and restoration | Creating a backup task. 
| + | | | + | | Executing a backup task. | + | | | + | | Executing backup tasks in batches. | + | | | + | | Stopping a backup task. | + | | | + | | Deleting a backup task. | + | | | + | | Modifying a backup task. | + | | | + | | Locking a backup task. | + | | | + | | Unlocking a backup task. | + | | | + | | Creating a restoration task. | + | | | + | | Executing a restoration task. | + | | | + | | Stopping a restoration task. | + | | | + | | Retrying a restoration task. | + | | | + | | Deleting a restoration task. | + +-----------------------------------+--------------------------------------------------------------------+ + | Multi-tenant management | Saving static configurations. | + | | | + | | Adding a tenant. | + | | | + | | Deleting a tenant. | + | | | + | | Associating a service with a tenant. | + | | | + | | Deleting a service from a tenant. | + | | | + | | Configuring resources. | + | | | + | | Creating a resource. | + | | | + | | Deleting a resource. | + | | | + | | Adding a resource pool. | + | | | + | | Modifying a resource pool. | + | | | + | | Deleting a resource pool. | + | | | + | | Restoring tenant data. | + | | | + | | Modifying global configurations of a tenant. | + | | | + | | Modifying queue configurations of a capacity scheduler. | + | | | + | | Modifying queue configurations of a super scheduler. | + | | | + | | Modifying resource distribution of a capacity scheduler. | + | | | + | | Clearing resource distribution of a capacity scheduler. | + | | | + | | Modifying resource distribution of a super scheduler. | + | | | + | | Clearing resource distribution of a super scheduler. | + | | | + | | Adding a resource catalog. | + | | | + | | Modifying a resource catalog. | + | | | + | | Deleting a resource catalog. | + | | | + | | Customizing tenant monitoring metrics. | + +-----------------------------------+--------------------------------------------------------------------+ + | Health check | Starting the health check of a cluster. | + | | | + | | Starting the health check of a service. | + | | | + | | Starting the health check of a host. | + | | | + | | Starting the health check of OMS. | + | | | + | | Starting the system health check. | + | | | + | | Updating the health check configurations. | + | | | + | | Exporting health check reports. | + | | | + | | Exporting health check results of a cluster. | + | | | + | | Exporting health check results of a service. | + | | | + | | Exporting health check results of a host. | + | | | + | | Deleting historical health check reports. | + | | | + | | Exporting historical health check reports. | + | | | + | | Downloading a health check report. | + +-----------------------------------+--------------------------------------------------------------------+ + +.. table:: **Table 6** Component audit information + + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | Audit Log | Operation Type | Operation | + +=======================+=============================================+=================================================================================================+ + | ClickHouse audit log | Maintenance management | Granting permissions. | + | | | | + | | | Revoking permissions. | + | | | | + | | | Recording authentication and login information. 
| + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | Service operations | Creating databases or tables. | + | | | | + | | | Inserting, deleting, querying, and migrating data. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | DBService audit log | Maintenance management | Performing backup restoration operations. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | HBase audit log | Data definition language (DDL) statements | Creating a table. | + | | | | + | | | Deleting a table. | + | | | | + | | | Modifying a table. | + | | | | + | | | Adding a column family. | + | | | | + | | | Modifying a column family. | + | | | | + | | | Deleting a column family. | + | | | | + | | | Enabling a table. | + | | | | + | | | Disabling a table. | + | | | | + | | | Modifying user information. | + | | | | + | | | Changing a password. | + | | | | + | | | Logging in. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | Data manipulation language (DML) statements | Putting data (to the **hbase:meta**, **\_ctmeta\_**, and **hbase:acl** tables). | + | | | | + | | | Deleting data (from the **hbase:meta**, **\_ctmeta\_**, and **hbase:acl** tables). | + | | | | + | | | Checking and putting data (to the **hbase:meta**, **\_ctmeta\_**, and **hbase:acl** tables). | + | | | | + | | | Checking and deleting data (from the **hbase:meta**, **\_ctmeta\_**, and **hbase:acl** tables). | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | Permission control | Assigning permissions to a user. | + | | | | + | | | Canceling permission assigning. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | HDFS audit log | Permission management | Managing access permissions on files or folders. | + | | | | + | | | Managing the owner information of files or folders. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | File operations | Creating a folder. | + | | | | + | | | Creating a file. | + | | | | + | | | Opening a file. | + | | | | + | | | Appending file content. | + | | | | + | | | Changing a file name. | + | | | | + | | | Deleting a file or folder. | + | | | | + | | | Setting time property of a file. | + | | | | + | | | Setting the number of file copies. | + | | | | + | | | Merging files. | + | | | | + | | | Checking the file system. | + | | | | + | | | Linking to a file. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | Hive audit log | Metadata operations | Defining metadata, such as creating databases and tables. | + | | | | + | | | Deleting metadata, such as deleting databases and tables. 
| + | | | | + | | | Modifying metadata, such as adding columns and renaming tables. | + | | | | + | | | Importing and exporting metadata. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | Data maintenance | Loading data to a table. | + | | | | + | | | Inserting data into a table. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | Permission management | Creating or deleting a role. | + | | | | + | | | Granting/Reclaiming roles. | + | | | | + | | | Granting/Reclaiming permissions. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | Hue audit log | Service startup | Starting Hue. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | User operations | Logging in. | + | | | | + | | | Logging out. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | Task operations | Creating a task. | + | | | | + | | | Modifying a task. | + | | | | + | | | Deleting a task. | + | | | | + | | | Submitting a task. | + | | | | + | | | Saving a task. | + | | | | + | | | Updating the status of a task. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | KrbServer audit log | Maintenance management | Changing the password of a Kerberos account. | + | | | | + | | | Adding a Kerberos account. | + | | | | + | | | Deleting a Kerberos account. | + | | | | + | | | Authenticating users. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | LdapServer audit log | Maintenance management | Adding an OS user. | + | | | | + | | | Adding a user group. | + | | | | + | | | Adding a user to a user group. | + | | | | + | | | Deleting a user. | + | | | | + | | | Deleting a group. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | Loader audit log | Security management | Logging in. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | Metadata management | Querying connector information. | + | | | | + | | | Querying a framework. | + | | | | + | | | Querying step information. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | Data source connection management | Querying a data source connection. | + | | | | + | | | Adding a data source connection. | + | | | | + | | | Updating a data source connection. | + | | | | + | | | Deleting a data source connection. | + | | | | + | | | Activating a data source connection. | + | | | | + | | | Disabling a data source connection. 
| + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | Job management | Querying a job. | + | | | | + | | | Creating a job. | + | | | | + | | | Updating a job. | + | | | | + | | | Deleting a job. | + | | | | + | | | Activating a job. | + | | | | + | | | Disabling a job. | + | | | | + | | | Querying all execution records of a job. | + | | | | + | | | Querying the latest execution record of a job. | + | | | | + | | | Submitting a job. | + | | | | + | | | Stopping a job. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | MapReduce audit log | Application running | Starting a container request. | + | | | | + | | | Stopping a container request. | + | | | | + | | | After a container request is complete, the status of the request becomes successful. | + | | | | + | | | After a container request is complete, the status of the request becomes failed. | + | | | | + | | | After a container request is complete, the status of the request becomes suspended. | + | | | | + | | | Submitting a task. | + | | | | + | | | Ending a task. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | Oozie audit log | Task management | Submitting a task. | + | | | | + | | | Starting a task. | + | | | | + | | | Killing a task. | + | | | | + | | | Suspending a task. | + | | | | + | | | Resuming a task. | + | | | | + | | | Running a task again. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | Spark2x audit log | Metadata operations | Defining metadata, such as creating databases and tables. | + | | | | + | | | Deleting metadata, such as deleting databases and tables. | + | | | | + | | | Modifying metadata, such as adding columns and renaming tables. | + | | | | + | | | Importing and exporting metadata. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | Data maintenance | Loading data to a table. | + | | | | + | | | Inserting data into a table. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | Storm audit log | Nimbus operations | Submitting a topology. | + | | | | + | | | Stopping a topology. | + | | | | + | | | Reallocating a topology. | + | | | | + | | | Deactivating a topology. | + | | | | + | | | Activating a topology. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | UI operations | Stopping a topology. | + | | | | + | | | Reallocating a topology. | + | | | | + | | | Deactivating a topology. | + | | | | + | | | Activating a topology. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | Yarn audit log | Job submission | Submitting a job to a queue. 
| + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | ZooKeeper audit log | Permission management | Setting access permissions to Znode. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + | | Znode operations | Creating Znodes. | + | | | | + | | | Deleting Znodes. | + | | | | + | | | Configuring Znode data. | + +-----------------------+---------------------------------------------+-------------------------------------------------------------------------------------------------+ + +FusionInsight Manager audit logs are stored in the database. You can view and export the audit logs on the **Audit** page. + +The following table lists the directories to store component audit logs. Audit log files of some components are stored in **/var/log/Bigdata/audit**, such as HDFS, HBase, MapReduce, Hive, Hue, Yarn, Storm, and ZooKeeper. The component audit logs are automatically compressed and backed up to **/var/log/Bigdata/audit/bk** at 03: 00 every day. A maximum of latest 90 compressed backup files are retained, and the backup time cannot be changed. For details about how to configure the number of reserved audit log files, see :ref:`Configuring the Number of Local Audit Log Backups `. + +Audit log files of other components are stored in the component log directory. + +.. table:: **Table 7** Directories for storing component audit logs + + +-----------------------------------+-------------------------------------------------------------------------+ + | Component | Audit Log Directory | + +===================================+=========================================================================+ + | DBService | /var/log/Bigdata/audit/dbservice/dbservice_audit.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | HBase | /var/log/Bigdata/audit/hbase/hm/hbase-audit-hmaster.log | + | | | + | | /var/log/Bigdata/audit/hbase/hm/hbase-ranger-audit-hmaster.log | + | | | + | | /var/log/Bigdata/audit/hbase/rs/hbase-audit-regionserver.log | + | | | + | | /var/log/Bigdata/audit/hbase/rs/hbase-ranger-audit-regionserver.log | + | | | + | | /var/log/Bigdata/audit/hbase/rt/hbase-audit-restserver.log | + | | | + | | /var/log/Bigdata/audit/hbase/ts/hbase-audit-thriftserver.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | HDFS | /var/log/Bigdata/audit/hdfs/nn/hdfs-audit-namenode.log | + | | | + | | /var/log/Bigdata/audit/hdfs/nn/ranger-plugin-audit.log | + | | | + | | /var/log/Bigdata/audit/hdfs/dn/hdfs-audit-datanode.log | + | | | + | | /var/log/Bigdata/audit/hdfs/jn/hdfs-audit-journalnode.log | + | | | + | | /var/log/Bigdata/audit/hdfs/zkfc/hdfs-audit-zkfc.log | + | | | + | | /var/log/Bigdata/audit/hdfs/httpfs/hdfs-audit-httpfs.log | + | | | + | | /var/log/Bigdata/audit/hdfs/router/hdfs-audit-router.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Hive | /var/log/Bigdata/audit/hive/hiveserver/hive-audit.log | + | | | + | | /var/log/Bigdata/audit/hive/hiveserver/hive-rangeraudit.log | + | | | + | | /var/log/Bigdata/audit/hive/metastore/metastore-audit.log | + | | | + | | /var/log/Bigdata/audit/hive/webhcat/webhcat-audit.log | + 
+-----------------------------------+-------------------------------------------------------------------------+ + | Hue | /var/log/Bigdata/audit/hue/hue-audits.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Kafka | /var/log/Bigdata/audit/kafka/audit.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Loader | /var/log/Bigdata/loader/audit/default.audit | + +-----------------------------------+-------------------------------------------------------------------------+ + | MapReduce | /var/log/Bigdata/audit/mapreduce/jobhistory/mapred-audit-jobhistory.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Oozie | /var/log/Bigdata/audit/oozie/oozie-audit.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Spark2x | /var/log/Bigdata/audit/spark2x/jdbcserver/jdbcserver-audit.log | + | | | + | | /var/log/Bigdata/audit/spark2x/jdbcserver/ranger-audit.log | + | | | + | | /var/log/Bigdata/audit/spark2x/jobhistory/jobhistory-audit.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Storm | /var/log/Bigdata/audit/storm/logviewer/audit.log | + | | | + | | /var/log/Bigdata/audit/storm/nimbus/audit.log | + | | | + | | /var/log/Bigdata/audit/storm/supervisor/audit.log | + | | | + | | /var/log/Bigdata/audit/storm/ui/audit.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Yarn | /var/log/Bigdata/audit/yarn/rm/yarn-audit-resourcemanager.log | + | | | + | | /var/log/Bigdata/audit/yarn/rm/ranger-plugin-audit.log | + | | | + | | /var/log/Bigdata/audit/yarn/nm/yarn-audit-nodemanager.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | ZooKeeper | /var/log/Bigdata/audit/zookeeper/quorumpeer/zk-audit-quorumpeer.log | + +-----------------------------------+-------------------------------------------------------------------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/configuring_the_log_level_and_log_file_size.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/configuring_the_log_level_and_log_file_size.rst new file mode 100644 index 0000000..071b698 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/configuring_the_log_level_and_log_file_size.rst @@ -0,0 +1,66 @@ +:original_name: admin_guide_000195.html + +.. _admin_guide_000195: + +Configuring the Log Level and Log File Size +=========================================== + +Scenario +-------- + +You can change the log levels of FusionInsight Manager. For a specific service, you can change the log level and the log file size to prevent the failure in saving logs due to insufficient disk space. + +Impact on the System +-------------------- + +The services need to be restarted for the new configuration to take effect. During the restart, the services are unavailable. + +Changing the FusionInsight Manager Log Level +-------------------------------------------- + +#. Log in to the active management node as user **omm**. + +#. 
Run the following command to switch to the required directory: + + **cd ${BIGDATA_HOME}/om-server/om/sbin** + +#. Run the following command to change the log level: + + **./setLogLevel.sh** *log level* + + The priorities of log levels are FATAL, ERROR, WARN, INFO, and DEBUG in descending order. Logs whose levels are higher than or equal to the configured level are printed, so the number of printed logs decreases as the configured log level increases. An example invocation is provided after this procedure. + + - **DEFAULT**: After this parameter is set, the default log level is used. + - **FATAL**: critical error log level. After this parameter is set, only logs of the **FATAL** level are printed. + - **ERROR**: error log level. After this parameter is set, logs of the **ERROR** and **FATAL** levels are printed. + - **WARN**: warning log level. After this parameter is set, logs of the **WARN**, **ERROR**, and **FATAL** levels are printed. + - **INFO** (default): informational log level. After this parameter is set, logs of the **INFO**, **WARN**, **ERROR**, and **FATAL** levels are printed. + - **DEBUG**: debugging log level. After this parameter is set, logs of the **DEBUG**, **INFO**, **WARN**, **ERROR**, and **FATAL** levels are printed. + - **TRACE**: tracing log level. After this parameter is set, logs of the **TRACE**, **DEBUG**, **INFO**, **WARN**, **ERROR**, and **FATAL** levels are printed. + + .. note:: + + The log levels of components are different from those defined in open-source code. + +#. Download and view logs to verify that the log level settings have taken effect. For details, see :ref:`Log `.
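+ + For example, to temporarily raise the log output for troubleshooting and then restore the default level, the script can be invoked as follows. This is a minimal usage sketch based on the command path and level values described above; any of the listed levels can be used instead. + + .. code-block:: + + cd ${BIGDATA_HOME}/om-server/om/sbin + # Print logs of the DEBUG level and all higher-priority levels + ./setLogLevel.sh DEBUG + # Restore the default log level when troubleshooting is complete + ./setLogLevel.sh DEFAULT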
+ +Changing the Service Log Level and Log File Size +------------------------------------------------ + +.. note:: + + KrbServer, LdapServer, and DBService do not support changing the service log level or log file size. + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > *Name of the desired cluster* > **Services**. +#. Click a service in the service list. On the displayed page, click the **Configuration** tab. +#. On the displayed page, click the **All Configuration** tab. Expand the role instance list displayed on the left of the page and click **Log** of the role to be modified. +#. Search for each parameter to view its description. On the parameter configuration page, select the required log level or change the log file size. The log file size is expressed in MB. + + .. important:: + + - The system automatically deletes logs based on the configured log file size. To retain more information, set the log file size to a larger value. To ensure the integrity of log files, you are advised to manually back up the log files to another directory based on the actual service volume before they are deleted according to the clearance rules. + - Some services do not support changing the log level on the UI. + +#. Click **Save**. In the **Save Configuration** dialog box, click **OK**. +#. Download and view logs to verify that the log level settings have taken effect. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/configuring_the_number_of_local_audit_log_backups.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/configuring_the_number_of_local_audit_log_backups.rst new file mode 100644 index 0000000..ea4d9b6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/configuring_the_number_of_local_audit_log_backups.rst @@ -0,0 +1,40 @@ +:original_name: admin_guide_000196.html + +.. _admin_guide_000196: + +Configuring the Number of Local Audit Log Backups +================================================= + +Scenario +-------- + +Audit logs of cluster components are classified by name and stored in the **/var/log/Bigdata/audit** directory on each cluster node. The OMS automatically backs up the audit log directories at 03:00 every day. + +The audit log directory on each node is compressed into a **.tar.gz** file. All these compressed files are then compressed into a single **.tar.gz** file that is saved in the **/var/log/Bigdata/audit/bk/** directory on the active management node. In addition, the standby management node saves a copy of the file. + +By default, a maximum of 90 OMS backup files can be retained. This section describes how to configure the maximum number. + +Procedure +--------- + +#. Log in to the active management node as user **omm**. + + .. note:: + + Perform this operation only on the active management node. It is not supported on the standby management node; performing it there may cause the cluster to work improperly. + +#. Run the following command to switch to the required directory: + + **cd ${BIGDATA_HOME}/om-server/om/sbin** + +#. Run the following command to change the maximum number of audit log backup files to be retained: + + **./modifyLogConfig.sh -m** *maximum number of backup files to be retained* + + The default value is **90**. The value ranges from **0** to **365**. A larger value consumes more disk space. For example, **./modifyLogConfig.sh -m 180** retains a maximum of 180 backup files. + + If the following information is displayed, the operation is successful: + + .. code-block:: + + Modify log config successfully diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/index.rst new file mode 100644 index 0000000..f63dd67 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/index.rst @@ -0,0 +1,22 @@ +:original_name: admin_guide_000192.html + +.. _admin_guide_000192: + +Log Management +============== + +- :ref:`About Logs ` +- :ref:`Manager Log List ` +- :ref:`Configuring the Log Level and Log File Size ` +- :ref:`Configuring the Number of Local Audit Log Backups ` +- :ref:`Viewing Role Instance Logs ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + about_logs + manager_log_list + configuring_the_log_level_and_log_file_size + configuring_the_number_of_local_audit_log_backups + viewing_role_instance_logs diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/manager_log_list.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/manager_log_list.rst new file mode 100644 index 0000000..91b9b34 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/manager_log_list.rst @@ -0,0 +1,373 @@ +:original_name: admin_guide_000194.html + +.. _admin_guide_000194: + +Manager Log List +================ + +Log Description +--------------- + +**Log path**: The default storage path of Manager log files is **/var/log/Bigdata/**\ *Manager component*.
+ +- ControllerService: **/var/log/Bigdata/controller/** (OMS installation and run logs) +- HTTPd: **/var/log/Bigdata/httpd** (HTTPd installation and run logs) +- Logman: **/var/log/Bigdata/logman** (log packaging tool logs) +- NodeAgent: **/var/log/Bigdata/nodeagent** (NodeAgent installation and run logs) +- okerberos: **/var/log/Bigdata/okerberos** (okerberos installation and run logs) +- oldapserver: **/var/log/Bigdata/oldapserver** (oldapserver installation and run logs) +- MetricAgent: **/var/log/Bigdata/metric_agent** (MetricAgent run logs) +- OMM: **/var/log/Bigdata/omm** (OMM installation and run logs) +- Timestamp: **/var/log/Bigdata/timestamp** (NodeAgent startup time logs) +- Tomcat: **/var/log/Bigdata/tomcat** (Web process logs) +- Watchdog: **/var/log/Bigdata/watchdog** (watchdog logs) +- Upgrade: **/var/log/Bigdata/upgrade** (OMS upgrade logs) +- UpdateService: **/var/log/Bigdata/update-service** (upgrade service logs) +- Sudo: **/var/log/Bigdata/sudo** (sudo script execution logs) +- OS: **/var/log/**\ *message file* (OS system logs) +- OS performance: **/var/log/osperf** (OS performance statistics logs) +- OS statistics: **/var/log/osinfo/statistics** (OS parameter configuration logs) + +**Log archive rule**: + +The automatic compression and archiving function is enabled for Manager logs. By default, when the size of a log file exceeds 10 MB, the log file is automatically compressed. The naming rule of a compressed log file is as follows: <*Original log name*>-<*yyyy-mm-dd_hh-mm-ss*>.[*ID*].\ **log.zip** A maximum of 20 latest compressed files are retained. + +.. table:: **Table 1** Manager logs + + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | Log Type | Log File Name | Description | + +=====================+==============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+===============================================================================================================================+ + | Controller run logs | controller.log | Log file that records component installation, upgrade, configuration, monitoring, alarm reporting, and routine O&M operations | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | controller_client.log | Run log file of the Representational State Transfer (REST) APIs | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | acs.log | ACS run log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | acs_spnego.log | spnego user logs in ACS | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | aos.log | AOS run log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | plugin.log | AOS plug-in logs | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | backupplugin.log | Run log file that records the backup and restoration operations | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | controller_config.log | Configuration run log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | controller_nodesetup.log | Controller loading task log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | controller_root.log | System log file of the Controller process | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | controller_trace.log | Log file that records the remote procedure call (RPC) communication between Controller and NodeAgent | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | controller_monitor.log | Monitoring log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | controller_fsm.log | State machine log file | + 
+ +--------------------+------------------------------------------------+------------------------------------------------+
+ | | controller_alarm.log | Controller alarm log file |
+ | | controller_backup.log | Controller backup and recovery log file |
+ | | install.log, restore_package.log, installPack.log, distributeAdapterFiles.log, and install_os_optimization.log | OMS installation log file |
+ | | oms_ctl.log | OMS startup and stop log file |
+ | | preInstall_client.log | Preprocessing log file before client installation |
+ | | installntp.log | NTP installation log file |
+ | | modify_manager_param.log | Manager parameter modification log file |
+ | | backup.log | OMS backup script run log file |
+ | | supressionAlarm.log | Alarm script run log file |
+ | | om.log | OM certificate generation log file |
+ | | backupplugin_ctl.log | Startup log file of the backup and restoration plug-in process |
+ | | getLogs.log | Run log of the log collection script |
+ | | backupAuditLogs.log | Run log of the audit log backup script |
+ | | certStatus.log | Log file that records regular certificate checks |
+ | | distribute.log | Certificate distribution log |
+ | | ficertgenetrate.log | Certificate replacement log file, covering level-2 certificates, CAS certificates, and HTTPd certificates |
+ | | genPwFile.log | Log file that records the generation of certificate password files |
+ | | modifyproxyconf.log | Log file that records the modification of the HTTPd proxy configuration |
+ | | importTar.log | Log file that records the process for importing certificates into the trust store |
+ +--------------------+------------------------------------------------+------------------------------------------------+
+ | HTTPd | install.log | HTTPd installation log file |
+ | | access_log, error_log | HTTPd run log file |
+ +--------------------+------------------------------------------------+------------------------------------------------+
+ | Logman | logman.log | Log packaging tool log file |
+ +--------------------+------------------------------------------------+------------------------------------------------+
+ | NodeAgent | install.log and install_os_optimization.log | NodeAgent installation log file |
+ | | installntp.log | NTP installation log file |
+ | | start_ntp.log | NTP startup log file |
+ | | ntpChecker.log | NTP check log file |
+ | | ntpMonitor.log | NTP monitoring log file |
+ | | heartbeat_trace.log | Log file that records heartbeats between NodeAgent and Controller |
+ | | alarm.log | Alarm log file |
+ | | monitor.log | Monitoring log file |
+ | | nodeagent_ctl.log and start-agent.log | NodeAgent startup log file |
+ | | agent.log | NodeAgent run log file |
+ | | cert.log | Certificate log file |
+ | | agentplugin.log | Log file that records the Agent plug-in running status |
+ | | omaplugin.log | OMA plug-in run log file |
+ | | diskhealth.log | Disk health check log file |
+ | | supressionAlarm.log | Alarm script run log file |
+ | | updateHostFile.log | Host list update log file |
+ | | collectLog.log | Run log file of the node log collection script |
+ | | host_metric_collect.log | Run log file of host metric collection |
+ | | checkfileconfig.log | Run log file of file permission check |
+ | | entropycheck.log | Entropy check run log file |
+ | | timer.log | Log file of scheduled tasks on the node |
+ | | pluginmonitor.log | Component monitoring plug-in log file |
+ | | agent_alarm_py.log | Log file that records alarms upon insufficient NodeAgent file permissions |
+ +--------------------+------------------------------------------------+------------------------------------------------+
+ | oKerberos | addRealm.log and modifyKerberosRealm.log | Realm handover log file |
+ | | checkservice_detail.log | Okerberos health check log file |
+ | | genKeytab.log | Keytab generation log file |
+ | | KerberosAdmin_genConfigDetail.log | Run log file for generating **kadmin.conf** when the kadmin process starts |
+ | | KerberosServer_genConfigDetail.log | Run log file for generating **krb5kdc.conf** when the krb5kdc process starts |
+ | | oms-kadmind.log | Run log file of the kadmin process |
+ | | oms_kerberos_install.log and postinstall_detail.log | Okerberos installation log file |
+ | | oms-krb5kdc.log | Run log file of the krb5kdc process |
+ | | start_detail.log | Okerberos startup log file |
+ | | realmDataConfigProcess.log | Log file that records the rollback upon a realm handover failure |
+ | | stop_detail.log | Okerberos stop log file |
+ +--------------------+------------------------------------------------+------------------------------------------------+
+ | oldapserver | ldapserver_backup.log | Oldapserver backup log file |
+ | | ldapserver_chk_service.log | Oldapserver health check log file |
+ | | ldapserver_install.log | Oldapserver installation log file |
+ | | ldapserver_start.log | Oldapserver startup log file |
+ | | ldapserver_status.log | Log file that records the status of the Oldapserver process |
+ | | ldapserver_stop.log | Oldapserver stop log file |
+ | | ldapserver_wrap.log | Oldapserver service management log file |
+ | | ldapserver_uninstall.log | Oldapserver uninstallation log file |
+ | | restart_service.log | Oldapserver restart log file |
+ | | ldapserver_unlockUser.log | Log file that records information about unlocking LDAP users and managing accounts |
+ +--------------------+------------------------------------------------+------------------------------------------------+
+ | metric_agent | gc.log | MetricAgent JVM GC log file |
+ | | metric_agent.log | Run log file of MetricAgent |
+ | | metric_agent_qps.log | Log file that records MetricAgent internal queue length and QPS information |
+ | | metric_agent_root.log | All run logs of MetricAgent |
+ | | start.log | Log file that records MetricAgent startup and stop information |
+ +--------------------+------------------------------------------------+------------------------------------------------+
+ | OMM | omsconfig.log | OMS configuration log file |
+ | | check_oms_heartbeat.log | OMS heartbeat log file |
+ | | monitor.log | OMS monitoring log file |
+ | | ha_monitor.log | HA_Monitor operation log file |
+ | | ha.log | HA operation log file |
+ | | fms.log | Alarm log file |
+ | | fms_ha.log | HA alarm monitoring log file |
+ | | fms_script.log | Alarm control log file |
+ +--------------------+------------------------------------------------+------------------------------------------------+
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | config.log | Alarm configuration log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | iam.log | IAM log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | iam_script.log | IAM control log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | iam_ha.log | IAM HA monitoring log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | config.log | IAM configuration log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | operatelog.log | IAM operation log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | heartbeatcheck_ha.log | OMS heartbeat HA monitoring log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | install_oms.log | OMS installation log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | pms_ha.log | HA monitoring log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | pms_script.log | Monitoring control log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | config.log | Monitoring configuration log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | plugin.log | Monitoring plug-in run log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | pms.log | Monitoring log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | ha.log | HA run log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | cep_ha.log | CEP HA monitoring log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | cep_script.log | CEP control log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | cep.log | CEP log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | config.log | CEP configuration log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | omm_gaussdba.log | GaussDB HA monitoring log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | gaussdb-.log | GaussDB run log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | gs_ctl-.log | Archive log file of GaussDB control logs | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | gs_ctl-current.log | GaussDB control log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | gs_guc-current.log | GaussDB operation log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | encrypt.log | OMM encryption log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | omm_agent_ctl.log | OMA control log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | oma_monitor.log | OMA monitoring log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | install_oma.log | OMA installation log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | config_oma.log | OMA configuration log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | omm_agent.log | OMA run log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | acs.log | ACS resource log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | aos.log | AOS resource log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | controller.log | Controller resource log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | floatip.log | Floating IP address resource log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | ha_ntp.log | NTP resource log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | httpd.log | HTTPd resource log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | okerberos.log | Okerberos resource log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | oldap.log | OLdap resource log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | tomcat.log | Tomcat resource log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | send_alarm.log | Run log file of the HA alarm sending script of the management node | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | feed_watchdog.log | feed_watchdog resource log | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | Timestamp | restart_stamp | NodeAgent startup time log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | Tomcat | cas.log and localhost_access_cas_log.log | CAS run log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | catalina.log, catalina.out, host-manager.log, localhost.log, and manager.log | Tomcat run log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | localhost_access_web_log.log | Log file that records the access to 
REST APIs of FusionInsight Manager | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | web.log | Run log file of the Web process | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | northbound_ftp_sftp.log and snmp.log | Northbound log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | perfStats.log | Performance statistics log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | Watchdog | watchdog.log and feed_watchdog.log | watchdog.log run log file | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | update-service | omm_upd_server.log | UPDServer run log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | omm_upd_agent.log | UPDAgent run log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | update-manager.log | UPDManager run log file | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | install.log | Installation log file of the upgrade service | + 
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | uninstall.log | Uninstallation log file of the upgrade service | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | catalina.<*Time*>.log, catalina.out, host-manager.<*Time*>.log, localhost.<*Time*>.log, manager.<*Time*>.log, manager_access_log.<*Time*>.txt, web_service_access_log.<*Time*>.txt, catalina.log, gc-update-service.log.0.current, update-manager.controller, update-web-service.controller, update-web-service.log, commit_rm_distributed.log, commit_rm_upload_package.log, common_omagent_operator.log, forbid_monitor.log, initialize_package_atoms.log, initialize_unzip_pack.log, omm-upd.log, register_patch_pack.log, resume_monitor.logrollback_clear_patch.log, unregister_patch_pack.log, update-rcommupd.log, update-rcupdatemanager.log, and update-service.log | Run log file of the upgrade service | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | Upgrade | upgrade.log_<*Time*> | OMS upgrade log file | + 
+   | | rollback.log_<*Time*> | OMS rollback log file |
+   | sudo | sudo.log | Sudo script execution log file |
+
+Log Levels
+----------
+
+:ref:`Table 2 <admin_guide_000194__tce0bb52db5fc4d53a43987beff277cb7>` describes the log levels provided by Manager. The log levels are FATAL, ERROR, WARN, INFO, and DEBUG in descending order of severity. The program prints only logs whose level is equal to or higher than the configured level, so the higher the configured level, the fewer logs are printed.
+
+.. _admin_guide_000194__tce0bb52db5fc4d53a43987beff277cb7:
+
+.. table:: **Table 2** Log levels
+
+   | Level | Description |
+   | FATAL | Logs of this level record fatal error information about the current event processing that may result in a system crash. |
+   | ERROR | Logs of this level record error information about the current event processing, which indicates that system running is abnormal. |
+   | WARN | Logs of this level record abnormal information about the current event processing. These abnormalities will not result in system faults. |
+   | INFO | Logs of this level record normal running status information about the system and events. |
+   | DEBUG | Logs of this level record system information and debugging information. |
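To make the threshold rule concrete, the following is a minimal sketch in Python's standard ``logging`` module. It is illustrative only: Manager components configure their log levels in their own component configuration files, not through Python, and the mapping of FATAL to Python's CRITICAL level is an assumption made purely for this demonstration.

.. code-block:: python

   import logging

   # Illustrative sketch of the threshold rule from Table 2
   # (FATAL > ERROR > WARN > INFO > DEBUG).
   logging.basicConfig(format="%(asctime)s|%(levelname)s|%(threadName)s|%(message)s")
   log = logging.getLogger("manager-demo")

   log.setLevel(logging.WARNING)      # suppose the configured level is WARN

   log.debug("not printed: DEBUG is below the WARN threshold")
   log.info("not printed: INFO is below the WARN threshold")
   log.warning("printed: WARN meets the threshold")
   log.error("printed: ERROR is above the threshold")
   log.critical("printed: FATAL (CRITICAL in this sketch) is the highest level")

Raising the configured level from WARN to ERROR would additionally suppress the WARN message, which is the behavior the paragraph above describes.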
| + +-------+------------------------------------------------------------------------------------------------------------------------------------------+ + | ERROR | Logs of this level record error information about the current event processing, which indicates that system running is abnormal. | + +-------+------------------------------------------------------------------------------------------------------------------------------------------+ + | WARN | Logs of this level record abnormal information about the current event processing. These abnormalities will not result in system faults. | + +-------+------------------------------------------------------------------------------------------------------------------------------------------+ + | INFO | Logs of this level record normal running status information about the system and events. | + +-------+------------------------------------------------------------------------------------------------------------------------------------------+ + | DEBUG | Logs of this level record system information and debugging information. | + +-------+------------------------------------------------------------------------------------------------------------------------------------------+ + +Log Formats +----------- + +The following table lists the Manager log formats. + +.. table:: **Table 3** Log formats + + +----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Type | Component | Format | Example | + +========================================================================================+========================================================================================+======================================================================================================================================================+=============================================================================================================================================================================+ + | Controller, HTTPd, Logman, NodeAgent, oKerberos, oldapserver, OMM, Tomcat, and upgrade | Controller, HTTPd, Logman, NodeAgent, oKerberos, oldapserver, OMM, Tomcat, and upgrade | ||<*Name of the thread for which the log is generated*>|<*Log message*>|<*Location where the log event occurs*> | 2020-06-30 00:37:09,067 INFO [pool-1-thread-1] Completed Discovering Node. 
com.xxx.hadoop.om.controller.tasks.nodesetup.DiscoverNodeTask.execute(DiscoverNodeTask.java:299) | + +----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/viewing_role_instance_logs.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/viewing_role_instance_logs.rst new file mode 100644 index 0000000..8372310 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/log_management/viewing_role_instance_logs.rst @@ -0,0 +1,31 @@ +:original_name: admin_guide_000197.html + +.. _admin_guide_000197: + +Viewing Role Instance Logs +========================== + +Scenario +-------- + +FusionInsight Manager allows users to view logs of each role instance. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster**, click the name of the desired cluster, choose **Services**, and click a service name. Then click the **Instance** tab of the service and click the name of the target instance to access the instance status page. + +#. In the **Log** area, click the name of a log file to preview its content online. + + .. note:: + + - On the **Hosts** page, click a host name. In the instance list of the host, you can view the log files of all role instances on the host. + - By default, a maximum of 100 lines of logs can be displayed. You can click **Load More** to view more logs. Click **Download** to download the log file to the local PC. For details about how to download service logs in batches, see :ref:`Log Download `. + + + .. figure:: /_static/images/en-us_image_0000001318160568.png + :alt: **Figure 1** Viewing instance logs + + **Figure 1** Viewing instance logs diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/configuring_the_alarm_masking_status.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/configuring_the_alarm_masking_status.rst new file mode 100644 index 0000000..cde5aa9 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/configuring_the_alarm_masking_status.rst @@ -0,0 +1,48 @@ +:original_name: admin_guide_000072.html + +.. _admin_guide_000072: + +Configuring the Alarm Masking Status +==================================== + +Scenario +-------- + +If you do not want FusionInsight Manager to report specified alarms in the following scenarios, you can manually mask the alarms. + +- Some unimportant alarms and minor alarms need to be masked. +- When a third-party product is integrated with FusionInsight, some alarms of the product are duplicated with the alarms of FusionInsight and need to be masked. +- When the deployment environment is special, certain alarms may be falsely reported and need to be masked. + +After an alarm is masked, new alarms with the same ID as the alarm are neither displayed on the **Alarm** page nor counted. The reported alarms are still displayed. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. 
Choose **O&M** > **Alarm** > **Masking Setting**. + +#. In the **Masking Setting** area, select the specified service or module. + +#. Select an alarm from the alarm list. + + + .. figure:: /_static/images/en-us_image_0000001369953797.png + :alt: **Figure 1** Masking an alarm + + **Figure 1** Masking an alarm + + The information about the alarm is displayed, including the alarm name, ID, severity, masking status, and the operations that can be performed on the alarm. + + - The masking status includes **Display** and **Masking**. + - Operations include **Masking** and **Help**. + + .. note:: + + You can filter specified alarms based on the masking status and alarm severity. + +#. Set the masking status for an alarm: + + - Click **Masking**. In the displayed dialog box, click **OK** to change the alarm masking status to **Masking**. + - Click **Cancel Masking**. In the dialog box that is displayed, click **OK** to change the masking status of the alarm to **Display**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/configuring_the_threshold.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/configuring_the_threshold.rst new file mode 100644 index 0000000..be3d920 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/configuring_the_threshold.rst @@ -0,0 +1,363 @@ +:original_name: admin_guide_000071.html + +.. _admin_guide_000071: + +Configuring the Threshold +========================= + +Scenario +-------- + +You can configure thresholds for monitoring metrics to monitor their health status on FusionInsight Manager. If the value of a metric is abnormal and the preset conditions are met, the system triggers an alarm and displays the alarm information on the alarm page. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **O&M** > **Alarm** > **Thresholds**. + +#. Select a monitoring metric for a host or service in the cluster. + + + .. figure:: /_static/images/en-us_image_0263899436.png + :alt: **Figure 1** Configuring the threshold for a metric + + **Figure 1** Configuring the threshold for a metric + + For example, after selecting **Host Memory Usage**, the information about this indicator threshold is displayed. + + - If the alarm sending switch is displayed as |image1|, an alarm is triggered if the threshold is reached. + - **Alarm ID** and **Alarm Name**: alarm information triggered against the threshold + - **Trigger Count**: FusionInsight Manager checks whether the value of a monitoring metric reaches the threshold. If the number of consecutive checks reaches the value of **Trigger Count**, an alarm is generated. **Trigger Count** is configurable. + - **Check Period (s)**: interval for the system to check the monitoring metric. + - The rules in the rule list are used to trigger alarms. + +#. Click **Create Rule** to add rules used for monitoring indicators. + + .. 
table:: **Table 1** Monitoring indicator rule parameters + + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ + | Parameter | Description | Example Value | + +=======================+======================================================================================================================================================================================================================================================================================================================================================================================================+=================================+ + | Rule Name | Name of a rule. | CPU_MAX | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ + | Severity | Alarm Severity | - Critical | + | | | - Major | + | | - Critical | - Minor | + | | - Major | - Warning | + | | - Minor | | + | | - Warning | | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ + | Threshold Type | You can use the maximum or minimum value of an indicator as the alarm triggering threshold. If **Threshold Type** is set to **Max value**, the system generates an alarm when the value of the specified indicator is greater than the threshold. If **Threshold Type** is set to **Min value**, the system generates an alarm when the value of the specified indicator is less than the threshold. | - Max value | + | | | - Min value | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ + | Date | This parameter is used to set the date when the rule takes effect. 
| - Daily | + | | | - Weekly | + | | | - Others | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ + | Add Date | This parameter is available only when **Date** is set to **Others**. You can set the date when the rule takes effect. Multiple options are available. | 09-30 | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ + | Thresholds | This parameter is used to set the time range when the rule takes effect. | Start and End Time: 00:00-08:30 | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ + | | Threshold of the rule monitoring metric | Threshold: 10 | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ + + .. note:: + + You can click |image2| or |image3| to add or delete time thresholds. + +#. Click **OK** to save the rules. + +#. Locate the row that contains an added rule, and click **Apply** in the **Operation** column. The value of **Effective** for this rule changes to **Yes**. + + A new rule can be applied only after you click **Cancel** for an existing rule. + +Monitoring Metric Reference +--------------------------- + +FusionInsight Manager alarm monitoring metrics are classified as node information metrics and cluster service metrics. :ref:`Table 2 ` describes the metrics for which you can configure thresholds on nodes. + +.. _admin_guide_000071__table4447741: + +.. 
table:: **Table 2** Node monitoring metrics + + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Metric Group | Metric | Description | Default Threshold | + +=================+===============================+=======================================================================================================================================================================================================================+===================+ + | CPU | Host CPU Usage | This indicator reflects the computing and control capabilities of the current cluster in a measurement period. By observing the indicator value, you can better understand the overall resource usage of the cluster. | 90.0% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Disk | Disk Usage | Indicates the disk usage of a host. | 90.0% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Disk Inode Usage | Indicates the disk inode usage in a measurement period. | 80.0% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Memory | Host Memory Usage | Indicates the average memory usage at the current time. | 90.0% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Host Status | Host File Handle Usage | Indicates the usage of file handles of the host in a measurement period. | 80.0% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Host PID Usage | Indicates the PID usage of a host. | 90% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Network Status | TCP Ephemeral Port Usage | Indicates the usage of temporary TCP ports of the host in a measurement period. 
| 80.0% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Network Reading | Read Packet Error Rate | Indicates the read packet error rate of the network interface on the host in a measurement period. | 0.5% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Read Packet Dropped Rate | Indicates the read packet dropped rate of the network interface on the host in a measurement period. | 0.5% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Read Throughput Rate | Indicates the average read throughput (at MAC layer) of the network interface in a measurement period. | 80% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Network Writing | Write Packet Error Rate | Indicates the write packet error rate of the network interface on the host in a measurement period. | 0.5% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Write Packet Dropped Rate | Indicates the write packet dropped rate of the network interface on the host in a measurement period. | 0.5% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Write Throughput Rate | Indicates the average write throughput (at MAC layer) of the network interface in a measurement period. 
| 80% | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Process | Uninterruptible Sleep Process | Number of D state processes on the host in a measurement period | 0 | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | omm Process Usage | omm process usage in a measurement period | 90 | + +-----------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + +.. table:: **Table 3** Cluster service indicators + + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Service | Monitoring Indicator Group Name | Indicator Name | Description | Default Threshold | + +============+=================================+============================================================================================================+=============================================================================================================================================+===================+ + | DBService | Database | Usage of the Number of Database Connections | Indicates the usage of the number of database connections. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Disk Space Usage of the Data Directory | Disk space usage of the data directory | 80% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Flume | Agent | Heap Memory Usage Calculate | Indicates the Flume heap memory usage. | 95.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Flume Direct Memory Usage Statistics | Indicates the Flume direct memory usage. 
| 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Flume Non-heap Memory Usage | Indicates the Flume non-heap memory usage. | 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Total GC duration of Flume process | Indicates the Flume total GC time. | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | HBase | GC | GC time for old generation | Total GC time of RegionServer | 5000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | GC time for old generation | Indicates he total GC time of HMaster. | 5000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | CPU & memory | RegionServer Direct Memory Usage Statistics | Indicates theRegionServerReg direct memory usage. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | RegionServer Heap Memory Usage Statistics | Indicates the RegionServer heap memory usage. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | HMaster Direct Memory Usage | Indicates the HMaster direct memory usage. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | HMaster Heap Memory Usage Statistics | Indicates the HMaster heap memory usage. 
| 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | **Service** | Number of Online Regions of a RegionServer | Number of regions of a RegionServer | 2000 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Region in transaction count over threshold | Number of regions that are in the RIT state and reach the threshold duration | 1 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Replication | Replication sync failed times (RegionServer) | Indicates the number of times that DR data fails to be synchronized. | 1 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Number of Log Files to Be Synchronized in the Active Cluster | Number of log files to be synchronized in the active cluster | 128 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Number of HFiles to Be Synchronized in the Active Cluster | Number of HFiles to be synchronized in the active cluster | 128 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Queue | Compaction Queue Size | Size of the Compaction queue | 100 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | HDFS | File and Block | Lost Blocks | Indicates the number of block copies that the HDFS lacks of. 
| 0 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Blocks Under Replicated | Total number of blocks that need to be replicated by the NameNode | 1000 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | RPC | Average Time of Active NameNode RPC Processing | Indicates the average RPC processing time. | 100 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Average Time of Active NameNode RPC Queuing | Indicates the average RPC queuing time. | 200 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Disk | HDFS Disk Usage | Indicates the HDFS disk usage. | 80% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | DataNode Disk Usage | Indicates the disk usage of DataNodes in the HDFS. | 80% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Percentage of Reserved Space for Replicas of Unused Space | Indicates the percentage of the reserved disk space of all the copies to the total unused disk space of DataNodes. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Resource | Faulty DataNodes | Indicates the number of faulty DataNodes. | 3 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | NameNode Non Heap Memory Usage Statistics | Indicates the percentage of NameNode non-heap memory usage. 
| 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | NameNode Direct Memory Usage Statistics | Indicates the percentage of direct memory used by NameNodes. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | NameNode Heap Memory Usage Statistics | Indicates the percentage of NameNode non-heap memory usage. | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | DataNode Direct Memory Usage Statistics | Indicates the percentage of direct memory used by DataNodes. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | DataNode Heap Memory Usage Statistics | DataNode heap memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | DataNode Heap Memory Usage Statistics | Indicates the percentage of DataNode non-heap memory usage. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Garbage Collection | GC Time (NameNode)/GC Time (DataNode) | Indicates the Garbage collection (GC) duration of NameNodes per minute. | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | GC Time | Indicates the GC duration of DataNodes per minute. | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Hive | HQL | Percentage of HQL Statements That Are Executed Successfully by Hive | Indicates the percentage of HQL statements that are executed successfully by Hive. 
| 90.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Background | Background Thread Usage | Background thread usage | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | GC | Total GC time of MetaStore | Indicates the total GC time of MetaStore. | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Total GC Time in Milliseconds | Indicates the total GC time of HiveServer. | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Capacity | Percentage of HDFS Space Used by Hive to the Available Space | Indicates the percentage of HDFS space used by Hive to the available space. | 85.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | CPU & memory | MetaStore Direct Memory Usage Statistics | MetaStore direct memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | MetaStore Non-Heap Memory Usage Statistics | MetaStore non-heap memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | MetaStore Heap Memory Usage Statistics | MetaStore heap memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | HiveServer Direct Memory Usage Statistics | HiveServer direct memory usage | 95% | + 
+------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | HiveServer Non-Heap Memory Usage Statistics | HiveServer non-heap memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | HiveServer Heap Memory Usage Statistics | HiveServer heap memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Session | Percentage of Sessions Connected to the HiveServer to Maximum Number of Sessions Allowed by the HiveServer | Indicates the percentage of the number of sessions connected to the HiveServer to the maximum number of sessions allowed by the HiveServer. | 90.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Kafka | Partition | Percentage of Partitions That Are Not Completely Synchronized | Indicates the percentage of partitions that are not completely synchronized to total partitions. | 50% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Others | Unavailable Partition Percentage | Percentage of unavailable partitions of each Kafka topic | 40% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | User Connection Usage on Broker | Usage of user connections on Broker | 80% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Disk | Broker Disk Usage | Indicates the disk usage of the disk where the Broker data directory is located. 
| 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Disk I/O Rate of a Broker | I/O usage of the disk where the Broker data directory is located | 80% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Process | Broker GC Duration per Minute | Indicates the GC duration of the Broker process per minute. | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Heap Memory Usage of Kafka | Indicates the Kafka heap memory usage. | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Kafka Direct Memory Usage | Indicates the Kafka direct memory usage. | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Loader | Memory | Heap Memory Usage Calculate | Indicates the Loader heap memory usage. | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Direct Memory Usage of Loader | Indicates the Loader direct memory usage. | 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Non-heap Memory Usage of Loader | Indicates the Loader non-heap memory usage. | 80% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | GC | Total GC time of Loader | Indicates the total GC time of Loader. 
| 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | MapReduce | Garbage Collection | GC Time | Indicates the GC time. | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Resource | JobHistoryServer Direct Memory Usage Statistics | Indicates the JobHistoryServer direct memory usage. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | JobHistoryServer Non Heap Memory Usage Statistics | Indicates the JobHistoryServer non-heap memory usage. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | JobHistoryServer Heap Memory Usage Statistics | Indicates the JobHistoryServer non-heap memory usage. | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Oozie | Memory | Heap Memory Usage Calculate | Indicates the Oozie heap memory usage. | 95.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Oozie Direct Memory Usage | Indicates the Oozie direct memory usage. | 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Oozie Non-heap Memory Usage | Indicates the Oozie non-heap memory usage. | 80% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | GC | Total GC duration of Oozie | Indicates the Oozie total GC time. 
| 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Spark2x | Memory | JDBCServer2x Heap Memory Usage Statistics | JDBCServer2x heap memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | JDBCServer2x Direct Memory Usage Statistics | JDBCServer2x direct memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | JDBCServer2x Non-Heap Memory Usage Statistics | JDBCServer2x non-heap memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | JobHistory2x Direct Memory Usage Statistics | JobHistory2x direct memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | JobHistory2x Non-Heap Memory Usage Statistics | JobHistory2x non-heap memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | JobHistory2x Heap Memory Usage Statistics | JobHistory2x heap memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | IndexServer2x Direct Memory Usage Statistics | IndexServer2x direct memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | IndexServer2x Heap Memory Usage Statistics | IndexServer2x heap memory usage | 95% | + 
+------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | IndexServer2x Non-Heap Memory Usage Statistics | IndexServer2x non-heap memory usage | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | GC Count | Full GC Number of JDBCServer2x | Total GC number of JDBCServer2x | 12 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Full GC Number of JobHistory2x | Total GC number of JobHistory2x | 12 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Full GC Number of IndexServer2x | Total GC number of IndexServer2x | 12 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | GC Time | Total GC Time in Milliseconds | Total GC time of JDBCServer2x | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Total GC Time in Milliseconds | Total GC time of JobHistory2x | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Total GC Time in Milliseconds | Total GC time of IndexServer2x | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Storm | Cluster | Number of Available Supervisors | Indicates the number of available Supervisor processes in the cluster in a measurement period. 
| 1 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Slot Usage | Indicates the slot usage in the cluster in a measurement period. | 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Nimbus | Heap Memory Usage Calculate | Indicates the Nimbus heap memory usage. | 80% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Yarn | Resources | NodeManager Direct Memory Usage Statistics | Indicates the percentage of direct memory used by NodeManagers. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | NodeManager Heap Memory Usage Statistics | Indicates the percentage of NodeManager heap memory usage. | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | NodeManager Non Heap Memory Usage Statistics | Indicates the percentage of NodeManager non-heap memory usage. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | ResourceManager Direct Memory Usage Statistics | Indicates the Kafka direct memory usage. | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | ResourceManager Heap Memory Usage Statistics | Indicates the ResourceManager heap memory usage. | 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | ResourceManager Non Heap Memory Usage Statistics | Indicates the ResourceManager non-heap memory usage. 
| 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Garbage collection | GC Time | Indicates the GC duration of NodeManager per minute. | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | GC Time | Indicates the GC duration of ResourceManager per minute. | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Others | Failed Applications of root queue | Number of failed tasks in the root queue | 50 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Terminated Applications of root queue | Number of killed tasks in the root queue | 50 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | CPU & memory | Pending Memory | Pending memory capacity | 83886080MB | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | Application | Pending Applications | Pending tasks | 60 | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | ZooKeeper | Connection | ZooKeeper Connections Usage | Indicates the percentage of the used connections to the total connections of ZooKeeper. | 80% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | CPU & memory | Directmemory Usage Calculate | Indicates the ZooKeeper heap memory usage. 
| 95% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Heap Memory Usage Calculate | Indicates the ZooKeeper direct memory usage. | 80% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | GC | ZooKeeper GC Duration per Minute | Indicates the GC time of ZooKeeper every minute. | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | meta | OBS data write operation | Success Rate for Calling the OBS Write API | Success rate for calling the OBS data read API | 99.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | OBS Meta data Operations | Average Time for Calling the OBS Metadata API | Average time for calling the OBS metadata API | 500ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Success Rate for Calling the OBS Metadata API | Success rate for calling the OBS metadata API | 99.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | OBS data read operation | Success Rate for Calling the OBS Data Read API | Success rate for calling the OBS data read API | 99.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | Ranger | GC | UserSync GC Duration | UserSync garbage collection (GC) duration | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | RangerAdmin GC Duration | RangerAdmin GC duration | 12000 ms | + 
+------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | TagSync GC Duration | TagSync GC duration | 12000 ms | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | CPU & memory | UserSync Non-Heap Memory Usage | UserSync non-heap memory usage | 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | UserSync Direct Memory Usage | UserSync direct memory usage | 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | UserSync Heap Memory Usage | UserSync heap memory usage | 95.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | RangerAdmin Non-Heap Memory Usage | RangerAdmin non-heap memory usage | 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | RangerAdmin Heap Memory Usage | RangerAdmin heap memory usage | 95.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | RangerAdmin Direct Memory Usage | RangerAdmin direct memory usage | 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | TagSync Direct Memory Usage | TagSync direct memory usage | 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | TagSync 
Non-Heap Memory Usage | TagSync non-heap memory usage | 80.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | TagSync Heap Memory Usage | TagSync heap memory usage | 95.0% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | ClickHouse | Cluster Quota | Clickhouse service quantity quota usage in ZooKeeper | Quota of the ZooKeeper nodes used by a ClickHouse service | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + | | | Capacity quota usage of the Clickhouse service in ZooKeeper | Capacity quota of ZooKeeper directory used by the ClickHouse service | 90% | + +------------+---------------------------------+------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------+ + +.. |image1| image:: /_static/images/en-us_image_0263899498.png +.. |image2| image:: /_static/images/en-us_image_0263899452.png +.. |image3| image:: /_static/images/en-us_image_0263899582.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/index.rst new file mode 100644 index 0000000..8ffe2e7 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000069.html + +.. _admin_guide_000069: + +Alarms +====== + +- :ref:`Overview of Alarms and Events ` +- :ref:`Configuring the Threshold ` +- :ref:`Configuring the Alarm Masking Status ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview_of_alarms_and_events + configuring_the_threshold + configuring_the_alarm_masking_status diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/overview_of_alarms_and_events.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/overview_of_alarms_and_events.rst new file mode 100644 index 0000000..87a7d1c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/alarms/overview_of_alarms_and_events.rst @@ -0,0 +1,122 @@ +:original_name: admin_guide_000070.html + +.. _admin_guide_000070: + +Overview of Alarms and Events +============================= + +Alarms +------ + +Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Alarms**. You can view information about alarms reported by all clusters, including the alarm name, ID, severity, and generation time. By default, the latest 10 alarms are displayed on each page. 
+ +You can click |image1| on the left of an alarm to view detailed alarm parameters. :ref:`Table 1 ` describes the parameters. + +.. _admin_guide_000070__table19183175495311: + +.. table:: **Table 1** Alarm parameters + + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===========================================================================================+ + | Alarm ID | Alarm ID | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Alam Name | Alarm name | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Severity | Alarm severity. Value options are **Critical**, **Major**, **Minor**, and **Suggestion**. | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Generated | Time when an alarm is generated | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Cleared | Time when an alarm is cleared. If the alarm is not cleared, **--** is displayed. | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Source | Cluster name | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Object | Service, process, or module that triggers the alarm | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Automatically Cleared | Whether the alarm can be automatically cleared after the fault is rectified | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Alarm Status | Current status of the alarm. Value options are **Auto**, **Manual**, and **Uncleared**. | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Alarm Cause | Indicates the possible cause of an alarm. | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Serial Number | Indicates the number of alarms generated by the system. | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Additional Information | Indicates the error information. | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Location | Detailed information for locating the alarm, which includes the following: | + | | | + | | - **Source**: cluster for which the alarm is generated | + | | - **ServiceName**: service for which the alarm is generated | + | | - **RoleName**: role for which the alarm is generated | + | | - **HostName**: host for which the alarm is generated | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + +**Manage alarms.** + +- Click **Export All** to export all alarm details. 
+- If multiple alarms have been handled, you can select one or more alarms to be cleared and click **Clear Alarm** to clear the alarms in batches. A maximum of 300 alarms can be cleared in each batch. +- You can click |image2| to manually refresh the current page and click |image3| to filter columns to display. +- You can filter alarms by object or cluster. +- You can click **Advanced Search** to search for alarms by alarm ID, name, type, severity, start time, or end time. Click **Search** to filter alarms that meet the search criteria. Click **Advanced Search** again to view the number of search criteria that you have configured. +- You can click **Clear**, **Mask**, or **View Help** to perform corresponding operations on an alarm. +- If there are a large number of alarms, you can click **View by Category** to sort uncleared alarms by alarm ID. After alarms are classified, click the number of uncleared alarms to view alarm details. + +Events +------ + +Log in to FusionInsight Manager and choose **O&M** > **Alarm** > **Events**. On the **Events** page that is displayed, you can view information about all events in the cluster, including the event name, ID, severity, generation time, object, and location. By default, the latest 10 events are displayed on each page. + + +.. figure:: /_static/images/en-us_image_0263899271.png + :alt: **Figure 1** Events page + + **Figure 1** Events page + +You can click |image4| on the left of an event to view detailed event parameters. :ref:`Table 2 ` describes the parameters. + +.. _admin_guide_000070__table13671201172912: + +.. table:: **Table 2** Event parameters + + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===========================================================================================+ + | Event ID | Event ID | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Event Name | Event name | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Severity | Event severity. Value options are **Critical**, **Major**, **Minor**, and **Suggestion**. 
| + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Generated | Time when an event is generated | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Object | Object for which the event may be generated | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Serial Number | Number of the event generated by the system | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Location | Detailed information for locating the event, which includes the following: | + | | | + | | - **Source**: cluster for which the event is generated | + | | - **ServiceName**: service for which the event is generated | + | | - **RoleName**: role for which the event is generated | + | | - **HostName**: host for which the event is generated | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Additional Information | Indicates the error information. | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Event Cause | Indicates the possible cause of an event. | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + | Source | Cluster name | + +-----------------------------------+-------------------------------------------------------------------------------------------+ + +**Manage events.** + +- Click **Export All** to export all event details. +- You can click |image5| to manually refresh the current page and click |image6| to filter columns to display. +- You can filter events by object or cluster. +- You can click **Advanced Search** to search for events by event ID, name, severity, start time, or end time. + +.. |image1| image:: /_static/images/en-us_image_0263899504.png +.. |image2| image:: /_static/images/en-us_image_0263899662.png +.. |image3| image:: /_static/images/en-us_image_0263899597.png +.. |image4| image:: /_static/images/en-us_image_0263899384.png +.. |image5| image:: /_static/images/en-us_image_0263899662.png +.. |image6| image:: /_static/images/en-us_image_0000001318636944.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/creating_a_backup_restoration_task.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/creating_a_backup_restoration_task.rst new file mode 100644 index 0000000..bb1ceb7 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/creating_a_backup_restoration_task.rst @@ -0,0 +1,33 @@ +:original_name: admin_guide_000082.html + +.. _admin_guide_000082: + +Creating a Backup Restoration Task +================================== + +Scenario +-------- + +You can create a backup restoration task on FusionInsight Manager. After the restoration task is executed, the specified backup data is restored to the cluster. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **O&M** > **Backup and Restoration** > **Restoration Management**. On the page that is displayed, click **Create**. 
+ +#. Configure **Task Name**. + +#. Set **Recovery Object** to **OMS** or the cluster whose data you want to restore. + +#. Set the required parameters in the **Recovery Configuration** area. + + - Metadata and service data can be restored. + - For details about how to restore data of different components, see :ref:`Backup and Recovery Management `. + +#. Click **OK** to save the configurations. + +#. In the restoration task list, you can view the created restoration tasks. + + Locate the row containing the target restoration task, click **Start** in the **Operation** column to execute the restoration task immediately. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/creating_a_backup_task.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/creating_a_backup_task.rst new file mode 100644 index 0000000..bc8a397 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/creating_a_backup_task.rst @@ -0,0 +1,53 @@ +:original_name: admin_guide_000081.html + +.. _admin_guide_000081: + +Creating a Backup Task +====================== + +Scenario +-------- + +You can create backup tasks on FusionInsight Manager. Executing backup tasks backs up related data. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **O&M** > **Backup and Restoration** > **Backup Management**. On the page that is displayed, click **Create**. + +#. Set **Backup Object** to **OMS** or the cluster whose data you want to back up. + +#. Enter a task name in the **Name** text box. + +#. Set **Mode** to **Periodic** or **Manual** as required. + + .. table:: **Table 1** Backup types + + +-----------------------+-----------------------+-------------------------------------------------------------------------------+ + | Type | Parameter | Description | + +=======================+=======================+===============================================================================+ + | Periodic backup | Start Time | Indicates the time when a periodic backup task is started for the first time. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------+ + | | Period | Task execution interval. Value options are **Hours** and **Days**. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------+ + | | Backup Policy | The following policies can be selected: | + | | | | + | | | - Full backup at the first time and subsequent incremental backup | + | | | - Full backup every time | + | | | - Full backup once every n times | + +-----------------------+-----------------------+-------------------------------------------------------------------------------+ + | Manual backup | N/A | You need to manually execute the task to back up data. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------+ + +#. Set required parameters in the **Configuration** area. + + - Metadata and service data can be backed up. + - For details about how to back up data of different components, see :ref:`Backup and Recovery Management `. + +#. Click **OK** to save the configurations. + +#. In the backup task list, you can view the created backup task. 
+ + Locate the row that contains the target backup task, choose **More** > **Back Up Now** in the **Operation** column to execute the task immediately. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/index.rst new file mode 100644 index 0000000..18e7842 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000080.html + +.. _admin_guide_000080: + +Configuring Backup and Backup Restoration +========================================= + +- :ref:`Creating a Backup Task ` +- :ref:`Creating a Backup Restoration Task ` +- :ref:`Managing Backup and Backup Restoration Tasks ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + creating_a_backup_task + creating_a_backup_restoration_task + managing_backup_and_backup_restoration_tasks diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/managing_backup_and_backup_restoration_tasks.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/managing_backup_and_backup_restoration_tasks.rst new file mode 100644 index 0000000..868759a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/configuring_backup_and_backup_restoration/managing_backup_and_backup_restoration_tasks.rst @@ -0,0 +1,44 @@ +:original_name: admin_guide_000083.html + +.. _admin_guide_000083: + +Managing Backup and Backup Restoration Tasks +============================================ + +Scenario +-------- + +You can also maintain and manage backup restoration tasks on FusionInsight Manager. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **O&M** > **Backup and Restoration** > **Backup Management** or **Restoration Management**. +#. In the **Operation** column of the specified task in the task list, select the operation to be performed. + + .. table:: **Table 1** Maintenance and management operations + + +-------------------------------------------------+-------------------------------------------------------------------------------------------------------------+ + | Operation Entry | Description | + +=================================================+=============================================================================================================+ + | **Config** | Modify parameters for the backup task. | + +-------------------------------------------------+-------------------------------------------------------------------------------------------------------------+ + | **Recover** | After some service data is successfully backed up, you can use this function to quickly restore data. | + +-------------------------------------------------+-------------------------------------------------------------------------------------------------------------+ + | **More** > **Back Up Now** | Perform this operation to execute the backup task immediately. | + +-------------------------------------------------+-------------------------------------------------------------------------------------------------------------+ + | **More** > **Stop** | Perform this operation to stop a running task. 
| + +-------------------------------------------------+-------------------------------------------------------------------------------------------------------------+ + | **More** > **Delete** or **Delete** | This operation is used to delete tasks. | + +-------------------------------------------------+-------------------------------------------------------------------------------------------------------------+ + | **More** > **Suspend** | Perform this operation to disable the automatic backup task function. | + +-------------------------------------------------+-------------------------------------------------------------------------------------------------------------+ + | **More** > **Resume** | Perform this operation to enable the automatic backup task function. | + +-------------------------------------------------+-------------------------------------------------------------------------------------------------------------+ + | **More** > **View History** or **View History** | Perform this operation to switch to the task run log page to view the task running details and backup path. | + +-------------------------------------------------+-------------------------------------------------------------------------------------------------------------+ + | **View** | Perform this operation to check the parameter settings of the restoration task. | + +-------------------------------------------------+-------------------------------------------------------------------------------------------------------------+ + | **Start** | Perform this operation to run the restoration task. | + +-------------------------------------------------+-------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/index.rst new file mode 100644 index 0000000..b8c8a34 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/index.rst @@ -0,0 +1,20 @@ +:original_name: admin_guide_000068.html + +.. _admin_guide_000068: + +O&M +=== + +- :ref:`Alarms ` +- :ref:`Log ` +- :ref:`Perform a Health Check ` +- :ref:`Configuring Backup and Backup Restoration ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + alarms/index + log/index + perform_a_health_check/index + configuring_backup_and_backup_restoration/index diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/log/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/log/index.rst new file mode 100644 index 0000000..8267613 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/log/index.rst @@ -0,0 +1,16 @@ +:original_name: admin_guide_000073.html + +.. _admin_guide_000073: + +Log +=== + +- :ref:`Log Online Search ` +- :ref:`Log Download ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + log_online_search + log_download diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/log/log_download.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/log/log_download.rst new file mode 100644 index 0000000..22d272f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/log/log_download.rst @@ -0,0 +1,40 @@ +:original_name: admin_guide_000075.html + +.. 
_admin_guide_000075: + +Log Download +============ + +Scenario +-------- + +FusionInsight Manager allows you to batch export logs generated on all instances of each service. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **O&M** > **Log** > **Download**. + +#. Select a log download range: + + a. **Service**: Click |image1| and select a service. + b. **Host**: Enter the IP address of the host where the service is deployed. You can also click |image2| to select the required host. + c. Click |image3| in the upper right corner and configure **Start Time** and **End Time**. + +#. Click **Download**. + + The downloaded log package contains the topology information of the start time and end time, helping you quickly find the log you need. + + The topology file is named in the format of **topo_<**\ *Topology structure change time*\ **>.txt**. The file contains the node IP address, host name, and service instances that reside on the node. (OMS nodes are identified by **Manager:Manager**.) + + Example: + + .. code-block:: + + 192.168.204.124|suse-124|DBService:DBServer;KrbClient:KerberosClient;LdapClient:SlapdClient;LdapServer:SlapdServer;Manager:Manager;meta:meta + +.. |image1| image:: /_static/images/en-us_image_0263899401.png +.. |image2| image:: /_static/images/en-us_image_0263899334.png +.. |image3| image:: /_static/images/en-us_image_0000001369965781.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/log/log_online_search.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/log/log_online_search.rst new file mode 100644 index 0000000..55665cc --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/log/log_online_search.rst @@ -0,0 +1,105 @@ +:original_name: admin_guide_000074.html + +.. _admin_guide_000074: + +Log Online Search +================= + +Scenario +-------- + +FusionInsight Manager allows you to search for logs online and view the log content of components to locate faults. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **O&M** > **Log** > **Online Search**. + +#. Configure the parameters listed in :ref:`Table 1 ` to search for the logs you need. You can select a default log search duration (including **0.5h**, **1h**, **2h**, **6h**, **12h**, **1d**, **1w**, and **1m**), or click |image1| to customize **Start Data** and **End Data**. + + .. _admin_guide_000074__table14922145885914: + + .. 
table:: **Table 1** Log search parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=============================================================================================================================================================================================================================================================================================+ + | Search Content | Keywords or regular expression to be searched for | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Service or module for which you want to query logs | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | File | Log files to be searched for when only one role is selected | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Lowest Log Level | Lowest level of logs to be queried. After you select a level, the logs of this level and higher levels are displayed. | + | | | + | | The levels in ascending order are as follows: | + | | | + | | TRACE < DEBUG < INFO < WARN < ERROR < FATAL | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Host Scope | - You can click |image2| to select hosts. | + | | - Enter the host name of the node for which you want to query logs or the IP address of the management plane. | + | | - Use commas (,) to separate IP addresses, for example, **192.168.10.10**,\ **192.168.10.11**. | + | | - Use hyphens (-) to indicate an IP address segment if the IP addresses are consecutive, for example, **192.168.10.[10-20]**. | + | | - Use hyphens (-) to indicate an IP address segment if the IP addresses are consecutive, and use commas (,) to separate IP address segments, for example, **192.168.10.[10-20,30-40]**. | + | | | + | | .. note:: | + | | | + | | - If this parameter is not specified, all hosts are selected by default. | + | | - A maximum of 10 expressions can be entered at a time. | + | | - A maximum of 2,000 hosts can be matched for all entered expressions at a time. 
| + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Advanced Configurations | - **Max Quantity**: maximum number of logs that can be displayed at a time. If the number of queried logs exceeds the value of this parameter, the earliest logs will be ignored. If this parameter is not set, the maximum number of logs that can be displayed at a time is not limited. | + | | - **Timeout Duration**: log query timeout duration. This parameter is used to limit the maximum log query time on each node. When the query times out, the query is stopped and the logs that have been searched for are still displayed. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **Search**. :ref:`Table 2 ` describes the fields in search results. + + .. _admin_guide_000074__table92081419119: + + .. table:: **Table 2** Parameters in search results + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+============================================================================================================================================================================================================================================================================================================+ + | Time | Time when a line of log is generated | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Source Cluster | Cluster for which the log is generated | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Host Name | Host name of the node where the log file recording the line of log is located | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Location | Path of the log file recording the line of log | + | | | + | | Click the location information to go to the online log browsing page. 
By default, 100 lines of logs before and 100 lines after the line of log are displayed. You can click **Load More** on the top or bottom of the page to view more logs. Click **Download** to download the log file to the local PC. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Line No. | Line number of a line of log in the log file | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Level | Level of the line of log | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log | Log content | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + You can click **Stop** to forcibly stop the search. You can view the search results in the list. + +#. Click **Filter** to filter the logs to display on the page. :ref:`Table 3 ` lists the fields that you can use to filter logs. After you configure these parameters, click **Filter** to search for logs meeting the search criteria. You can click **Reset** to clear the information that you have filled in. + + .. _admin_guide_000074__table5795173012197: + + .. table:: **Table 3** Parameters for filtering logs + + ============== ========================================== + Parameter Description + ============== ========================================== + Keywords Keywords of the losg to be searched for + Host Name Name of the host to be searched for + Location Path of the log file to be searched for + Started Start time for logs to be searched for + Completed End time for logs to be searched for + Source Cluster Cluster for which logs need to be searched + ============== ========================================== + +.. |image1| image:: /_static/images/en-us_image_0000001369765661.png +.. |image2| image:: /_static/images/en-us_image_0263899392.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/index.rst new file mode 100644 index 0000000..ac414d1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000076.html + +.. 
_admin_guide_000076:
+
+Perform a Health Check
+======================
+
+- :ref:`Viewing a Health Check Task `
+- :ref:`Managing Health Check Reports `
+- :ref:`Modifying Health Check Configuration `
+
+.. toctree::
+   :maxdepth: 1
+   :hidden:
+
+   viewing_a_health_check_task
+   managing_health_check_reports
+   modifying_health_check_configuration
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/managing_health_check_reports.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/managing_health_check_reports.rst
new file mode 100644
index 0000000..a1625ed
--- /dev/null
+++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/managing_health_check_reports.rst
@@ -0,0 +1,18 @@
+:original_name: admin_guide_000078.html
+
+.. _admin_guide_000078:
+
+Managing Health Check Reports
+=============================
+
+Scenario
+--------
+
+FusionInsight Manager allows you to download and delete health check reports.
+
+Procedure
+---------
+
+#. Log in to FusionInsight Manager.
+#. Choose **O&M** > **Health Check**.
+#. Locate the row containing the target health check report and click **Export Report** in the **Operation** column to download the report.
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/modifying_health_check_configuration.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/modifying_health_check_configuration.rst
new file mode 100644
index 0000000..bb0f0ec
--- /dev/null
+++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/modifying_health_check_configuration.rst
@@ -0,0 +1,24 @@
+:original_name: admin_guide_000079.html
+
+.. _admin_guide_000079:
+
+Modifying Health Check Configuration
+====================================
+
+Scenario
+--------
+
+Administrators can enable automatic health check to reduce manual operation time. By default, the automatic health check covers the entire cluster.
+
+Procedure
+---------
+
+#. Log in to FusionInsight Manager.
+
+#. Choose **O&M** > **Health Check** > **Configuration**.
+
+   **Periodic Health Check** indicates whether to enable automatic health check. Select **Enable** to enable automatic health check, or select **Disable** to disable it.
+
+   Set the health check period to **Daily**, **Weekly**, or **Monthly** as required.
+
+#. Click **OK** to save the configurations.
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/viewing_a_health_check_task.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/viewing_a_health_check_task.rst
new file mode 100644
index 0000000..fb9c4cc
--- /dev/null
+++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/o&m/perform_a_health_check/viewing_a_health_check_task.rst
@@ -0,0 +1,46 @@
+:original_name: admin_guide_000077.html
+
+.. _admin_guide_000077:
+
+Viewing a Health Check Task
+===========================
+
+Scenario
+--------
+
+Administrators can view all health check tasks in the health check management center and check whether the cluster is affected after a modification.
+
+Procedure
+---------
+
+#. Log in to FusionInsight Manager.
+
+#. Choose **O&M** > **Health Check**.
+
+   By default, all saved health check reports are listed.
The parameters for a health check report are as follows: + + .. table:: **Table 1** Parametes for a health check report + + +--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +==============+===========================================================================================================================================================================================+ + | Check Object | Object to be checked. You can expand the list to view its details. | + +--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Check result status. Value options are **No problems found**, **Problems found**, and **Checking**. | + +--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Check Type | Entity on which the check is to be performed. Value options are **System**, **Cluster**, **Host**, **Service**, and **OMS**. If you select **Cluster**, all items are checked by default. | + +--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Start Mode | Whether the health check is automatically or manually performed | + +--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Started | Start time of the check | + +--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Completed | End time of the check | + +--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | Operations you can perform. Value options are **Export Report** and **View Help**. | + +--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + - In the upper right corner of the check list, you can filter health checks by check type or status. + - If **Check Type** is **Cluster**, **View Help** is displayed in the **Check Object** drop-down list. + - During a health check, the system determines whether check objects are healthy based on their historical monitoring metric data. 
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/enabling_and_disabling_permission_verification_on_cluster_components.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/enabling_and_disabling_permission_verification_on_cluster_components.rst
new file mode 100644
index 0000000..26fdde6
--- /dev/null
+++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/enabling_and_disabling_permission_verification_on_cluster_components.rst
@@ -0,0 +1,98 @@
+:original_name: admin_guide_000247.html
+
+.. _admin_guide_000247:
+
+Enabling and Disabling Permission Verification on Cluster Components
+======================================================================
+
+Scenario
+--------
+
+By default, HDFS and ZooKeeper verify the permissions of users who attempt to access these services in both security and normal clusters. Users without the required permissions cannot access HDFS or ZooKeeper resources. When a cluster is deployed in normal mode, HBase and YARN do not verify user permissions by default, so all users can access HBase and YARN resources.
+
+Based on actual service requirements, administrators can enable permission verification on HBase and YARN or disable permission verification on HDFS and ZooKeeper in normal clusters.
+
+Impact on the System
+--------------------
+
+After you enable or disable permission verification, the service configuration expires. You need to restart the corresponding service for the configuration to take effect.
+
+Enabling Permission Verification on HBase
+------------------------------------------
+
+#. Log in to FusionInsight Manager.
+
+#. Click **Cluster**, click the name of the desired cluster, choose **Services** > **HBase**, and click **Configurations**.
+
+#. Click **All Configurations**.
+
+#. Search for the parameters **hbase.coprocessor.region.classes**, **hbase.coprocessor.master.classes**, and **hbase.coprocessor.regionserver.classes**.
+
+   Add the coprocessor class **org.apache.hadoop.hbase.security.access.AccessController** to the end of the value of each of the preceding parameters, separated from the existing coprocessors by a comma (,).
+
+#. Click **Save**, click **OK**, and wait for the message "Operation successful" to be displayed.
+
+Disabling Permission Verification on HBase
+-------------------------------------------
+
+.. note::
+
+   After HBase permission verification is disabled, the existing permission data is retained. To delete the permission information, disable permission verification, enter the HBase shell, and delete the **hbase:acl** table.
+
+#. Log in to FusionInsight Manager.
+
+#. Click **Cluster**, click the name of the desired cluster, choose **Services** > **HBase**, and click **Configurations**.
+
+#. Click **All Configurations**.
+
+#. Search for the parameters **hbase.coprocessor.region.classes**, **hbase.coprocessor.master.classes**, and **hbase.coprocessor.regionserver.classes**.
+
+   Delete the coprocessor class **org.apache.hadoop.hbase.security.access.AccessController** from the values of the preceding parameters.
+
+#. Click **Save**, click **OK**, and wait for the message "Operation successful" to be displayed.
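+The following is an illustrative sketch only, not part of the product procedure. It assumes that the AccessController coprocessor has already been removed, HBase has been restarted, the HBase client environment has been sourced on a client node, and you have HBase administrator rights. It shows how the retained ACL data mentioned in the note above could be removed from the HBase shell; run it only if you are sure the stored permission data is no longer needed.
+
+.. code-block::
+
+   # Start the HBase shell first (for example, run "hbase shell" on a node with the HBase client installed).
+   # Then disable and drop the retained ACL table; this permanently deletes all stored HBase permission data.
+   disable 'hbase:acl'
+   drop 'hbase:acl'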
Click **Cluster**, click the name of the desired cluster, choose **Services** > **HDFS**, and click **Configurations**. +#. Click **All Configurations**. +#. Search for parameters **dfs.namenode.acls.enabled** and **dfs.permissions.enabled**. + + - **dfs.namenode.acls.enabled** indicates whether to enable the HDFS ACL. The default value is **true**, indicating that the ACL is enabled. Change the value to **false**. + - **dfs.permissions.enabled** indicates whether to enable permission check for HDFS. The default value is **true**, indicating that permission check is enabled. Change the value to **false**. After the modification, the owner, owner group, and permission of the directories and files in HDFS remain unchanged. + +#. Click **Save**, click **OK**, and wait for the message "Operation successful" to be displayed. + +Enabling Permission Verification on YARN +---------------------------------------- + +#. Log in to FusionInsight Manager. + +#. Click **Cluster**, click the name of the desired cluster, choose **Services** > **Yarn**, and click **Configurations**. + +#. Click **All Configurations**. + +#. Search for parameter **yarn.acl.enable**. + + **yarn.acl.enable** indicates whether to enable permission check for YARN. + + - In normal clusters, the value is set to **false** by default to disable permission check. To enable permission check, change the value to **true**. + - In security clusters, the value is set to **true** by default to enable permission check. + +#. Click **Save**, click **OK**, and wait for the message "Operation successful" to be displayed. + +Disabling Permission Verification on ZooKeeper +---------------------------------------------- + +#. Log in to FusionInsight Manager. + +#. Click **Cluster**, click the name of the desired cluster, choose **Services** > **ZooKeeper**, and click **Configurations**. + +#. Click **All Configurations**. + +#. Search for parameter **skipACL**. + + **skipACL** indicates whether to skip the ZooKeeper permission check. The default value is **no**, indicating that permission check is enabled. Change the value to **yes**. + +#. Click **Save**, click **OK**, and wait for the message "Operation successful" to be displayed. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/index.rst new file mode 100644 index 0000000..cb3931e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/index.rst @@ -0,0 +1,20 @@ +:original_name: admin_guide_000243.html + +.. _admin_guide_000243: + +Account Security Settings +========================= + +- :ref:`Unlocking LDAP Users and Management Accounts ` +- :ref:`Unlocking an Internal System User ` +- :ref:`Enabling and Disabling Permission Verification on Cluster Components ` +- :ref:`Logging In to a Non-Cluster Node Using a Cluster User in Normal Mode ` + +..
toctree:: + :maxdepth: 1 + :hidden: + + unlocking_ldap_users_and_management_accounts + internal_an_internal_system_user + enabling_and_disabling_permission_verification_on_cluster_components + logging_in_to_a_non-cluster_node_using_a_cluster_user_in_normal_mode diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/internal_an_internal_system_user.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/internal_an_internal_system_user.rst new file mode 100644 index 0000000..4689dbd --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/internal_an_internal_system_user.rst @@ -0,0 +1,60 @@ +:original_name: admin_guide_000246.html + +.. _admin_guide_000246: + +Unlocking an Internal System User +================================= + +Scenario +-------- + +If a service is abnormal, an internal system user may be locked. Unlock the user promptly; otherwise, the cluster cannot run properly. For the list of internal system users, see :ref:`User Account List ` in . Internal system users cannot be unlocked on FusionInsight Manager. + +Prerequisites +------------- + +Obtain the default password of the LDAP administrator **cn=root,dc=hadoop,dc=com** by referring to :ref:`User Account List ` in . + +Procedure +--------- + +#. Use the following method to check whether an internal system user is locked: + + a. Obtain the OLdap port number: + + #. Log in to FusionInsight Manager, choose **System** > **OMS** > **oldap** > **Modify Configuration**. + #. The value of **LDAP Listening Port** is the OLdap port. + + b. Obtain the domain name: + + #. Log in to FusionInsight Manager, choose **System** > **Permission** > **Domain and Mutual Trust**. + + #. The value of **Local Domain** is the domain name. + + For example, the domain name of the current system is **9427068F-6EFA-4833-B43E-60CB641E5B6C.COM**. + + c. Run the following command on each node in the cluster as user **omm** to query the number of password authentication failures: + + **ldapsearch -H ldaps://**\ *OMS Floating IP Address*\ **:**\ *OLdap port* **-LLL -x -D cn=root,dc=hadoop,dc=com -b krbPrincipalName=**\ *Internal system username*\ **@**\ *Domain name*\ **,cn=**\ *Domain name*\ **,cn=krbcontainer,dc=hadoop,dc=com -w** *Password of LDAP administrator* **-e ppolicy \| grep krbLoginFailedCount** + + For example, run the following command to check the number of password authentication failures for user **oms/manager**: + + **ldapsearch -H ldaps://10.5.146.118:21750 -LLL -x -D cn=root,dc=hadoop,dc=com -b krbPrincipalName=oms/manager@9427068F-6EFA-4833-B43E-60CB641E5B6C.COM,cn=9427068F-6EFA-4833-B43E-60CB641E5B6C.COM,cn=krbcontainer,dc=hadoop,dc=com -w** *Password of user cn=root,dc=hadoop,dc=com* **-e ppolicy \| grep krbLoginFailedCount** + + .. code-block:: + + krbLoginFailedCount: 5 + + d. Log in to FusionInsight Manager, choose **System** > **Permission** > **Security Policy** > **Password Policy**. + + e. Check the value of the **Password Retries** parameter. If the value is less than or equal to the value of **krbLoginFailedCount**, the user is locked. + + .. note:: + + You can also check whether internal users are locked by viewing operation logs. + +#.
Log in to the active management node as user **omm** and run the following command to unlock the user: + + **sh ${BIGDATA_HOME}/om-server/om/share/om/acs/config/unlockuser.sh --userName** *Internal system username* + + Example: **sh ${BIGDATA_HOME}/om-server/om/share/om/acs/config/unlockuser.sh --userName oms/manager** diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/logging_in_to_a_non-cluster_node_using_a_cluster_user_in_normal_mode.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/logging_in_to_a_non-cluster_node_using_a_cluster_user_in_normal_mode.rst new file mode 100644 index 0000000..26b3c84 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/logging_in_to_a_non-cluster_node_using_a_cluster_user_in_normal_mode.rst @@ -0,0 +1,93 @@ +:original_name: admin_guide_000248.html + +.. _admin_guide_000248: + +Logging In to a Non-Cluster Node Using a Cluster User in Normal Mode +==================================================================== + +Scenario +-------- + +When the cluster is installed in normal mode, the component clients do not support security authentication and cannot use the **kinit** command. Therefore, nodes outside the cluster cannot use users in the cluster by default. This may result in a user authentication failure when one of these nodes accesses a component server. + +The node administrator can create an OS user with the same name as a cluster user on a node outside the cluster, allow that user to log in to the node using the SSH protocol, and then connect to the servers of components in the cluster as the user who logs in to the OS. + +Prerequisites +------------- + +- Nodes outside the cluster can connect to the service plane of the cluster. +- The KrbServer service of the cluster is running properly. +- You have obtained the password of user **root** of the node outside the cluster. +- A human-machine user has been planned and added to the cluster, and you have obtained the authentication credential file. For details, see :ref:`Creating a User ` and :ref:`Exporting an Authentication Credential File `. + +Procedure +--------- + +#. Log in to the node where a user is to be added as user **root**. + +#. Run the following commands: + + **rpm -qa \| grep pam** and **rpm -qa \| grep krb5-client** + + The following RPM packages are displayed: + + .. code-block:: + + pam_krb5-32bit-2.3.1-47.12.1 + pam-modules-32bit-11-1.22.1 + yast2-pam-2.17.3-0.5.211 + pam-32bit-1.1.5-0.10.17 + pam_mount-32bit-0.47-13.16.1 + pam-config-0.79-2.5.58 + pam_krb5-2.3.1-47.12.1 + pam-doc-1.1.5-0.10.17 + pam-modules-11-1.22.1 + pam_mount-0.47-13.16.1 + pam_ldap-184-147.20 + pam-1.1.5-0.10.17 + krb5-client-1.6.3 + +#. Check whether the RPM packages in the list are installed in the OS. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`4 `. + +#. .. _admin_guide_000248__en-us_topic_0046736680_inst_kerb: + + Obtain the missing RPM packages from the OS image, upload them to the current directory, and run the following command to install them: + + **rpm -ivh \*.rpm** + + .. note:: + + The RPM packages to be installed may introduce security risks. The risks brought by installing these packages must be taken into consideration during OS hardening.
+ + After the RPM packages are installed, go to :ref:`5 `. + +#. .. _admin_guide_000248__en-us_topic_0046736680_conf_kerb: + + Run the following command to configure Kerberos authentication on PAM: + + **pam-config --add --krb5** + + .. note:: + + If you need to cancel Kerberos authentication and system user login on a non-cluster node, run the **pam-config --delete --krb5** command as user **root**. + +#. Decompress the authentication credential file to obtain **krb5.conf**, use WinSCP to upload this configuration file to the **/etc** directory on the node outside the cluster, and run the following command to configure related permission to enable other users to access the file, such as permission **604**: + + **chmod 604 /etc/krb5.conf** + +#. Run the following command in the connection session as user **root** to add the corresponding OS user to the human-machine user, and specify **root** as the primary group. + + The OS user password is the same as the initial password when the human-machine user is created on Manager. + + **useradd** *User name* **-m -d /home/admin_test -g root -s /bin/bash** + + For example, if the name of the human-machine user is **admin_test**, run the following command: + + **useradd admin_test -m -d /home/admin_test -g root -s /bin/bash** + + .. note:: + + When you use the newly added OS user to log in to the node by using the SSH protocol for the first time, the system prompts that the password has expired after you enter the user password, and the system prompts that the password needs to be changed after you enter the user password again. You need to enter a new password that meets the password complexity requirements of both the node OS and the cluster. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/unlocking_ldap_users_and_management_accounts.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/unlocking_ldap_users_and_management_accounts.rst new file mode 100644 index 0000000..18d080d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/account_security_settings/unlocking_ldap_users_and_management_accounts.rst @@ -0,0 +1,40 @@ +:original_name: admin_guide_000245.html + +.. _admin_guide_000245: + +Unlocking LDAP Users and Management Accounts +============================================ + +Scenario +-------- + +If the LDAP user **cn=pg_search_dn,ou=Users,dc=hadoop,dc=com** and LDAP management accounts **cn=krbkdc,ou=Users,dc=hadoop,dc=com** and **cn=krbadmin,ou=Users,dc=hadoop,dc=com** are locked, the administrator must unlock these accounts. + +.. note:: + + If you input an incorrect password for the LDAP user or management account for five consecutive times, the LDAP user or management account is locked. The account is automatically unlocked after 5 minutes. + +Procedure +--------- + +#. Log in to the active management node as user **omm**. + +#. Run the following command to go to the related directory: + + **cd ${BIGDATA_HOME}/om-server/om/ldapserver/ldapserver/local/script** + +#. Run the following command to unlock the LDAP user or management account: + + **./ldapserver_unlockUsers.sh** *USER_NAME* + + In the command, *USER_NAME* indicates the name of the user to be unlocked. 
+ + For example, to unlock the LDAP management **account cn=krbkdc,ou=Users,dc=hadoop,dc=com**, run the following command: + + **./ldapserver_unlockUsers.sh krbkdc** + + After the script is executed, enter the password of user **krbkdc** after **ROOT_DN_PASSWORD**. If the following information is displayed, the account is successfully unlocked. + + .. code-block:: + + Unlock user krbkdc successfully. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_for_a_component_database_user.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_for_a_component_database_user.rst new file mode 100644 index 0000000..da7b9b2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_for_a_component_database_user.rst @@ -0,0 +1,46 @@ +:original_name: admin_guide_000261.html + +.. _admin_guide_000261: + +Changing the Password for a Component Database User +=================================================== + +Scenario +-------- + +It is recommended that the administrator periodically change the password for each component database user to improve the system O&M security. + +.. note:: + + This section applies only to MRS 3.1.0. For versions later than MRS 3.1.0, see :ref:`Resetting the Component Database User Password `. + +Impact on the System +-------------------- + +The services need to be restarted for the new password to take effect. The services are unavailable during the restart. + +Procedure +--------- + +#. On FusionInsight Manager, click **Cluster**, click the name of the desired cluster, and click **Services**. + +#. Determine the component database user whose password is to be changed. + + For details about how to change the password of database user **omm** of DBService, perform operations in :ref:`Changing the Password for User omm in DBService `. To change the passwords of database users of other components, you need to stop services first and then perform the operations in :ref:`3 `. + +#. .. _admin_guide_000261__li12756790114651: + + Click the service whose database user password is to be changed, and choose **More** > **Change Database Password**. On the displayed page, enter the password of the current login user and click **OK**. + +#. Enter the old and new passwords as prompted. + + The password complexity requirements are as follows: + + - The database user password contains 8 to 32 characters. + - The password contains at least three types of the following: uppercase letters, lowercase letters, digits, and special characters which can only be (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be the same as the last 20 historical passwords. + +#. Select "I have read the information and understand the impact" and click **OK**. + +#. After the password is changed, choose **More** > **Restart Service**. In the displayed dialog box, enter the password of the current login user, click **OK**, and select **Restart the upper-layer services**. Click **OK** to restart the services. 
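The complexity rules above are enforced by FusionInsight Manager when you submit the new password in the dialog box. If you want to pre-check a candidate password locally before entering it, the following Bash sketch restates the locally checkable rules; the script name and the simplified special-character check are assumptions for illustration only, and the historical-password rule can only be verified by the system itself.

.. code-block:: bash

   #!/bin/bash
   # check_db_passwd.sh (hypothetical helper): locally validate a candidate
   # component database password against the documented complexity rules.
   user="$1"; pass="$2"

   # Rule 1: the password contains 8 to 32 characters.
   len=${#pass}
   (( len >= 8 && len <= 32 )) || { echo "FAIL: must contain 8 to 32 characters"; exit 1; }

   # Rule 2: at least three of the four character types. The special-character
   # class is approximated here as "anything that is not a letter or digit";
   # the product additionally restricts which special characters are allowed.
   types=0
   [[ $pass == *[A-Z]* ]] && types=$((types + 1))
   [[ $pass == *[a-z]* ]] && types=$((types + 1))
   [[ $pass == *[0-9]* ]] && types=$((types + 1))
   [[ $pass == *[^a-zA-Z0-9]* ]] && types=$((types + 1))
   (( types >= 3 )) || { echo "FAIL: needs at least three character types"; exit 1; }

   # Rule 3: not the username or the username spelled backwards.
   [[ $pass != "$user" && $pass != "$(echo "$user" | rev)" ]] || { echo "FAIL: matches username"; exit 1; }

   echo "OK: candidate password passes the local checks"

For example, **bash check_db_passwd.sh omm 'Example#Passw0rd'** prints **OK** when the candidate satisfies the three locally checkable rules.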
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_for_the_data_access_user_of_the_oms_database.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_for_the_data_access_user_of_the_oms_database.rst new file mode 100644 index 0000000..142e6dd --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_for_the_data_access_user_of_the_oms_database.rst @@ -0,0 +1,42 @@ +:original_name: admin_guide_000260.html + +.. _admin_guide_000260: + +Changing the Password for the Data Access User of the OMS Database +================================================================== + +Scenario +-------- + +It is recommended that the administrator periodically change the password of the user accessing the OMS database to improve the system O&M security. + +Impact on the System +-------------------- + +The OMS service needs to be restarted for the new password to take effect. The service is unavailable during the restart. + +Procedure +--------- + +#. On FusionInsight Manager, choose **System** > **OMS** > **gaussDB** > **Change Password**. + +#. Locate the row where user **omm** is located and click **Change Password** in the **Operation** column. + +#. In the displayed window, enter the password of the current login user and click **OK**. + +#. Enter the old and new passwords as prompted. + + The password complexity requirements are as follows: + + - The password must contain 8 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, and special characters which can only be (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be the same as the last 20 historical passwords. + +#. Click **OK**. Wait until the system displays a message indicating that the operation is successful. + +#. Locate the row where user **omm** is located and click **Restart OMS Service** in the **Operation** column. + +#. In the displayed window, enter the password of the current login user and click **OK**. + +#. In the displayed restart confirmation dialog box, click **OK** to restart the OMS service. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_for_user_omm_in_dbservice.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_for_user_omm_in_dbservice.rst new file mode 100644 index 0000000..5feb0eb --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_for_user_omm_in_dbservice.rst @@ -0,0 +1,41 @@ +:original_name: admin_guide_000354.html + +.. _admin_guide_000354: + +Changing the Password for User omm in DBService +=============================================== + +#. Log in to the active DBService node as user **root**. + + .. 
note:: + + The password of user **omm** for the DBService database cannot be changed on the standby DBService node. Change the password on the active DBService node only. + +#. Run the following command to switch to another user: + + **su - omm** + +#. Run the following command to go to the related directory: + + **source $DBSERVER_HOME/.dbservice_profile** + + **cd** **${DBSERVICE_SOFTWARE_DIR}/\ sbin/** + +#. Run the following command to change the password of user **omm**: + + **sh modifyDBPwd.sh** + +#. Enter the old password of user **omm** and enter a new password twice. + + The password complexity requirements are as follows: + + - The password contains 8 to 32 characters. + - The password contains at least three types of the following: uppercase letters, lowercase letters, digits, and special characters which can only be (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be the same as the last 20 historical passwords. + + If the following information is displayed, the password is changed successfully. + + .. code-block:: + + Successful to modify password. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_of_the_oms_database_administrator.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_of_the_oms_database_administrator.rst new file mode 100644 index 0000000..cd53dc1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/changing_the_password_of_the_oms_database_administrator.rst @@ -0,0 +1,47 @@ +:original_name: admin_guide_000259.html + +.. _admin_guide_000259: + +Changing the Password of the OMS Database Administrator +======================================================= + +Scenario +-------- + +It is recommended that the administrator periodically change the password of the OMS database administrator to improve the system O&M security. + +Procedure +--------- + +#. Log in to the active management node as user **root**. + + .. note:: + + The password of user **ommdba** cannot be changed on the standby management node. Otherwise, the cluster may not work properly. Change the password on the active management node only. + +#. Run the following command to switch to another user: + + **su - omm** + +#. Run the following command to go to the related directory: + + **cd $OMS_RUN_PATH/tools** + +#. Run the following command to change the password for user **ommdba**: + + **mod_db_passwd ommdba** + +#. Enter the old password of user **ommdba** and enter a new password twice. + + The password complexity requirements are as follows: + + - The password contains 16 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, and special characters which can only be (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be the same as the last 20 historical passwords. + + If the following information is displayed, the password is changed successfully. + + .. code-block:: + + Congratulations, update [ommdba] password successfully. 
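For quick reference, the procedure above can be run as a single console session on the active management node. The block below only condenses the documented steps; prompts and the success message may vary slightly between versions.

.. code-block:: bash

   # Log in to the active management node as user root, then:
   su - omm                  # switch to user omm, which owns the OMS tools
   cd $OMS_RUN_PATH/tools    # directory that contains the password tool
   mod_db_passwd ommdba      # change the password of OMS database administrator ommdba
   # Enter the old password of ommdba, then the new password twice when prompted.
   # On success the tool prints: Congratulations, update [ommdba] password successfully.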
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/index.rst new file mode 100644 index 0000000..972621a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/index.rst @@ -0,0 +1,22 @@ +:original_name: admin_guide_000258.html + +.. _admin_guide_000258: + +Changing the Password for a Database User +========================================= + +- :ref:`Changing the Password of the OMS Database Administrator ` +- :ref:`Changing the Password for the Data Access User of the OMS Database ` +- :ref:`Changing the Password for a Component Database User ` +- :ref:`Resetting the Component Database User Password ` +- :ref:`Changing the Password for User omm in DBService ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + changing_the_password_of_the_oms_database_administrator + changing_the_password_for_the_data_access_user_of_the_oms_database + changing_the_password_for_a_component_database_user + resetting_the_component_database_user_password + changing_the_password_for_user_omm_in_dbservice diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/resetting_the_component_database_user_password.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/resetting_the_component_database_user_password.rst new file mode 100644 index 0000000..39299a3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_database_user/resetting_the_component_database_user_password.rst @@ -0,0 +1,41 @@ +:original_name: admin_guide_000363.html + +.. _admin_guide_000363: + +Resetting the Component Database User Password +============================================== + +Scenario +-------- + +Default passwords for components in the MRS cluster to connect to the DBService database are random. You are advised to periodically reset the passwords of component database users to improve system O&M security. + +.. note:: + + This section applies only to MRS 3.1.2 or later. For versions earlier than MRS 3.1.2, see :ref:`Changing the Password for a Component Database User `. + +Impact on the System +-------------------- + +To reset passwords, you need to stop and then restart services, during which services are unavailable. + +Procedure +--------- + +#. On FusionInsight Manager, click **Cluster**, click the name of the desired cluster, and click **Services**. + +#. Click the name of the service whose database user password is to be reset, for example, **Kafka**, and click **Stop Service** on the **Dashboard** page. + + In the displayed dialog box, enter the password of the current login user and click **OK**. + + After confirming the impact of stopping the service, wait until the service is stopped. + +#. On the **Dashboard** page, choose **More** > **Reset Database Password**. + + In the displayed dialog box, enter the password of the current login user and click **OK**. + + Select "I have read the information and understand the impact", and click **OK**. + +#. 
After the password is reset, click **Start Service** on the **Dashboard** page. + +#. In the displayed dialog box, click **OK** and wait until the service is started. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_a_component_running_user.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_a_component_running_user.rst new file mode 100644 index 0000000..b9e0811 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_a_component_running_user.rst @@ -0,0 +1,86 @@ +:original_name: admin_guide_000257.html + +.. _admin_guide_000257: + +Changing the Password for a Component Running User +================================================== + +Scenario +-------- + +It is recommended that the administrator periodically change the password for each component running user to improve the system O&M security. + +Component running users can be classified into the following two types depending on whether their initial passwords are randomly generated by the system: + +- If the initial password of a component running user is randomly generated by the system, the user is of the machine-machine type. +- If the initial password of a component running user is not randomly generated by the system, the user is of the human-machine type. + +Impact on the System +-------------------- + +If the initial password is randomly generated by the system, the cluster needs to be restarted for the password changing to take effect. Services are unavailable during the restart. + +Prerequisites +------------- + +You have installed the client on any node in the cluster and obtained the IP address of the node. + +Procedure +--------- + +#. Log in to the node where the client is installed as the client installation user + +#. Run the following command to switch to the client directory, for example, **/opt/client**: + + **cd /opt/client** + +#. Run the following command to set environment variables: + + **source bigdata_env** + +#. Run the following command and enter the password of user **kadmin/admin** to log in to the **kadmin** console: + + **kadmin -p kadmin/admin** + + .. note:: + + The default password of user **kadmin/admin**, **Admin@123**, will expire upon your first login. Change the password as prompted and keep the new password secure. + +#. Run the following command to change the password of an internal component running user. The password changing takes effect on all servers. + + **cpw** *Internal system username* + + For example: **cpw oms/manager** + + The password must meet the following complexity requirements by default: + + - The password contains at least 8 characters. + - The password contains at least four types of the following characters: Uppercase letters, lowercase letters, digits, spaces, and special characters which can only be :literal:`~`!?,.;-_'(){}[]/<>@#$%^&*+|\\=.` + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be a common easily-cracked passwords, for example, **Admin@12345**. + - The password cannot be the same as the password used in latest *N* times. 
*N* indicates the value of **Number of Historical Passwords** configured in :ref:`Configuring Password Policies `. This policy applies to only human-machine accounts. + + .. note:: + + Run the following command to check user information: + + **getprinc** *Internal system username* + + For example: **getprinc oms/manager** + +#. Determine the type of the user whose password needs to be changed. + + - If the user is a machine-machine user, go to :ref:`7 `. + - If the user is a human-machine user, the password is changed successfully and no further action is required. + +#. .. _admin_guide_000257__li22669737114334: + + Log in to FusionInsight Manager. + +#. Click **Cluster**, click the name of the desired cluster, and choose **More** > **Restart**. + +#. In the displayed window, enter the password of the current login user and click **OK**. + +#. In the displayed restart confirmation dialog box, click **OK**. + +#. Wait for message "Operation successful" to display. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_the_kerberos_administrator.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_the_kerberos_administrator.rst new file mode 100644 index 0000000..97efc19 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_the_kerberos_administrator.rst @@ -0,0 +1,43 @@ +:original_name: admin_guide_000253.html + +.. _admin_guide_000253: + +Changing the Password for the Kerberos Administrator +==================================================== + +Scenario +-------- + +It is recommended that the administrator periodically change the password of Kerberos administrator **kadmin** to improve the system O&M security. + +If the user password is changed, the OMS Kerberos administrator password is changed as well. + +Prerequisites +------------- + +You have installed the client on any node in the cluster and obtained the IP address of the node. + +Procedure +--------- + +#. Log in to the node where the client is installed as user **root**. + +#. Run the following command to go to the client directory, for example, **/opt/hadoopclient**: + + **cd /opt/hadoopclient** + +#. Run the following command to set environment variables: + + **source bigdata_env** + +#. Run the following command to change the password for **kadmin/admin**. The password changing takes effect on all servers. + + **kpasswd kadmin/admin** + + The password must meet the following complexity requirements by default: + + - The password contains at least 8 characters. + - The password contains at least four types of the following characters: Uppercase letters, lowercase letters, digits, spaces, and special characters which can only be :literal:`~`!?,.;-_'(){}[]/<>@#$%^&*+|\\=.` + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be a common easily-cracked passwords, for example, **Admin@12345**. + - The password cannot be the same as the password used in the last *N* times. *N* indicates the value of **Repetition Rule** in :ref:`Configuring Password Policies `. 
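For reference, the client-side steps above correspond to the following console session. The directory **/opt/hadoopclient** is only the example client path used in this section; replace it with the actual client installation directory.

.. code-block:: bash

   # On the node where the cluster client is installed, as user root:
   cd /opt/hadoopclient    # example client installation directory
   source bigdata_env      # load the client environment variables
   kpasswd kadmin/admin    # change the Kerberos administrator password; takes effect on all servers
   # Enter the current password of kadmin/admin, then the new password twice when prompted.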
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_the_ldap_administrator.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_the_ldap_administrator.rst new file mode 100644 index 0000000..81d020c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_the_ldap_administrator.rst @@ -0,0 +1,69 @@ +:original_name: admin_guide_000256.html + +.. _admin_guide_000256: + +Changing the Password for the LDAP Administrator +================================================ + +Scenario +-------- + +It is recommended that the administrator periodically change the passwords of LDAP administrator accounts **cn=krbkdc,ou=Users,dc=hadoop,dc=com** and **cn=krbadmin,ou=Users,dc=hadoop,dc=com** to improve the system O&M security. + +Impact on the System +-------------------- + +- You need to restart the KrbServer service after changing the password. + +- After the password is changed, check whether the LDAP administrator accounts **cn=krbkdc,ou=Users,dc=hadoop,dc=com** and **cn=krbadmin,ou=Users,dc=hadoop,dc=com** are locked. Run the following command on the active management node of the cluster to check whether **krbkdc** is locked (the method for user **krbadmin** is similar): + + .. note:: + + To obtain the OLdap port number: + + #. Log in to FusionInsight Manager, choose **System** > **OMS** > **oldap** > **Modify Configuration**. + #. The value of **LDAP Listening Port** is the OLdap port. + + **ldapsearch -H ldaps://**\ *OMS floating IP address*\ **:**\ *OLdap port* **-LLL -x -D** **cn=krbkdc,ou=Users,dc=hadoop,dc=com -W -b cn=krbkdc,ou=Users,dc=hadoop,dc=com -e ppolicy** + + Enter the password of the LDAP administrator account **krbkdc**. If the following message is displayed, the account is locked. For details about how to unlock the account, see :ref:`Unlocking LDAP Users and Management Accounts `. + + .. code-block:: + + ldap_bind: Invalid credentials (49); Account locked + +Prerequisites +------------- + +You have obtained the management node IP address. + +Procedure
--------- + +#. Log in to the active management node as user **omm** using the IP address of the active management node. + +#. Run the following command to go to the related directory: + + **cd ${BIGDATA_HOME}/om-server/om/meta-0.0.1-SNAPSHOT/kerberos/scripts** + +#. Run the following command to change the password of the LDAP administrator account: + + **./okerberos_modpwd.sh** + + Enter the old password and then enter a new password twice. + + The password complexity requirements are as follows: + + - The password contains 16 to 32 characters. + - The password contains at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters which can only be :literal:`\`~!@#$%^&*()-_=+|[{}];,<.>/?.` + - The password cannot be the same as the current password. + + If the following information is displayed, the password is changed successfully. + + .. code-block:: + + Modify kerberos server password successfully. + +#. Log in to FusionInsight Manager, click **Cluster**, click the name of the desired cluster, and choose **Services** > **KrbServer**.
On the displayed page, choose **More** > **Restart Service**. + + Enter the password and do not select **Restart upper-layer services**. Click **OK** to restart the KrbServer service. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_the_oms_kerberos_administrator.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_the_oms_kerberos_administrator.rst new file mode 100644 index 0000000..e77865d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_password_for_the_oms_kerberos_administrator.rst @@ -0,0 +1,38 @@ +:original_name: admin_guide_000254.html + +.. _admin_guide_000254: + +Changing the Password for the OMS Kerberos Administrator +======================================================== + +Scenario +-------- + +It is recommended that the administrator periodically change the password of OMS Kerberos administrator **kadmin** to improve the system O&M security. + +If the user password is changed, the Kerberos administrator password is changed as well. + +Procedure +--------- + +#. Log in to any management node in the cluster as user **omm**. + +#. Run the following command to go to the related directory: + + **cd ${BIGDATA_HOME}/om-server/om/meta-0.0.1-SNAPSHOT/kerberos/scripts** + +#. Run the following command to set environment variables: + + **source component_env** + +#. Run the following command to change the password for **kadmin/admin**. The password changing takes effect on all servers. + + **kpasswd kadmin/admin** + + The password must meet the following complexity requirements by default: + + - The password contains at least 8 characters. + - The password contains at least four types of the following characters: uppercase letters, lowercase letters, digits, and special characters can only be :literal:`~`!?,.:;-_'(){}[]/<>@#$%^&*+|\\=.` + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be a common easily-cracked passwords, for example, **Admin@12345**. + - The password cannot be the same as the password used in the last *N* times. *N* indicates the value of **Repetition Rule** in :ref:`Configuring Password Policies `. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_passwords_of_the_ldap_administrator_and_the_ldap_user_including_oms_ldap.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_passwords_of_the_ldap_administrator_and_the_ldap_user_including_oms_ldap.rst new file mode 100644 index 0000000..f9c5241 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/changing_the_passwords_of_the_ldap_administrator_and_the_ldap_user_including_oms_ldap.rst @@ -0,0 +1,67 @@ +:original_name: admin_guide_000255.html + +.. 
_admin_guide_000255: + +Changing the Passwords of the LDAP Administrator and the LDAP User (Including OMS LDAP) +======================================================================================= + +Scenario +-------- + +It is recommended that the administrator periodically changes the passwords of LDAP administrator **cn=root,dc=hadoop,dc=com** and LDAP user **cn=pg_search_dn,ou=Users,dc=hadoop,dc=com** to improve the system O&M security. + +If the passwords are changed, the password of the OMS LDAP administrator or user is changed as well. + +.. note:: + + If the cluster is upgraded from an early version to a latest version, the LDAP administrator password will inherit the password policy of the old cluster. To ensure system security, you are advised to change the password after the cluster upgrade. + +Impact on the System +-------------------- + +- Changing the user password of the LdapServer service is a high-risk operation and requires restarting the KrbServer and LdapServer services. If KrbServer is restarted, users may fail to be queried by running the **id** command on nodes in the cluster temporarily. Therefore, exercise caution when restarting KrbServer. +- After the password of LDAP user **cn=pg_search_dn,ou=Users,dc=hadoop,dc=com** is changed, the user may be locked in the LDAP component. Therefore, you are advised to unlock the user after changing the password. For details about how to unlock the user, see :ref:`Unlocking LDAP Users and Management Accounts `. + +Prerequisites +------------- + +Before changing the password of LDAP user **cn=pg_search_dn,ou=Users,dc=hadoop,dc=com**, ensure that the user is not locked by running the following command on the active management node of the cluster: + +.. note:: + + To query the OLdap port number, perform the following steps: + + #. Log in to FusionInsight Manager, choose **System** > **OMS** > **oldap** > **Modify Configuration**: + #. The value of **LDAP Service Listening Port** is the OLDAP port. + +**ldapsearch -H ldaps://**\ *Floating IP address of OMS:OLDAP port*\ **-LLL -x -D** **cn=pg_search_dn,ou=Users,dc=hadoop,dc=com -W -b** **cn=pg_search_dn,ou=Users,dc=hadoop,dc=com -e ppolicy** + +Enter the password of the LDAP user **pg_search_dn**. If the following information is displayed, the user is locked. In this case, unlock the user. For details, see :ref:`Unlocking LDAP Users and Management Accounts `. + +.. note:: + + The password of the LDAP user **pg_search_dn** is randomly generated by the system. You can obtain the password from the **/etc/sssd/sssd.conf or /etc/ldap.conf** file on the active node. + +.. code-block:: + + ldap_bind: Invalid credentials (49); Account locked + +Procedure +--------- + +#. Log in to FusionInsight Manager, click **Cluster**, click the name of the desired cluster, and choose **Service** > **LdapServer**. + +#. Choose **More** > **Change Database Password**. In the displayed dialog box, enter the password of the current login user and click **OK**. + +#. In the **Change Password** dialog box, select the user whose password to be modified in the **User Information** drop-down box. + +#. Enter the old password in the **Old Password** text box, and enter the new password in the **New Password** and **Confirm Password** text boxes. + + The password must meet the following complexity requirements by default: + + - The password contains 16 to 32 characters. 
+ - The password contains at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters which can only be :literal:`\`~!@#$%^&*()-_=+|[{}];,<.>/?.` + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be the same as the current password. + +#. Select "I have read the information and understood the impact" and click **OK** to confirm the modification and restart the service. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/index.rst new file mode 100644 index 0000000..33ef2ad --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_internal_user/index.rst @@ -0,0 +1,22 @@ +:original_name: admin_guide_000252.html + +.. _admin_guide_000252: + +Changing the Password for a System Internal User +================================================ + +- :ref:`Changing the Password for the Kerberos Administrator ` +- :ref:`Changing the Password for the OMS Kerberos Administrator ` +- :ref:`Changing the Passwords of the LDAP Administrator and the LDAP User (Including OMS LDAP) ` +- :ref:`Changing the Password for the LDAP Administrator ` +- :ref:`Changing the Password for a Component Running User ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + changing_the_password_for_the_kerberos_administrator + changing_the_password_for_the_oms_kerberos_administrator + changing_the_passwords_of_the_ldap_administrator_and_the_ldap_user_including_oms_ldap + changing_the_password_for_the_ldap_administrator + changing_the_password_for_a_component_running_user diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_user/changing_the_password_for_an_os_user.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_user/changing_the_password_for_an_os_user.rst new file mode 100644 index 0000000..6362bb1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_user/changing_the_password_for_an_os_user.rst @@ -0,0 +1,42 @@ +:original_name: admin_guide_000251.html + +.. _admin_guide_000251: + +Changing the Password for an OS User +==================================== + +Scenario +-------- + +During FusionInsight Manager installation, the system automatically creates user **omm** and **ommdba** on each node in the cluster. Periodically change the login passwords of the OS users **omm** and **ommdba** of the cluster node to improve the system O&M security. + +The passwords of users **omm** and **ommdba** of the nodes can be different. + +Prerequisites +------------- + +- You have obtained the IP address of the node where the passwords of users **omm** and **ommdba** are to be changed. +- You have obtained the password of user **root** before changing the passwords of users **omm** and **ommdba**. + +Changing the Password of an OS User +----------------------------------- + +#. Log in to the node where the password is to be changed as user **root**. 
+ +#. Run the following command to change the user password: + + **passwd** *ommdba* + + Red Hat system displays the following information: + + .. code-block:: + + Changing password for user ommdba. + New password: + +#. Enter a new password. The policy for changing the password of an OS user varies according to the OS that is actually used. + + .. code-block:: + + Retype New Password: + Password changed. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_user/changing_the_password_for_user_admin.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_user/changing_the_password_for_user_admin.rst new file mode 100644 index 0000000..e2ca247 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_user/changing_the_password_for_user_admin.rst @@ -0,0 +1,32 @@ +:original_name: admin_guide_000250.html + +.. _admin_guide_000250: + +Changing the Password for User admin +==================================== + +Scenario +-------- + +User **admin** is the system administrator account of FusionInsight Manager. You are advised to periodically change the password on FusionInsight Manager to improve system security. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + + User **admin** is required for login. + +#. Move the cursor to **Hello, admin** in the upper right corner of the page. + + In the displayed menu, click **Change Password**. + +#. Set **Old Password**, **New Password**, and **Confirm Password**, and click **OK**. + + The password must meet the following complexity requirements by default: + + - The password contains 8 to 64 characters. + - The password contains at least four types of the following characters: Uppercase letters, lowercase letters, digits, spaces, and special characters which can only be :literal:`~`!?,.;-_'(){}[]/<>@#$%^&*+|\\=.` + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be a common easily-cracked passwords. + - The password cannot be the same as the password used in the last *N* times. *N* indicates the value of **Repetition Rule** in :ref:`Configuring Password Policies `. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_user/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_user/index.rst new file mode 100644 index 0000000..21bf06e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/changing_the_password_for_a_system_user/index.rst @@ -0,0 +1,16 @@ +:original_name: admin_guide_000249.html + +.. _admin_guide_000249: + +Changing the Password for a System User +======================================= + +- :ref:`Changing the Password for User admin ` +- :ref:`Changing the Password for an OS User ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + changing_the_password_for_user_admin + changing_the_password_for_an_os_user diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/index.rst new file mode 100644 index 0000000..87df2e4 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/account_management/index.rst @@ -0,0 +1,20 @@ +:original_name: admin_guide_000242.html + +.. _admin_guide_000242: + +Account Management +================== + +- :ref:`Account Security Settings ` +- :ref:`Changing the Password for a System User ` +- :ref:`Changing the Password for a System Internal User ` +- :ref:`Changing the Password for a Database User ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + account_security_settings/index + changing_the_password_for_a_system_user/index + changing_the_password_for_a_system_internal_user/index + changing_the_password_for_a_database_user/index diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/index.rst new file mode 100644 index 0000000..36de579 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/index.rst @@ -0,0 +1,22 @@ +:original_name: admin_guide_000233.html + +.. _admin_guide_000233: + +Security Management +=================== + +- :ref:`Security Overview ` +- :ref:`Account Management ` +- :ref:`Security Hardening ` +- :ref:`Security Maintenance ` +- :ref:`Security Statement ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + security_overview/index + account_management/index + security_hardening/index + security_maintenance/index + security_statement diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_a_trusted_ip_address_to_access_ldap.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_a_trusted_ip_address_to_access_ldap.rst new file mode 100644 index 0000000..dc26005 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_a_trusted_ip_address_to_access_ldap.rst @@ -0,0 +1,173 @@ +:original_name: admin_guide_000274.html + +.. _admin_guide_000274: + +Configuring a Trusted IP Address to Access LDAP +=============================================== + +Scenario +-------- + +By default, the LDAP service deployed in the OMS and cluster can be accessed by any IP address. To enable the LDAP service to be accessed by only trusted IP addresses, you can configure the INPUT policy in the iptables filtering list. + +Impact on the System +-------------------- + +After the configuration, the LDAP service cannot be accessed by IP addresses that are not configured. Before the expansion, the added IP addresses need to be configured as trusted IP addresses. + +Prerequisites +------------- + +- You have collected the management plane IP addresses and service plane IP addresses of all nodes in the cluster and all floating IP addresses. +- You have obtained the **root** user account for all nodes in the cluster. + +Procedure +--------- + +**Configuring trusted IP addresses for the LDAP service on the OMS** + +#. 
Confirm the management node IP address. For details, see :ref:`Logging In to the Management Node `. + +#. Log in to FusionInsight Manager. For details, see :ref:`Logging In to FusionInsight Manager `. + +#. Choose **System** > **OMS** and choose **oldap** > **Modify Configuration** to view the OMS LDAP port number, that is, the value of **LDAP Listening Port**. The default port number is **21750**. + +#. Log in to the active management node as user **root** using the IP address of the active management node. + +#. .. _admin_guide_000274__li727167195016: + + Run the following command to check the INPUT policy in the iptables filtering list: + + **iptables -L** + + For example, if no rule is configured, the INPUT policy is displayed as follows: + + .. code-block:: + + Chain INPUT (policy ACCEPT) + target prot opt source destination + +#. Run the following command to configure all IP addresses used by the cluster as trusted IP addresses. Each IP address needs to be added independently. + + **iptables -A INPUT -s** *Trusted IP address* **-p tcp --dport** *Port number* **-j ACCEPT** + + For example, to configure **10.0.0.1** as a trusted IP address and enable it to access port **21750**, you need to run the following command: + + **iptables -A INPUT -s 10.0.0.1 -p tcp --dport 21750 -j ACCEPT** + +#. Run the following command to configure all IP addresses as untrusted IP addresses. The trusted IP addresses will not be affected by this rule. + + **iptables -A INPUT -p tcp --dport** *Port number* **-j DROP** + + For example, to disable all IP addresses to access port **21750**, run the following command: + + **iptables -A INPUT -p tcp --dport 21750 -j DROP** + +#. Run the following command to view the modified INPUT policy in the iptables filtering list: + + **iptables -L** + + For example, after a trusted IP address is configured, the INPUT policy is displayed as follows: + + .. code-block:: + + Chain INPUT (policy ACCEPT) + target prot opt source destination + ACCEPT tcp -- 10.0.0.1 anywhere tcp dpt:21750 + DROP tcp -- anywhere anywhere tcp dpt:21750 + +#. Run the following command to view the rules and rule numbers in the iptables filtering list: + + **iptables -L -n --line-number** + + .. code-block:: + + Chain INPUT (policy ACCEPT) + num target prot opt source destination + 1 DROP tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:21750 + +#. .. _admin_guide_000274__li28581039195016: + + Run the following command to delete the desired rule from the iptables filtering list based on site requirement: + + **iptables -D INPUT** *Number of the rule to be deleted* + + For example, to delete rule 1, run the following command: + + **iptables -D INPUT 1** + +#. Log in to the standby management node as user **root** using the standby IP address. Repeat :ref:`5 ` to :ref:`10 `. + +**Configuring trusted IP addresses for the LDAP service in the cluster** + +12. Log in to FusionInsight Manager. + +13. Click **Cluster**, click the name of the desired cluster, and choose **Service** > **LdapServer**. On the displayed page, click **Instance** to view the nodes where the LDAP services locate. + +14. Go to the **Configurations** page, and view the LDAP port number of the cluster, that is, the value of **LDAP_SERVER_PORT**. The default value is **21780**. + +15. Log in to the LDAP node as user **root** using the LDAP service IP address. + +16. .. 
_admin_guide_000274__li41253757195016: + + Run the following command to view the INPUT policy in the iptables filtering list: + + **iptables -L** + + For example, if no rule is configured, the INPUT policy is displayed as follows: + + .. code-block:: + + Chain INPUT (policy ACCEPT) + target prot opt source destination + +17. Run the following command to configure all IP addresses used by the cluster as trusted IP addresses. Each IP address needs to be added independently. + + **iptables -A INPUT -s** *Trusted IP address* **-p tcp --dport** *Port number* **-j ACCEPT** + + For example, to configure **10.0.0.1** as a trusted IP address and enable it to access port **21780**, you need to run the following command: + + **iptables -A INPUT -s 10.0.0.1 -p tcp --dport 21780 -j ACCEPT** + +18. Run the following command to configure all IP addresses as untrusted IP addresses. The trusted IP addresses will not be affected by this rule. + + **iptables -A INPUT -p tcp --dport** *Port number* **-j DROP** + + For example, to disable all IP addresses to access port **21780**, run the following command: + + **iptables -A INPUT -p tcp --dport 21780 -j DROP** + +19. Run the following command to view the modified INPUT policy in the iptables filtering list: + + **iptables -L** + + For example, after a trusted IP address is configured, the INPUT policy is displayed as follows: + + .. code-block:: + + Chain INPUT (policy ACCEPT) + target prot opt source destination + ACCEPT tcp -- 10.0.0.1 anywhere tcp dpt:21780 + DROP tcp -- anywhere anywhere tcp dpt:21780 + +20. Run the following command to view the rules and rule numbers in the iptables filtering list: + + **iptables -L -n --line-number** + + .. code-block:: + + Chain INPUT (policy ACCEPT) + num target prot opt source destination + 1 DROP tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:21780 + +21. .. _admin_guide_000274__li48007687195016: + + Run the following command to delete the desired rule from the iptables filtering list based on site requirement: + + **iptables -D INPUT** *Number of the rule to be deleted* + + For example, to delete rule 1, run the following command: + + **iptables -D INPUT 1** + +22. Log in to the LDAP node as user **root** using the IP address of another LDAP service, and repeat :ref:`16 ` to :ref:`21 `. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_an_ip_address_whitelist_for_modification_allowed_by_hbase.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_an_ip_address_whitelist_for_modification_allowed_by_hbase.rst new file mode 100644 index 0000000..2eadbc1 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_an_ip_address_whitelist_for_modification_allowed_by_hbase.rst @@ -0,0 +1,28 @@ +:original_name: admin_guide_000278.html + +.. _admin_guide_000278: + +Configuring an IP Address Whitelist for Modification Allowed by HBase +===================================================================== + +If the Replication function is enabled for HBase clusters, a protection mechanism for data modification is added on the standby HBase cluster to ensure data consistency between the active and standby clusters. Upon receiving an RPC request for data modification, the standby HBase cluster checks the permission of the user who sends the request (only HBase manage users have the modification permission). 
Then it checks the validity of the source IP address of the request. Only modification requests from IP addresses in the white list are accepted. The IP address white list is configured by the **hbase.replication.allowedIPs** item. + +Log in to FusionInsight Manager and choose **Cluster** > **Services** > **HBase**. Click **Configurations** and enter the parameter name in the search box. + +.. table:: **Table 1** Parameter description + + +------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +==============================+====================================================================================================================================================================================================================================+=======================+ + | hbase.replication.allowedIPs | Allows replication request processing from configured IP addresses only. It supports comma separated regex patterns. Each pattern can be any of the following: | N/A | + | | | | + | | - Regex pattern | | + | | | | + | | Example: 10.18.40.*, 10.18.*, 10.18.40.11 | | + | | | | + | | - Range pattern (Range can be specified only in the last octet) | | + | | | | + | | Example: 10.18.40.[10-20] | | + | | | | + | | If this item is empty (default value), the white list contains only the IP address of the RegionServer of the cluster, indicating that only modification requests from the RegionServer of the standby HBase cluster are accepted. | | + +------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_hadoop_security_parameters.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_hadoop_security_parameters.rst new file mode 100644 index 0000000..7a340fb --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_hadoop_security_parameters.rst @@ -0,0 +1,68 @@ +:original_name: admin_guide_000277.html + +.. _admin_guide_000277: + +Configuring Hadoop Security Parameters +====================================== + +Configuring Security Channel Encryption +--------------------------------------- + +The channels between components are not encrypted by default. You can set the following parameters to configure security channel encryption. + +Page access for setting parameters: On FusionInsight Manager, click **Cluster**, click the name of the desired cluster, click **Services**, and click the target service. On the displayed page, click **Configuration** and click **All Configurations**. Enter a parameter name in the search box. + +.. note:: + + Restart corresponding services for the modification to take effect after you modify configuration parameters. + +.. 
table:: **Table 1** Parameter description + + +-----------------+-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+ + | Service | Parameter | Description | Default Value | + +=================+=====================================+==================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+====================================+ + | HBase | hbase.rpc.protection | Indicates whether the HBase channels, including the remote procedure call (RPC) channels for HBase clients to access the HBase server and the RPC channels between the HMaster and RegionServer, are encrypted. If this parameter is set to **privacy**, the channels are encrypted and the authentication, integrity, and privacy functions are enabled. If this parameter is set to **integrity**, the channels are not encrypted and only the authentication and integrity functions are enabled. If this parameter is set to **authentication**, the channels are not encrypted, only packets are authenticated, and integrity and privacy are not required. | ``-`` | + | | | | | + | | | .. note:: | | + | | | | | + | | | The privacy mode encrypts transmitted content, including sensitive information such as user tokens, to ensure the security of the transmitted content. However, this mode has great impact on performance. Compared with the other two modes, this mode reduces read/write performance by about 60%. Modify the configuration based on the enterprise security requirements. The configuration items on the client and server must be the same. 
| | + +-----------------+-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+ + | HDFS | dfs.encrypt.data.transfer | Indicates whether the HDFS data transfer channels and the channels for clients to access HDFS are encrypted. The HDFS data transfer channels include the data transfer channels between DataNodes and the Data Transfer (DT) channels for clients to access DataNodes. The value **true** indicates that the channels are encrypted. The channels are not encrypted by default. | false | + +-----------------+-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+ + | HDFS | dfs.encrypt.data.transfer.algorithm | Indicates whether the HDFS data transfer channels and the channels for clients to access HDFS are encrypted. This parameter is valid only when **dfs.encrypt.data.transfer** is set to **true**. | 3des | + | | | | | + | | | The default value is **3des**, indicating that 3DES algorithm is used to encrypt data. The value can also be set to **rc4**. However, to avoid security risks, you are not advised to set the parameter to this value. | | + +-----------------+-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+ + | HDFS | hadoop.rpc.protection | Indicates whether the RPC channels of each module in Hadoop are encrypted. 
The channels include: | - Security mode: **privacy** | + | | | | - Normal mode: **authentication** | + | | | - RPC channels for clients to access HDFS | | + | | | - RPC channels between modules in HDFS, for example, between DataNode and NameNode | | + | | | - RPC channels for clients to access YARN | | + | | | - RPC channels between NodeManager and ResourceManager | | + | | | - RPC channels for Spark to access YARN and HDFS | | + | | | - RPC channels for MapReduce to access YARN and HDFS | | + | | | - RPC channels for HBase to access HDFS | | + | | | | | + | | | The default value is **privacy**, indicating encrypted transmission. The value **authentication** indicates that transmission is not encrypted. | | + | | | | | + | | | .. note:: | | + | | | | | + | | | You can set this parameter on the HDFS component configuration page. The parameter setting is valid globally, that is, the setting of whether the RPC channel is encrypted takes effect on all modules in Hadoop. | | + +-----------------+-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+ + +Setting the Maximum Number of Concurrent Web Connections +-------------------------------------------------------- + +To ensure web server reliability, new connections are rejected when the number of user connections reaches a specific threshold. This prevents DDOS attacks and service unavailability caused by too many users accessing the web server at the same time. + +Page access for setting parameters: On FusionInsight Manager, click **Cluster**, click the name of the desired cluster, click **Services**, and click the target service. On the displayed page, click **Configuration** and click **All Configurations**. Enter a parameter name in the search box. + +.. table:: **Table 2** Parameter description + + +-----------+--------------------------------+-------------------------------------------------------------------------------+---------------+ + | Service | Parameter | Description | Default Value | + +===========+================================+===============================================================================+===============+ + | HDFS/Yarn | hadoop.http.server.MaxRequests | Specifies the maximum number of concurrent web connections of each component. | 2000 | + +-----------+--------------------------------+-------------------------------------------------------------------------------+---------------+ + | Spark2x | spark.connection.maxRequest | Specifies the maximum number of request connections of JobHistory. 
| 5000 | + +-----------+--------------------------------+-------------------------------------------------------------------------------+---------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_hdfs_data_encryption_during_transmission.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_hdfs_data_encryption_during_transmission.rst new file mode 100644 index 0000000..d736404 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_hdfs_data_encryption_during_transmission.rst @@ -0,0 +1,60 @@ +:original_name: admin_guide_000282.html + +.. _admin_guide_000282: + +Configuring HDFS Data Encryption During Transmission +==================================================== + +Configuring HDFS Security Channel Encryption +-------------------------------------------- + +The channel between components is not encrypted by default. You can set parameters to enable security channel encryption. + +Navigation path for setting parameters: On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Configurations**. On the displayed page, click the **All Configurations** tab. Enter a parameter name in the search box. + +.. note:: + + After the configuration, restart the corresponding service for the settings to take effect. + +.. table:: **Table 1** Parameters + + +-----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+ + | Configuration Item | Description | Default Value | + +=========================================+=================================================================================================================================================================================================================================================================================================================================================================================+===========================================================================================================+ + | hadoop.rpc.protection | .. important:: | - Security mode: privacy | + | | | - Normal mode: authentication | + | | NOTICE: | | + | | | .. note:: | + | | - The setting takes effect only after the service is restarted. Rolling restart is not supported. | | + | | - After the setting, you need to download the client configuration file again. Otherwise, HDFS cannot provide the read and write services. | - **authentication**: indicates that only authentication is required. | + | | | - **integrity**: indicates that authentication and consistency check need to be performed. | + | | Indicates whether the RPC channels of each module in Hadoop are encrypted. The channels include: | - **privacy**: indicates that authentication, consistency check, and encryption need to be performed. 
| + | | | | + | | - RPC channels for clients to access HDFS | | + | | - RPC channels between modules in HDFS, for example, between DataNode and NameNode | | + | | - RPC channels for clients to access Yarn | | + | | - RPC channels between NodeManager and ResourceManager | | + | | - RPC channels for Spark to access Yarn and HDFS | | + | | - RPC channels for MapReduce to access Yarn and HDFS | | + | | - RPC channels for HBase to access HDFS | | + | | | | + | | .. note:: | | + | | | | + | | The setting takes effect globally, that is, the encryption attribute of the RPC channel of each module in the Hadoop takes effect. | | + +-----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+ + | dfs.encrypt.data.transfer | Indicates whether the HDFS data transfer channels and the channels for clients to access HDFS are encrypted. The HDFS data transfer channels include the data transfer channels between DataNodes and the Data Transfer (DT) channels for clients to access DataNodes. The value **true** indicates that the channels are encrypted. The channels are not encrypted by default. | false | + | | | | + | | .. note:: | | + | | | | + | | - This parameter is valid only when **hadoop.rpc.protection** is set to **privacy**. | | + | | - If a large amount of service data is transmitted, enabling encryption by default severely affects system performance. | | + | | - If data transmission encryption is configured for one cluster in the trusted cluster, the same data transmission encryption must be configured for the peer cluster. | | + +-----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+ + | dfs.encrypt.data.transfer.algorithm | Indicates the algorithm to encrypt the HDFS data transfer channels and the channels for clients to access HDFS. This parameter is valid only when **dfs.encrypt.data.transfer** is set to **true**. | 3des | + | | | | + | | .. note:: | | + | | | | + | | The default value is **3des**, indicating that 3DES algorithm is used to encrypt data. The value can also be set to **rc4**. However, to avoid security risks, you are not advised to set the parameter to this value. 
| | + +-----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+ + | dfs.encrypt.data.transfer.cipher.suites | This parameter can be left empty or set to **AES/CTR/NoPadding** to specify the cipher suite for data encryption. If this parameter is not specified, the encryption algorithm specified by **dfs.encrypt.data.transfer.algorithm** is used for data encryption. The default value is **AES/CTR/NoPadding**. | AES/CTR/NoPadding | + +-----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_kafka_data_encryption_during_transmission.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_kafka_data_encryption_during_transmission.rst new file mode 100644 index 0000000..1bbcdf6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/configuring_kafka_data_encryption_during_transmission.rst @@ -0,0 +1,38 @@ +:original_name: admin_guide_000281.html + +.. _admin_guide_000281: + +Configuring Kafka Data Encryption During Transmission +===================================================== + +Scenario +-------- + +Data between the Kafka client and the broker is transmitted in plain text. The Kafka client may be deployed in an untrusted network, exposing the transmitting data to leakage and tampering risks. + +Procedure +--------- + +The channel between components is not encrypted by default. You can set the following parameters to enable security channel encryption. + +Page access for setting parameters: On FusionInsight Manager, click **Cluster**, click the name of the desired cluster, and choose **Services** > **Kafka**. On the displayed page, click **Configuration** and click **All Configurations**. Enter a parameter name in the search box. + +.. note:: + + After the configuration, restart the corresponding service for the settings to take effect. + +:ref:`Table 1 ` describes the parameters related to transmission encryption on the Kafka server. + +.. _admin_guide_000281__en-us_topic_0046736711_d0e28839: + +.. 
table:: **Table 1** Parameters relevant to Kafka data encryption during transmission + + +--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+ + | Parameter | Description | Default Value | + +================================+=========================================================================================================================================================================================+================+ + | ssl.mode.enable | Indicates whether to enable the Secure Sockets Layer (SSL) protocol. If this parameter is set to **true**, services relevant to the SSL protocol are started during the broker startup. | false | + +--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+ + | security.inter.broker.protocol | Indicates communication protocol between brokers. The communication protocol can be PLAINTEXT, SSL, SASL_PLAINTEXT, or SASL_SSL. | SASL_PLAINTEXT | + +--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+ + +The SSL protocol can be configured for the server or client to encrypt transmission and communication only after **ssl.mode.enable** is set to **true** and broker enables the **SSL** and **SASL_SSL** protocols. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/encrypting_the_communication_between_the_controller_and_the_agent.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/encrypting_the_communication_between_the_controller_and_the_agent.rst new file mode 100644 index 0000000..0ab05f3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/encrypting_the_communication_between_the_controller_and_the_agent.rst @@ -0,0 +1,51 @@ +:original_name: admin_guide_000284.html + +.. _admin_guide_000284: + +Encrypting the Communication Between the Controller and the Agent +================================================================= + +Scenario +-------- + +After a cluster is installed, Controller and Agent need to communicate with each other. The Kerberos authentication is used during the communication. By default, the communication is not encrypted during the communication for the sake of cluster performance. Users who have demanding security requirements can use the method described in this section for encryption. + +Impact on the System +-------------------- + +- Controller and all Agents automatically restart, which interrupts FusionInsight Manager. +- The performance of management nodes deteriorates in large clusters. You are advised to enable the encryption function for clusters with a maximum of 200 nodes. + +Prerequisites +------------- + +You have obtained the IP addresses of the active and standby management nodes. + +Procedure +--------- + +#. Log in to the active management node as user **omm**. + +#. Run the following command to disable logout upon timeout: + + **TMOUT=0** + + .. 
note:: + + After the operations in this section are complete, run the **TMOUT=**\ *Timeout interval* command to restore the timeout interval in a timely manner. For example, **TMOUT=600** indicates that a user is logged out if the user does not perform any operation within 600 seconds. + +#. Run the following command to go to the related directory: + + **cd ${CONTROLLER_HOME}/sbin** + +#. Run the following command to enable communication encryption: + + **./enableRPCEncrypt.sh -t** + + Run the **sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh** command to check whether **ResHAStatus** of the active management node Controller is **Normal** and whether you can log in to FusionInsight Manager again. If yes, the enablement is successful. + +#. Run the following command to disable communication encryption when necessary: + + **./enableRPCEncrypt.sh -f** + + Run the **sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh** command to check whether **ResHAStatus** of the active management node Controller is **Normal** and whether you can log in to FusionInsight Manager again. If yes, the enablement is successful. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/hardening_policies.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/hardening_policies.rst new file mode 100644 index 0000000..b3ed7df --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/hardening_policies.rst @@ -0,0 +1,53 @@ +:original_name: admin_guide_000272.html + +.. _admin_guide_000272: + +Hardening Policies +================== + +Hardening Tomcat +---------------- + +Tomcat is hardened as follows based on open-source software during FusionInsight Manager software installation and use: + +- The Tomcat version is upgraded to the official version. +- Permissions on the directories under applications are set to **500**, and the write permission on some directories is supported. +- The Tomcat installation package is automatically deleted after the system software is installed. +- The automatic deployment function is disabled for projects in application directories. Only the **web**, **cas**, and **client** projects are deployed. +- Some unused **http** methods are disabled, preventing attacks by using the **http** methods. +- The default shutdown port and command of the Tomcat server are changed to prevent hackers from shutting down the server and attacking servers and applications. +- To ensure security, the value of **maxHttpHeaderSize** is changed, which enables server administrators to control abnormal requests of clients. +- The Tomcat version description file is modified after Tomcat is installed. +- To prevent disclosure of Tomcat information, the Server attributes of Connector are modified so that attackers cannot obtain information about the server. +- Permissions on files and directories of Tomcat, such as the configuration files, executable files, log directories, and temporary folders, are under control. +- Session facade recycling is disabled to prevent request leakage. +- LegacyCookieProcessor is used as CookieProcessor to prevent the leakage of sensitive data in cookies. + +Hardening LDAP +-------------- + +LDAP is hardened as follows after a cluster is installed: + +- In the LDAP configuration file, the password of the administrator account is encrypted using SHA. 
After the OpenLDAP is upgraded to 2.4.39 or later, data is automatically synchronized between the active and standby LDAP nodes using the SASL External mechanism, which prevents disclosure of the password. +- The LDAP service in the cluster supports the SSLv3 protocol by default, which can be used safely. When the OpenLDAP is upgraded to 2.4.39 or later, the LDAP automatically uses TLS1.0 or later to prevent unknown security risks. + +Hardening JDK +------------- + +- If the client process uses the AES256 encryption algorithm, JDK security hardening is required. The operations are as follows: + + Obtain the Java Cryptography Extension (JCE) package whose version matches that of JDK. The JCE package contains **local_policy.jar** and **US_export_policy.jar**. Copy the JAR files to the following directory and replace the files in the directory. + + - Linux: *JDK installation directory*\ **/jre/lib/security** + - Windows: *JDK installation directory*\ **\\jre\\lib\\security** + + .. note:: + + Access the Open JDK open-source community to obtain the JCE file. + +- If the client process uses the SM4 encryption algorithm, the JAR package needs to be updated. + + Obtain **SMS4JA.jar** in the *client installation directory*\ **/JDK/jdk/jre/lib/ext/** directory, and copy the JAR package to the following directory: + + - Linux: *JDK installation directory*\ **/jre/lib/ext/** + - Windows: *JDK installation directory*\ **\\jre\\lib\\ext\\** diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/hardening_the_ldap.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/hardening_the_ldap.rst new file mode 100644 index 0000000..772e427 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/hardening_the_ldap.rst @@ -0,0 +1,43 @@ +:original_name: admin_guide_000280.html + +.. _admin_guide_000280: + +Hardening the LDAP +================== + +Configuring the LDAP Firewall Policy +------------------------------------ + +In the cluster adopting the dual-plane networking, the LDAP is deployed on the service plane. To ensure the LDAP data security, you are advised to configure the firewall policy in the cluster to disable relevant LDAP ports. + +#. Log in to FusionInsight Manager. +#. Click **Cluster**, click the name of the desired cluster, choose **Services** > **LdapServer**, and click **Configurations**. +#. Check the value of **LDAP_SERVER_PORT**, which is the service port of LdapServer. +#. To ensure data security, configure the firewall policy for the whole cluster to disable the LdapServer port based on the customer's firewall environment. + +Enabling the LDAP Audit Log Output +---------------------------------- + +Users can set the audit log output level of the LDAP service and output audit logs in a specified directory, for example, **/var/log/messages**. The logs output can be used to check user activities and operation commands. + +.. note:: + + If the function of LDAP audit log output is enabled, massive logs are generated, affecting the cluster performance. Exercise caution when enabling this function. + +#. Log in to any LdapServer node. + +#. Run the following command to edit the **slapd.conf.consumer** file, and set the value of **loglevel** to **256** (you can run the **man slapd.conf** command on the OS to view the log level definition). 
+ + **cd ${BIGDATA_HOME}/FusionInsight_BASE\_8.1.0.1/install/FusionInsight-ldapserver-2.7.0/ldapserver/local/template** + + **vi slapd.conf.consumer** + + .. code-block:: + + ... + pidfile [PID_FILE_SLAPD_PID] + argsfile [PID_FILE_SLAPD_ARGS] + loglevel 256 + ... + +#. Log in to FusionInsight Manager, click **Cluster**, click the name of the desired cluster, choose **Services** > **LdapServer**. On the displayed page, choose **More** > **Restart Service**. Enter the administrator password and restart the service. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/hfile_and_wal_encryption.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/hfile_and_wal_encryption.rst new file mode 100644 index 0000000..0c0a6a6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/hfile_and_wal_encryption.rst @@ -0,0 +1,244 @@ +:original_name: admin_guide_000276.html + +.. _admin_guide_000276: + +HFile and WAL Encryption +======================== + +.. _admin_guide_000276__s1948b0b624dc4a0caf5f17669ca5244d: + + +HFile and WAL Encryption +------------------------ + +.. important:: + + - Setting the HFile and WAL encryption mode to SMS4 or AES has a great impact on the system and will cause data loss in case of any misoperation. Therefore, this operation is not recommended. + + - Batch data import using Bulkload does not support data encryption. + +HFile and Write ahead log (WAL) in HBase are not encrypted by default. To encrypt them, perform the following operations. + +#. .. _admin_guide_000276__li61064812194556: + + On any HBase node, run the following commands to create a key file as user **omm**: + + **sh ${BIGDATA_HOME}/FusionInsight_HD\_\ 8.1.0.1/install/FusionInsight-HBase-2.2.3/hbase/bin/hbase-encrypt.sh** **/hbase.jks ** + + - *//hbase.jks* indicates the path for storing the generated JKS file. + - ** indicates the encryption type, which can be SMS4 or AES. + - ** indicates the key length. SMS4 supports 16-bit and AES supports 128-bit. + - ** indicate the alias of the key file. When you create the key file for the first time, retain the default value **omm**. + + For example, to generate an SMS4 encryption key, run the following command: + + **sh ${BIGDATA_HOME}/FusionInsight_HD\_\ 8.1.0.1\ /install/FusionInsight-HBase-2.2.3/hbase/bin/hbase-encrypt.sh /home/hbase/conf/hbase.jks SMS4 16 omm** + + To generate an AES encryption key, run the following command: + + **sh ${BIGDATA_HOME}/FusionInsight_HD\_\ 8.1.0.1\ /install/FusionInsight-HBase-2.2.3/hbase/bin/hbase-encrypt.sh /home/hbase/conf/hbase.jks AES 128 omm** + + .. note:: + + - To ensure operations can be successfully performed, the **/hbase.jks** directory needs to be created in advance, and the cluster operation user must have the **rw** permission of this directory. + - After running the command, enter the same ** four times. The password encrypted in :ref:`3 ` is the same as the password in this step. + +#. Distribute the generated key files to the same directory on all nodes in the cluster and assign read and write permission to user **omm**. + + .. note:: + + - Administrators need to select a safe procedure to distribute keys based on the enterprise security requirements. + - If the key files of some nodes are lost, repeat the step to copy the key files from other nodes. + +#. .. 
_admin_guide_000276__li59351885194556: + + On FusionInsight Manager, set **hbase.crypto.keyprovider.parameters.encryptedtext** to the encrypted password. Set **hbase.crypto.keyprovider.parameters.uri** to the path and name of the key file. + + - The format of **hbase.crypto.keyprovider.parameters.uri** is **jceks://**\ **. + + ** indicates the path of the key file. For example, if the path of the key file is **/home/hbase/conf/hbase.jks**, set this parameter to **jceks:///home/hbase/conf/hbase.jks**. + + - The format of **hbase.crypto.keyprovider.parameters.encryptedtext** is **. + + ** indicates the encrypted password generated during the key file creation. The parameter value is displayed in ciphertext. Run the following command as user **omm** to obtain the related encrypted password on the nodes where HBase service is installed: + + **sh ${BIGDATA_HOME}/FusionInsight_HD\_\ 8.1.0.1\ /install/FusionInsight-HBase-2.2.3/hbase/bin/hbase-encrypt.sh** + + .. note:: + + After running the command, you need to enter ****. The password is the same as that entered in :ref:`1 `. + +#. On FusionInsight Manager, set **hbase.crypto.key.algorithm** to **SMS4** or **AES** to use SMS4 or AES for HFile encryption. + +#. On FusionInsight Manager, set **hbase.crypto.wal.algorithm** to **SMS4** or **AES** to use SMS4 or AES for WAL encryption. + +#. On FusionInsight Manager, set **hbase.regionserver.wal.encryption** to **true**. + +#. .. _admin_guide_000276__li42092055194556: + + Save the settings and restart the HBase service for the settings to take effect. + +#. .. _admin_guide_000276__li50092082194556: + + Create an HBase table through CLI or code and configure the encryption mode to enable encryption. **** indicates the encryption type, and **d** indicates the column family. + + - When you create an HBase table through CLI, set the encryption mode to SMS4 or AES for the column family. + + **create** '*
*', {*NAME => 'd'*, **ENCRYPTION => '**\ **\ **'**} + + - When you create an HBase table using code, set the encryption mode to SMS4 or AES by adding the following information to the code: + + .. code-block:: + + public void testCreateTable() + { + String tableName = "user"; + Configuration conf = getConfiguration(); + HTableDescriptor htd = new HTableDescriptor(TableName.valueOf(tableName)); + + HColumnDescriptor hcd = new HColumnDescriptor("d"); + //Set the encryption mode to SMS4 or AES. + hcd.setEncryptionType(""); + htd.addFamily(hcd); + + HBaseAdmin admin = null; + try + { + admin = new HBaseAdmin(conf); + + if(!admin.tableExists(tableName)) + { + admin.createTable(htd); + } + } + catch (IOException e) + { + e.printStackTrace(); + } + finally + { + if(admin != null) + { + try + { + admin.close(); + } + catch (IOException e) + { + e.printStackTrace(); + } + } + } + } + +#. If you have configured SMS4 or AES encryption by performing :ref:`1 ` to :ref:`7 `, but do not set the related encryption parameter when creating the table in :ref:`8 `, the inserted data is not encrypted. + + In this case, you can perform the following steps to encrypt the inserted data: + + a. Run the **flush** command for the table to import the data in the memory to the HFile. + + **flush**\ *''* + + b. Run the following commands to modify the table properties: + + **disable**\ *''* + + **alter**\ *''*\ **,**\ **NAME=>**\ *''*\ **,**\ **ENCRYPTION =>** **'**\ *\ *\ **'** + + **enable**\ *''* + + c. Insert a new data record and flush the table. + + .. note:: + + A new data record must be inserted so that the HFile will generate a new HFile and the unencrypted data inserted previously will be rewritten and encrypted. + + **put**\ *''*,\ **'\ id2','f1:c1','value222222222222222222222222222222222'** + + **flush**\ *''* + + d. Perform the following step to rewrite the HFile: + + **major_compact**'*'* + + .. important:: + + During this step, the HBase table is disabled and cannot provide services. Exercise caution when you perform this step. + +Modifying a Key File +-------------------- + +.. important:: + + Modifying a key file has a great impact on the system and will cause data loss in case of any misoperation. Therefore, this operation is not recommended. + +During the :ref:`HFile and WAL Encryption ` operation, the related key file must be generated and its password must be set to ensure system security. After a period of running, you can replace the key file with a new one to encrypt HFile and WAL. + +#. Run the following command to generate a new key file as user **omm**: + + **sh ${BIGDATA_HOME}/FusionInsight_HD\_\ 8.1.0.1\ /install/FusionInsight-HBase-2.2.3/hbase/bin/hbase-encrypt.sh** */hbase.jks* * * + + - */hbase.jks*: indicates the path for storing the generated **hbase.jks** file. The path and file name must be consistent with those of the key file generated in :ref:`HFile and WAL Encryption `. + - **: indicates the alias of the key file. The alias must be different with that of the old key file. + - **: indicates the encryption type, which can be SMS4 or AES. + - ** indicates the key length. SMS4 supports 16-bit and AES supports 128-bit. 
+ + For example, to generate an SMS4 encryption key, run the following command: + + **sh ${BIGDATA_HOME}/FusionInsight_HD\_\ 8.1.0.1\ /install/FusionInsight-HBase-2.2.3/hbase/bin/hbase-encrypt.sh /home/hbase/conf/hbase.jks SMS4 16 omm_new** + + To generate an AES encryption key, run the following command: + + **sh ${BIGDATA_HOME}/FusionInsight_HD\_\ 8.1.0.1\ /install/FusionInsight-HBase-2.2.3/hbase/bin/hbase-encrypt.sh /home/hbase/conf/hbase.jks AES 128 omm_new** + + .. note:: + + - To ensure operations can be successfully performed, the **/hbase.jks** directory needs to be created in advance, and the cluster operation user must have the **rw** permission of this directory. + - After running the command, you need to enter the same ** for three times. This password is the password of the key file. You can use the password of the old file without any security risk. + +#. .. _admin_guide_000276__li5110157194747: + + Distribute the generated key files to the same directory on all nodes in the cluster and assign read and write permission to user **omm**. + + .. note:: + + Administrators need to select a safe procedure to distribute keys based on the enterprise security requirements. + +#. .. _admin_guide_000276__li34317298194747: + + On the HBase service configuration page of FusionInsight Manager, add custom configuration items, set **hbase.crypto.master.key.name** to **omm_new**, set **hbase.crypto.master.alternate.key.name** to **omm**, and save the settings. + +#. .. _admin_guide_000276__li40420234194747: + + Restart the HBase service for the configuration to take effect. + +#. In HBase shell, run the **major compact** command to generate the HFile file based on the new encryption algorithm. + + **major_compact** *''* + +#. You can view the major compact progress from the HMaster web page. + + |image1| + +#. When all items in **Compaction Progress** reach **100%** and those in **Remaining KVs** are **0**, run the following command as user **omm** to destroy the old key file: + + **sh ${BIGDATA_HOME}/FusionInsight_HD\_\ 8.1.0.1\ /install/FusionInsight-HBase-2.2.3/hbase/bin/hbase-encrypt.sh** */hbase.jks * + + - */hbase.jks*: indicates the path for storing the generated **hbase.jks** file. The path and file name must be consistent with those of the key file generated in :ref:`HFile and WAL Encryption `. + - **: indicates the alias of the old key file to be deleted. + + For example: + + **sh ${BIGDATA_HOME}/FusionInsight_HD\_\ 8.1.0.1\ /install/FusionInsight-HBase-2.2.3/hbase/bin/hbase-encrypt.sh /home/hbase/conf/hbase.jks omm** + + .. note:: + + To ensure operations can be successfully performed, the **/hbase.jks** directory needs to be created in advance, and the cluster operation user must have the **rw** permission of this directory. + +#. Repeat :ref:`2 ` and distribute the updated key files again. + +#. Delete the HBase self-defined configuration item **hbase.crypto.master.alternate.key.name** added in :ref:`3 ` from FusionInsight Manager. + +#. Repeat :ref:`4 ` for the configuration take effect. + +.. 
|image1| image:: /_static/images/en-us_image_0000001369886993.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/index.rst new file mode 100644 index 0000000..d7c490f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/index.rst @@ -0,0 +1,34 @@ +:original_name: admin_guide_000271.html + +.. _admin_guide_000271: + +Security Hardening +================== + +- :ref:`Hardening Policies ` +- :ref:`Configuring a Trusted IP Address to Access LDAP ` +- :ref:`HFile and WAL Encryption ` +- :ref:`Configuring Hadoop Security Parameters ` +- :ref:`Configuring an IP Address Whitelist for Modification Allowed by HBase ` +- :ref:`Updating a Key for a Cluster ` +- :ref:`Hardening the LDAP ` +- :ref:`Configuring Kafka Data Encryption During Transmission ` +- :ref:`Configuring HDFS Data Encryption During Transmission ` +- :ref:`Encrypting the Communication Between the Controller and the Agent ` +- :ref:`Updating SSH Keys for User omm ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + hardening_policies + configuring_a_trusted_ip_address_to_access_ldap + hfile_and_wal_encryption + configuring_hadoop_security_parameters + configuring_an_ip_address_whitelist_for_modification_allowed_by_hbase + updating_a_key_for_a_cluster + hardening_the_ldap + configuring_kafka_data_encryption_during_transmission + configuring_hdfs_data_encryption_during_transmission + encrypting_the_communication_between_the_controller_and_the_agent + updating_ssh_keys_for_user_omm diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/updating_a_key_for_a_cluster.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/updating_a_key_for_a_cluster.rst new file mode 100644 index 0000000..74b0d83 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/updating_a_key_for_a_cluster.rst @@ -0,0 +1,69 @@ +:original_name: admin_guide_000279.html + +.. _admin_guide_000279: + +Updating a Key for a Cluster +============================ + +Scenario +-------- + +When a cluster is installed, an encryption key is generated automatically by the system so that the security information in the cluster (such as all database user passwords and key file access passwords) can be stored in encryption mode. After the cluster is installed, if the original key is accidentally disclosed or a new key is required, you can manually update the key. + +Impact on the System +-------------------- + +- After a cluster key is updated, a new key is generated randomly in the cluster. This key is used to encrypt and decrypt the newly stored data. The old key is not deleted, and it is used to decrypt data encrypted using the old key. After security information is modified, for example, a database user password is changed, the new password is encrypted using the new key. +- When a key is updated for a cluster, the cluster must be stopped and cannot be accessed. + +Prerequisites +------------- + +- You have obtained the IP addresses of the active and standby management nodes. For details, see :ref:`Logging In to the Management Node `. +- You have stopped the upper-layer service applications that depend on the cluster. 
+ +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* and click **Stop**. In the dialog box that is displayed, enter the password of the current user + + and click **OK**. Wait for a while until a message indicating that the operation is successful is displayed. + +#. Log in to the active management node as user **omm**. + +#. Run the following command to disable logout upon timeout: + + **TMOUT=0** + + .. note:: + + After the operations in this section are complete, run the **TMOUT=**\ *Timeout interval* command to restore the timeout interval in a timely manner. For example, **TMOUT=600** indicates that a user is logged out if the user does not perform any operation within 600 seconds. + +#. Run the following command to go to the related directory: + + **cd ${BIGDATA_HOME}/om-server/om/tools** + +#. Run the following command to update the cluster key: + + **sh updateRootKey.sh** + + Enter **y** as prompted. + + .. code-block:: + + The root key update is a critical operation. + Do you want to continue?(y/n): + + If the following information is displayed, the key is updated successfully. + + .. code-block:: + + Step 4-1: The key save path is obtained successfully. + ... + Step 4-4: The root key is sent successfully. + +#. On FusionInsight Manager, click **Cluster**, click the name of the desired cluster, and click **Start**. + + In the displayed dialog box, click **OK**. Wait until a message is displayed, indicating that the startup is successful. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/updating_ssh_keys_for_user_omm.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/updating_ssh_keys_for_user_omm.rst new file mode 100644 index 0000000..700bd2c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_hardening/updating_ssh_keys_for_user_omm.rst @@ -0,0 +1,86 @@ +:original_name: admin_guide_000285.html + +.. _admin_guide_000285: + +Updating SSH Keys for User omm +============================== + +Scenario +-------- + +During cluster installation, the system automatically generate the SSH public key and private key for user **omm** to establish the trust relationship between nodes. After the cluster is installed, if the original keys are accidentally disclosed or new keys are used, the system administrator can perform the following operations to manually change the keys. + +Prerequisites +------------- + +- The cluster has been stopped. +- No other management operations are being performed. + +Procedure +--------- + +#. Log in as user **omm** to the node whose SSH keys need to be replaced. + + If the node is a Manager management node, run the following command on the active management node. + +#. Run the following command to disable logout upon timeout: + + **TMOUT=0** + + .. note:: + + After the operations in this section are complete, run the **TMOUT=**\ *Timeout interval* command to restore the timeout interval in a timely manner. For example, **TMOUT=600** indicates that a user is logged out if the user does not perform any operation within 600 seconds. + +#. 
Run the following command to generate a key for the node: + + - If the node is a Manager management node, run the following command: + + **sh ${CONTROLLER_HOME}/sbin/update-ssh-key.sh** + + - If the node is a non-Manager management node, run the following command: + + **sh ${NODE_AGENT_HOME}/bin/update-ssh-key.sh** + + If "Succeed to update ssh private key." is displayed when the preceding command is executed, the SSH key is generated successfully. + +4. Run the following command to copy the public key of the node to the active management node: + + **scp ${HOME}/.ssh/id_rsa.pub** *oms_ip*\ **:${HOME}/.ssh/id_rsa.pub_bak** + + *oms_ip*: indicates the IP address of the active management node. + + Enter the password of user **omm** to copy the files. + +5. Log in to the active management node as user **omm**. + +6. Run the following command to disable logout on system timeout: + + **TMOUT=0** + +7. Run the following command to go to the related directory: + + **cd ${HOME}/.ssh** + +8. Run the following command to add new public keys: + + **cat id_rsa.pub_bak >> authorized_keys** + +9. Run the following command to move the temporary public key file, for example, **/tmp**. + + **mv -f id_rsa.pub_bak** **/tmp** + +10. Copy the **authorized_keys** file of the active management node to the other nodes in the cluster: + + **scp authorized_keys** *node_ip*\ **:/${HOME}/.ssh/authorized_keys** + + *node_ip*: indicates the IP address of another node in the cluster. Multiple IP addresses are not supported. + +11. Run the following command to confirm private key replacement without entering the password: + + **ssh** *node_ip* + + *node_ip*: indicates the IP address of another node in the cluster. Multiple IP addresses are not supported. + +12. Log in to FusionInsight Manager. On **Homepage**, locate the desired cluster and choose |image1| > **Start** to start the cluster. + +.. |image1| image:: /_static/images/en-us_image_0263899299.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/account_maintenance_suggestions.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/account_maintenance_suggestions.rst new file mode 100644 index 0000000..81886ac --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/account_maintenance_suggestions.rst @@ -0,0 +1,12 @@ +:original_name: admin_guide_000289.html + +.. _admin_guide_000289: + +Account Maintenance Suggestions +=============================== + +It is recommended that the administrator conduct routine checks on the accounts. The check covers the following items: + +- Check whether the accounts of the OS, FusionInsight Manager, and each component are necessary and whether temporary accounts have been deleted. +- Check whether the permissions of the accounts are appropriate. Different administrators have different rights. +- Check and audit the logins and operation records of all types of accounts. 
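For the OS accounts on cluster nodes, part of this routine check can be scripted. The following is a minimal sketch for a Linux node and is an example only; the commands and scope are assumptions to be adapted to the enterprise's own audit requirements. Accounts created on FusionInsight Manager and in components are checked on the corresponding user management and audit pages instead.

.. code-block::

   # List all OS accounts with their UIDs.
   cut -d: -f1,3 /etc/passwd

   # Flag any account other than root that has UID 0.
   awk -F: '$3 == 0 && $1 != "root" {print "Unexpected UID 0 account: " $1}' /etc/passwd

   # Review when each account last logged in.
   lastlog

   # Review the most recent login records for auditing.
   last -a | head -n 20

Run the commands as user **root** on each node, and record any account that is no longer required so that it can be deleted in a timely manner.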
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/index.rst new file mode 100644 index 0000000..eb3441f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000287.html + +.. _admin_guide_000287: + +Security Maintenance +==================== + +- :ref:`Account Maintenance Suggestions ` +- :ref:`Password Maintenance Suggestions ` +- :ref:`Log Maintenance Suggestions ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + account_maintenance_suggestions + password_maintenance_suggestions + log_maintenance_suggestions diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/log_maintenance_suggestions.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/log_maintenance_suggestions.rst new file mode 100644 index 0000000..6b5bc28 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/log_maintenance_suggestions.rst @@ -0,0 +1,23 @@ +:original_name: admin_guide_000291.html + +.. _admin_guide_000291: + +Log Maintenance Suggestions +=========================== + +Operation logs help discover exceptions such as illegal operations and logins by unauthorized users. The system records important operations in logs. You can use operation logs to locate problems. + +Checking Logs Regularly +----------------------- + +Check system logs periodically and handle exceptions such as unauthorized operations or logins in a timely manner. + +Backing Up Logs Regularly +------------------------- + +The audit logs provided by FusionInsight Manager and the cluster record user activities and operations. You can export the audit logs on FusionInsight Manager. If there are too many audit logs in the system, you can configure dump parameters to dump audit logs to a specified server so that the disk space of cluster nodes remains sufficient. + +Maintenance Owner +----------------- + +Network monitoring engineers and system maintenance engineers diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/password_maintenance_suggestions.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/password_maintenance_suggestions.rst new file mode 100644 index 0000000..1eceae4 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_maintenance/password_maintenance_suggestions.rst @@ -0,0 +1,17 @@ +:original_name: admin_guide_000290.html + +.. _admin_guide_000290: + +Password Maintenance Suggestions +================================ + +User identity authentication is a must for accessing the application system. The complexity and validity period of user accounts and passwords must meet customers' security requirements. + +The password maintenance suggestions are as follows: + +#. Dedicated personnel must be arranged to manage the OS passwords. +#. The passwords must meet the complexity requirements, such as the minimum password length and character types. +#. Passwords must be encrypted before transfer. Generally, do not transfer passwords using emails. +#. 
Passwords must be encrypted in configuration files. +#. Enterprise users need to change the passwords when the system is handed over. +#. Passwords must be periodically changed. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/authentication_policies.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/authentication_policies.rst new file mode 100644 index 0000000..c825423 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/authentication_policies.rst @@ -0,0 +1,78 @@ +:original_name: admin_guide_000237.html + +.. _admin_guide_000237: + +Authentication Policies +======================= + +The big data platform performs user identity authentication to prevent invalid users from accessing the cluster. The cluster provides authentication capabilities in both security mode and normal mode. + +Security Mode +------------- + +Clusters in security mode use the Kerberos protocol for security authentication. The Kerberos protocol supports mutual authentication between clients and servers. This eliminates the risks incurred by sending user credentials over the network for simulated authentication. In clusters, KrbServer provides the Kerberos authentication support. + +**Kerberos user object** + +In the Kerberos protocol, each user object is a principal. A complete principal consists of a username and a domain name. In O&M or application development scenarios, the user identity must be verified before a client connects to a server. Users for O&M and service operations are classified into human-machine and machine-machine users. The password of a human-machine user is manually configured, while the password of a machine-machine user is randomly generated by the system. + +**Kerberos authentication** + +Kerberos supports password and keytab authentication. The validity period of authentication is 24 hours by default. + +- Password authentication: User identity is verified by entering the correct password. This mode is mainly used in O&M scenarios where human-machine users are used. The configuration command is **kinit** *Username*. +- Keytab authentication: Keytab files contain users' principal and encrypted credential information. When keytab files are used for authentication, the system automatically uses the encrypted credential information to perform authentication and the user password does not need to be entered. This mode is mainly used in component application development scenarios where machine-machine users are used. Keytab authentication can also be configured using the **kinit** command. + +Normal Mode +----------- + +Different components in a normal cluster use the native open-source authentication mode and do not support the **kinit** authentication command. FusionInsight Manager (including DBService, KrbServer, and LdapServer) uses the username and password for authentication. :ref:`Table 1 ` lists the authentication modes used by components. + +.. _admin_guide_000237__t7abcbec3c9ea4f04b9e226dbe9d4ca38: + +.. 
table:: **Table 1** Component authentication modes + + +-----------------------------------+-------------------------------------------------+ + | Service | Authentication Mode | + +===================================+=================================================+ + | ClickHouse | Simple authentication | + +-----------------------------------+-------------------------------------------------+ + | Flume | No authentication | + +-----------------------------------+-------------------------------------------------+ + | HBase | - Web UI: No authentication | + | | - Client: simple authentication | + +-----------------------------------+-------------------------------------------------+ + | HDFS | - Web UI: no authentication | + | | - Client: simple authentication | + +-----------------------------------+-------------------------------------------------+ + | Hive | Simple authentication | + +-----------------------------------+-------------------------------------------------+ + | Hue | Username and password authentication | + +-----------------------------------+-------------------------------------------------+ + | Kafka | No authentication | + +-----------------------------------+-------------------------------------------------+ + | Loader | - Web UI: username and password authentication | + | | - Client: no authentication | + +-----------------------------------+-------------------------------------------------+ + | MapReduce | - Web UI: no authentication | + | | - Client: no authentication | + +-----------------------------------+-------------------------------------------------+ + | Oozie | - Web UI: username and password authentication | + | | - Client: simple authentication | + +-----------------------------------+-------------------------------------------------+ + | Spark2x | - Web UI: no authentication | + | | - Client: simple authentication | + +-----------------------------------+-------------------------------------------------+ + | Storm | No authentication | + +-----------------------------------+-------------------------------------------------+ + | YARN | - Web UI: no authentication | + | | - Client: simple authentication | + +-----------------------------------+-------------------------------------------------+ + | ZooKeeper | Simple authentication | + +-----------------------------------+-------------------------------------------------+ + +The authentication modes are as follows: + +- Simple authentication: When the client connects to the server, the client automatically authenticates the user (for example, the OS user **root** or **omm**) by default. The authentication is imperceptible to the administrator or service user, which does not require **kinit**. +- Username and password authentication: Use the username and password of human-machine users in the cluster for authentication. +- No authentication: Any user can access the server by default. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/default_permission_information.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/default_permission_information.rst new file mode 100644 index 0000000..36005a6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/default_permission_information.rst @@ -0,0 +1,112 @@ +:original_name: admin_guide_000240.html + +.. 
_admin_guide_000240: + +Default Permission Information +============================== + +Role +---- + ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Default Role | Description | ++===================================+================================================================================================================================================================================================================================================================================+ +| Manager_administrator | Manager administrator who has all permissions for Manager. | +| | | +| | Manager administrators can create first-level tenants, create and modify user groups, and specify user permissions. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_operator | Manager operator who has all the permissions on the **Homepage**, **Cluster**, **Hosts**, and **O&M** tab pages. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_auditor | Manager auditor who has all permissions on the **Audit** tab page. | +| | | +| | Manager auditors can view and manage Manager system audit logs. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_viewer | Manager viewer who has the permission to view information about **Homepage**, **Cluster**, **Hosts**, **Alarm**, **Events**, and **System > Permission**. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_tenant | Manager tenant administrator. | +| | | +| | This role can create and manage sub-tenants for the non-leaf tenants to which the current user belongs. It has the permission to view alarms and events on **O&M > Alarm**. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| System_administrator | System administrator, this role has Manager system administrator rights and all services administrator rights. 
| ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| default | This role is the default role created for the **default** tenant. It has the management permissions on the Yarn component and the default queue. The default role of the default tenant that is not the first cluster to be installed is **c**\ **\ **\_default**. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_administrator_180 | FusionInsight Manager System administrator group. Internal system user group, which is used only between components. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_auditor_181 | FusionInsight Manager system auditor group. Internal system user group, which is used only between components. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_operator_182 | FusionInsight Manager system operator group. Internal system user group, which is used only between components. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_viewer_183 | FusionInsight Manager system viewer group. Internal system user group, which is used only between components. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| System_administrator_186 | System administrator group. Internal system user group, which is used only between components. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_tenant_187 | Tenant system user group. Internal system user group, which is used only between components. 
| ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| default_1000 | This group is created for tenant. Internal system user group, which is used only between components. | ++-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _admin_guide_000240__section1031812876: + +User group +---------- + ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Type | Default User Group | Description | ++===============+====================+==================================================================================================================================================================================================================================================================================================================================================+ +| OS User Group | hadoop | Users added to this group are granted the permission to submit all Yarn queue tasks. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | hadoopmanager | Users added to this user group can have the O&M manager rights of HDFS and Yarn. The O&M manager of HDFS can access the NameNode WebUI and perform active to standby switchover manually. The O&M manager of Yarn can access the ResourceManager WebUI, operate NodeManager nodes, refresh queues, and set node labels, but cannot submit tasks. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | hive | Common user group. Hive users must belong to this user group. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | hive1 | Common user group. Hive1 users must belong to this user group. 
| ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | hive2 | Common user group. Hive2 users must belong to this user group. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | hive3 | Common user group. Hive3 users must belong to this user group. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | hive4 | Common user group. Hive4 users must belong to this user group. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | kafka | Kafka common user group. A user in this group can access a topic only when a user in the kafkaadmin group grants the read and write permission of the topic to the user. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | kafkaadmin | Kafka administrator group. Users in this group have the rights to create, delete, authorize, read, and write all topics. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | kafkasuperuser | Topic read/write user group of Kafka. Users added to this group have the read and write permissions on all topics. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | storm | Users who are added to the storm user group can submit topologies and manage their own topologies. 
| ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | stormadmin | Users who are added to the stormadmin user group can have the storm administrator rights and can submit topologies and manage all topologies. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | supergroup | Users added to this user group can have the administrator rights of HBase, HDFS and Yarn and can use Hive. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | yarnviewgroup | Indicates the read-only user group of the Yarn task. Users in this user group can have the view permission on Yarn and MapReduce tasks. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | check_sec_ldap | Perform internal test on the active LDAP to see whether it works properly. This user group is generated randomly in a test and automatically deleted after the test is complete. Internal system user group, which is used only between components. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | compcommon | System internal group for accessing cluster system resources. All system users and system running users are added to this user group by default. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| OS User Group | wheel | Primary group of the FusionInsight internal running user omm. 
| ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| | ficommon | System common group that corresponds to **compcommon** for accessing cluster common resource files stored in the OS. | ++---------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. note:: + + If the current cluster is not the cluster that is installed for the first time in FusionInsight Manager, the default user group name of all components except Manager in the cluster is **c**\ **\ \_ *default user group name*, for example, **c2_hadoop**. + +User +---- + +For details, see :ref:`User Account List `. + +Service-related User Security Parameters +---------------------------------------- + +- **HDFS** + + The **dfs.permissions.superusergroup** parameter specifies the administrator group with the highest permission on the HDFS. The default value is **supergroup**. + +- **Spark2x and Corresponding Multi-Instances** + + The **spark.admin.acls** parameter specifies the administrator list of the Spark2x. Members in the list are authorized to manage all Spark tasks. Users not added in the list cannot manage all Spark tasks. The default value is **admin**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/fusioninsight_manager_security_functions.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/fusioninsight_manager_security_functions.rst new file mode 100644 index 0000000..14f16ac --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/fusioninsight_manager_security_functions.rst @@ -0,0 +1,24 @@ +:original_name: admin_guide_000241.html + +.. _admin_guide_000241: + +FusionInsight Manager Security Functions +======================================== + +You can query and set user rights data through the following FusionInsight Manager modules: + +- User management: Users can be added, deleted, modified, queried, bound to user groups, and assigned with roles. + + For details, see :ref:`Managing Users `. + +- User group management: User groups can be added, deleted, modified, queried, and bound to roles. + + For details, see :ref:`Managing User Groups `. + +- Role management: Roles can be added, deleted, modified, queried, and assigned with the resource access rights of one or multiple components. + + For details, see :ref:`Managing Roles `. + +- Tenant management: Tenants can be added, deleted, modified, queried, and bound to component resources. FusionInsight generates a role for each tenant to facilitate management. If a tenant is assigned with the rights of some resources, its corresponding role also has these rights. + + For details, see :ref:`Tenant Resources `. 
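The user and user group data behind these modules is stored in LDAP (see Right Mechanism). For read-only verification or troubleshooting, an administrator could in principle query the directory with the standard **ldapsearch** tool; the sketch below is illustrative only, the LdapServer address and port are placeholders that depend on the actual deployment, and all user, user group, role, and tenant changes should still be made on FusionInsight Manager.

.. code-block:: bash

   # Illustrative read-only query of user entries in the cluster LDAP.
   # ldap_server_ip and ldap_port are placeholders; use the LdapServer address
   # and port of the actual cluster. The bind DN below is the query-only
   # pg_search_dn account described in the user account list.
   ldapsearch -x -LLL \
     -H ldap://ldap_server_ip:ldap_port \
     -D "cn=pg_search_dn,ou=Users,dc=hadoop,dc=com" -W \
     -b "ou=Users,dc=hadoop,dc=com" \
     "(objectClass=*)" cn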
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/index.rst new file mode 100644 index 0000000..73f11e5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/index.rst @@ -0,0 +1,26 @@ +:original_name: admin_guide_000234.html + +.. _admin_guide_000234: + +Security Overview +================= + +- :ref:`Right Model ` +- :ref:`Right Mechanism ` +- :ref:`Authentication Policies ` +- :ref:`Permission Verification Policies ` +- :ref:`User Account List ` +- :ref:`Default Permission Information ` +- :ref:`FusionInsight Manager Security Functions ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + right_model + right_mechanism + authentication_policies + permission_verification_policies + user_account_list + default_permission_information + fusioninsight_manager_security_functions diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/permission_verification_policies.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/permission_verification_policies.rst new file mode 100644 index 0000000..42554b5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/permission_verification_policies.rst @@ -0,0 +1,77 @@ +:original_name: admin_guide_000238.html + +.. _admin_guide_000238: + +Permission Verification Policies +================================ + +Security Mode +------------- + +After a user is authenticated by the big data platform, the system determines whether to verify the user's permission based on the actual permission management configuration to ensure that the user has limited or full permissions on resources. If the user does not have the permission to access cluster resources, the system administrator must grant the required permission to the user. Otherwise, the user cannot access the resources. The cluster provides permission verification capabilities in both security mode and normal mode. The specific permission items of the components are the same in the two modes. + +By default, the Ranger service is installed and Ranger authentication is enabled for a newly installed cluster in security mode. You can set fine-grained security access policies for accessing component resources through the permission plug-in of each component. If Ranger authentication is not required, administrators can manually disable it on the service page. After Ranger authentication is disabled, the system continues to perform permission control based on the role model of FusionInsight Manager when component resources are accessed. + +In a cluster in security mode, the following components support Ranger authentication: HDFS, YARN, Kafka, Hive, HBase, Storm, Impala, and Spark2x. + +For a cluster upgraded from an earlier version, Ranger authentication is not used by default when users access component resources. The administrator can manually enable Ranger authentication after installing Ranger. + +By default, all components in a cluster in security mode authenticate access. The authentication function cannot be disabled. + +Normal Mode +----------- + +Different components in a normal cluster use their own native open-source authentication behavior. 
:ref:`Table 1 ` lists detailed permission verification modes. + +In a normal cluster, Ranger supports permission control on component resources based on OS users. The following components support Ranger authentication: HBase, HDFS, Hive, Spark2x, and YARN. + +.. _admin_guide_000238__ta45bf66853314ecc850b8e6d38b236e9: + +.. table:: **Table 1** Component permission verification modes in normal clusters + + +------------+-------------------------+------------------------------------------------+ + | Service | Permission Verification | Permission Verification Enabling and Disabling | + +============+=========================+================================================+ + | ClickHouse | Required | Not supported | + +------------+-------------------------+------------------------------------------------+ + | Flume | Not required | Not supported | + +------------+-------------------------+------------------------------------------------+ + | HBase | Not required | Supported | + +------------+-------------------------+------------------------------------------------+ + | HDFS | Required | Supported | + +------------+-------------------------+------------------------------------------------+ + | Hive | Not required | Not supported | + +------------+-------------------------+------------------------------------------------+ + | Hue | Not required | Not supported | + +------------+-------------------------+------------------------------------------------+ + | Kafka | Not required | Not supported | + +------------+-------------------------+------------------------------------------------+ + | Loader | Not required | Not supported | + +------------+-------------------------+------------------------------------------------+ + | MapReduce | Not required | Not supported | + +------------+-------------------------+------------------------------------------------+ + | Oozie | Required | Not supported | + +------------+-------------------------+------------------------------------------------+ + | Spark2x | Not required | Not supported | + +------------+-------------------------+------------------------------------------------+ + | Storm | Not required | Not supported | + +------------+-------------------------+------------------------------------------------+ + | YARN | Not required | Supported | + +------------+-------------------------+------------------------------------------------+ + | ZooKeeper | Required | Supported | + +------------+-------------------------+------------------------------------------------+ + +Condition Priorities of the Ranger Permission Policy +---------------------------------------------------- + +When configuring a permission policy for a resource, you can configure Allow Conditions, Exclude from Allow Conditions, Deny Conditions, and Exclude from Deny Conditions for the resource, to meet unexpected requirements in different scenarios. + +The priorities of different conditions are listed in descending order: Exclude from Deny Conditions > Deny Conditions > Exclude from Allow Conditions > Allow Conditions + +The following figure shows the process of determining condition priorities. If the component resource request does not match the permission policy in Ranger, the system rejects the access by default. However, for HDFS and Yarn, the system delivers the decision to the access control layer of the component for determination. 
+ +|image1| + +For example, to grant the read and write permissions on the **FileA** folder to all users in the **groupA** user group except **UserA**, you can configure an allow condition for **groupA** and add **UserA** to the exclude from allow conditions. + +.. |image1| image:: /_static/images/en-us_image_0265768517.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/right_mechanism.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/right_mechanism.rst new file mode 100644 index 0000000..f3b9688 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/right_mechanism.rst @@ -0,0 +1,39 @@ +:original_name: admin_guide_000236.html + +.. _admin_guide_000236: + +Right Mechanism +=============== + +FusionInsight adopts the Lightweight Directory Access Protocol (LDAP) to store data of users and user groups. Information about role definitions is stored in the relational database, and the mapping between roles and rights is saved in components. + +FusionInsight uses Kerberos for unified authentication. + +The verification process of user rights is as follows: + +#. A client (a user terminal or FusionInsight component service) invokes the FusionInsight authentication interface. +#. FusionInsight uses the login username and password for Kerberos authentication. +#. If the authentication succeeds, the client sends a request for accessing the server (a FusionInsight component service). +#. The server finds the user group and role to which the login user belongs. +#. The server obtains all rights of the user group and the role. +#. The server checks whether the client has the right to access the resources it applies for. + +**Example (RBAC):** + +There are three files in HDFS, namely, fileA, fileB, and fileC. + +- roleA has the read and write rights for fileA, and roleB has the read right for fileB. +- groupA is bound to roleA, and groupB is bound to roleB. +- userA belongs to groupA and is bound to roleB, and userB belongs to groupB. + +When userA successfully logs in to the system and accesses the HDFS: + +#. HDFS obtains the role (roleB) to which userA is bound. +#. HDFS also obtains the role (roleA) to which the user group of userA is bound. +#. In this case, userA has all the rights of roleA and roleB. +#. As a result, userA has the read and write rights for fileA, has the read right for fileB, and has no right for fileC. + +Similarly, when userB successfully logs in to the system and accesses the HDFS: + +#. userB has only the rights of roleB. +#. As a result, userB has the read right for fileB, and has no rights for fileA and fileC. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/right_model.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/right_model.rst new file mode 100644 index 0000000..2c61b99 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/right_model.rst @@ -0,0 +1,69 @@ +:original_name: admin_guide_000235.html + +.. _admin_guide_000235: + +Right Model +=========== + +Role-based Access Control +------------------------- + +FusionInsight adopts the role-based access control (RBAC) mode to manage rights on the big data system. It integrates the right management functions of the components to centrally manage rights. 
Common users are shielded from internal right management details, and the right management operations are simplified for administrators, improving right management usability and user experience. + +The right model of FusionInsight consists of four parts: users, user groups, roles, and rights. + + +.. figure:: /_static/images/en-us_image_0263899538.png + :alt: **Figure 1** Right model + + **Figure 1** Right model + +- **Right** + + A right, which is defined by a component, allows users to access a certain resource of the component. Different components have different rights for their resources. + + For example: + + - HDFS provides read, write, and execute permissions on files. + - HBase provides create, read, and write permissions on tables. + +- **Role** + + A role is a collection of component rights. Each role can have multiple rights of multiple components. Different roles can have rights on the same resource of one component. + +- **User group** + + A user group is a collection of users. When a user group is bound to a role, users in this group obtain the rights defined by the role. + + Different user groups can be associated with the same role. A user group can also be associated with no role, and such a user group does not have the rights of any component resources. + + .. note:: + + In some components, the system grants related rights to specific user groups by default. + +- **User** + + A user is a visitor to the system. Each user has the rights of the user group and role associated with the user. Users need to be added to a user group or associated with roles to obtain the corresponding rights. + +Policy-based Access Control +--------------------------- + +The Ranger component uses policy-based access control (PBAC) to manage rights and implement fine-grained data access control on components such as HDFS, Hive, and HBase. + +.. note:: + + Each component supports only one right control mechanism. After the Ranger right control policy is enabled for a component, the rights on the component in roles created on FusionInsight Manager become invalid (the ACL rules of HDFS and Yarn still take effect). You need to add a policy on the Ranger management page to grant rights on resources. + +The Ranger right model consists of multiple right policies. A right policy consists of the following parts: + +- Resource + + Resources are provided by components and can be accessed by users, such as HDFS files or folders, queues in Yarn, and databases, tables, and columns in Hive. + +- User + + A user is a visitor to the system. The rights of each user are obtained based on the policies associated with the user. Information about users, user groups, and roles in the LDAP is periodically synchronized to Ranger. + +- Permission + + In a policy, you can configure various access conditions for resources, such as file read and write, permission conditions, rejection conditions, and exception conditions. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/user_account_list.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/user_account_list.rst new file mode 100644 index 0000000..72279df --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_overview/user_account_list.rst @@ -0,0 +1,573 @@ +:original_name: admin_guide_000239.html + +.. 
_admin_guide_000239: + +User Account List +================= + +User Classification +------------------- + +The MRS cluster provides the following three types of users. The system administrator needs to periodically change the passwords. It is not recommended to use the default passwords. + +.. note:: + + This section describes the default users in the MRS cluster. + ++-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| User Type | Description | ++===================================+====================================================================================================================================================================================================================================================================================================================================================+ +| System users | - User created on FusionInsight Manager for O&M and service scenarios. There are two types of users: | +| | | +| | - **Human-machine** user: used in scenarios such as FusionInsight Manager O&M and operations on a component client. When creating a user of this type, you need to set password and confirm password by referring to :ref:`Creating a User `. | +| | - **Machine-machine** user: used for system application development. | +| | | +| | - User who runs OMS processes | ++-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Internal system users | Internal user to perform Kerberos authentication, process communications, save user group information, and associate user permissions. It is recommended that internal system users not be used in O&M scenarios. Operations can be performed as user **admin** or another user created by the system administrator based on service requirements. | ++-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Database users | - User who manages OMS database and accesses data | +| | - User who runs service components (Hue, Hive, Loader, Oozie, Ranger, and DBService) in the database. | ++-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +System Users +------------ + +.. note:: + + - User **root** of the OS is required, the password of user **root** on all nodes must be the same. + - User **Idap** of the OS is required. 
Do not delete this account. Otherwise, the cluster may not work properly. The OS administrator maintains the password management policies. + ++----------------------+-------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+ +| User Type | Username | Initial Password | Description | Password Change Method | ++======================+=============+=======================+==============================================================================================================================================================================================================================================================================+====================================================================================+ +| System administrator | admin | User-defined password | FusionInsight Manager administrator. | For details, see :ref:`Changing the Password for User admin `. | +| | | | | | +| | | | .. note:: | | +| | | | | | +| | | | By default, user **admin** does not have the management permission on other components. For example, when accessing the native UI of a component, the user fails to access the complete component information due to insufficient management permission on the component. | | ++----------------------+-------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+ +| Node OS user | ommdba | Random password | User that creates the system database. This user is an OS user generated on the management node and does not require a unified password. This account cannot be used for remote login. | For details, see :ref:`Changing the Password for an OS User `. | ++----------------------+-------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+ +| | omm | Bigdata123@ | Internal running user of the system. This user is an OS user generated on all nodes and does not require a unified password. 
| | ++----------------------+-------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+ + +Internal System Users +--------------------- + ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| User Type | Default User | Initial Password | Description | Password Change Method | ++============================+===========================================+==================================+==================================================================================================================================================================================================================================================================================+=======================================================================================================================================+ +| Kerberos administrator | kadmin/admin | Admin@123 | Used to add, delete, modify, and query user accounts on Kerberos. | For details, see :ref:`Changing the Password for the Kerberos Administrator `. | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| OMS Kerberos administrator | kadmin/admin | Admin@123 | Used to add, delete, modify, and query user accounts on OMS Kerberos. | For details, see :ref:`Changing the Password for the OMS Kerberos Administrator `. | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| LDAP administrator | cn=root,dc=hadoop,dc=com | LdapChangeMe@123 | Used to add, delete, modify, and query the user account information on LDAP. | For details, see :ref:`Changing the Passwords of the LDAP Administrator and the LDAP User (Including OMS LDAP) `. 
| ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| OMS LDAP administrator | cn=root,dc=hadoop,dc=com | LdapChangeMe@123 | Used to add, delete, modify, and query the user account information on OMS LDAP. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| LDAP user | cn=pg_search_dn,ou=Users,dc=hadoop,dc=com | Randomly generated by the system | Used to query information about users and user groups on LDAP. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| OMS LDAP user | cn=pg_search_dn,ou=Users,dc=hadoop,dc=com | Randomly generated by the system | Used to query information about users and user groups on OMS LDAP. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| LDAP administrator account | cn=krbkdc,ou=Users,dc=hadoop,dc=com | LdapChangeMe@123 | Used to query Kerberos component authentication account information. | For details, see :ref:`Changing the Password for the LDAP Administrator `. | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | cn=krbadmin,ou=Users,dc=hadoop,dc=com | LdapChangeMe@123 | Used to add, delete, modify, and query Kerberos component authentication account information. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| Component running user | hdfs | Hdfs@123 | This user is the HDFS system administrator and has the following permissions: | For details, see :ref:`Changing the Password for a Component Running User `. | +| | | | | | +| | | | #. File system operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. | | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | - Views and sets disk quotas for users. | | +| | | | | | +| | | | #. HDFS management operation permissions: | | +| | | | | | +| | | | - Views the web UI status. | | +| | | | - Views and sets the active and standby HDFS status. | | +| | | | - Enters and exits the HDFS in security mode. | | +| | | | - Checks the HDFS file system. | | +| | | | | | +| | | | #. Logs in to the FTP service page. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hbase | Hbase@123 | This user is the HBase and HBase1 to HBase4 system administrator and has the following permissions: | | +| | | | | | +| | | | - Cluster management permission: Performs **Enable** and **Disable** operations on tables to trigger MajorCompact and ACL operations. | | +| | | | - Grants and revokes permissions, and shuts down the cluster. | | +| | | | - Table management permission: Creates, modifies, and deletes tables. | | +| | | | - Data management permission: Reads data in tables, column families, and columns. | | +| | | | - Logs in to the HMaster web UI. | | +| | | | - Logs in to the FTP service page. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | mapred | Mapred@123 | This user is the MapReduce system administrator and has the following permissions: | | +| | | | | | +| | | | - Submits, stops, and views the MapReduce tasks. | | +| | | | - Modifies the Yarn configuration parameters. | | +| | | | - Logs in to the FTP service page. | | +| | | | - Logs in to the Yarn web UI. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | zookeeper | ZooKeeper@123 | This user is the ZooKeeper system administrator and has the following permissions: | | +| | | | | | +| | | | - Adds, deletes, modifies, and queries all nodes in ZooKeeper. | | +| | | | - Modifies and queries quotas of all nodes in ZooKeeper. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | rangeradmin | Rangeradmin@123 | This user has the Ranger system management permissions and user permissions: | | +| | | | | | +| | | | - Ranger web UI management permission | | +| | | | - Management permission of each component that uses Ranger authentication | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | rangerauditor | Rangerauditor@123 | Default audit user of the Ranger system. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive | Hive@123 | This user is the Hive system administrator and has the following permissions: | | +| | | | | | +| | | | #. Hive administrator permissions: | | +| | | | | | +| | | | - Creates, deletes, and modifies a database. | | +| | | | - Creates, queries, modifies, and deletes a table. | | +| | | | - Queries, inserts, and uploads data. | | +| | | | | | +| | | | #. HDFS file operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. | | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | | | +| | | | #. Submits and stops the MapReduce tasks. | | +| | | | #. 
Ranger policy management permission | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive1 | Hive1@123 | This user is the Hive1 system administrator and has the following permissions: | | +| | | | | | +| | | | #. Hive1 administrator permissions: | | +| | | | | | +| | | | - Creates, deletes, and modifies a database. | | +| | | | - Creates, queries, modifies, and deletes a table. | | +| | | | - Queries, inserts, and uploads data. | | +| | | | | | +| | | | #. HDFS file operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. | | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | | | +| | | | #. Submits and stops the MapReduce tasks. | | +| | | | #. Ranger policy management permission | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive2 | Hive2@123 | This user is the Hive2 system administrator and has the following permissions: | | +| | | | | | +| | | | #. Hive2 administrator permissions: | | +| | | | | | +| | | | - Creates, deletes, and modifies a database. | | +| | | | - Creates, queries, modifies, and deletes a table. | | +| | | | - Queries, inserts, and uploads data. | | +| | | | | | +| | | | #. HDFS file operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. | | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | | | +| | | | #. Submits and stops the MapReduce tasks. | | +| | | | #. Ranger policy management permission | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive3 | Hive3@123 | This user is the Hive3 system administrator and has the following permissions: | | +| | | | | | +| | | | #. Hive3 administrator permissions: | | +| | | | | | +| | | | - Creates, deletes, and modifies a database. | | +| | | | - Creates, queries, modifies, and deletes a table. | | +| | | | - Queries, inserts, and uploads data. | | +| | | | | | +| | | | #. HDFS file operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. 
| | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | | | +| | | | #. Submits and stops the MapReduce tasks. | | +| | | | #. Ranger policy management permission | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive4 | Hive4@123 | This user is the Hive4 system administrator and has the following permissions: | | +| | | | | | +| | | | #. Hive4 administrator permissions: | | +| | | | | | +| | | | - Creates, deletes, and modifies a database. | | +| | | | - Creates, queries, modifies, and deletes a table. | | +| | | | - Queries, inserts, and uploads data. | | +| | | | | | +| | | | #. HDFS file operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. | | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | | | +| | | | #. Submits and stops the MapReduce tasks. | | +| | | | #. Ranger policy management permission | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | kafka | Kafka@123 | This user is the Kafka system administrator and has the following permissions: | | +| | | | | | +| | | | - Creates, deletes, produces, and consumes the topic; modifies the topic configuration. | | +| | | | - Controls the cluster metadata, modifies the configuration, migrates the replica, elects the leader, and manages ACL. | | +| | | | - Submits, queries, and deletes the consumer group offset. | | +| | | | - Queries the delegation token. | | +| | | | - Queries and submits the transaction. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | storm | Admin@123 | Storm system administrator | | +| | | | | | +| | | | User permission: Submits Storm tasks. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | rangerusersync | Randomly generated by the system | Synchronizes users and internal users of user groups. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | rangertagsync | Randomly generated by the system | Internal user for synchronizing tags. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | oms/manager | Randomly generated by the system | Controller and NodeAgent authentication user. The user has the permission on the **supergroup** group. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | backup/manager | Randomly generated by the system | User for running backup and restoration tasks. The user has the permission on the **supergroup**, **wheel**, and **ficommon** groups. After cross-system mutual trust is configured, the user has the permission to access data in the HDFS, HBase, Hive, and ZooKeeper systems. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hdfs/hadoop.\ ** | Randomly generated by the system | This user is used to start the HDFS and has the following permissions: | | +| | | | | | +| | | | #. 
File system operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. | | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | - Views and sets disk quotas for users. | | +| | | | | | +| | | | #. HDFS management operation permissions: | | +| | | | | | +| | | | - Views the web UI status. | | +| | | | - Views and sets the active and standby HDFS status. | | +| | | | - Enters and exits the HDFS in security mode. | | +| | | | - Checks the HDFS file system. | | +| | | | | | +| | | | #. Logs in to the FTP service page. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | mapred/hadoop.\ ** | Randomly generated by the system | This user is used to start the MapReduce and has the following permissions: | | +| | | | | | +| | | | - Submits, stops, and views the MapReduce tasks. | | +| | | | - Modifies the Yarn configuration parameters. | | +| | | | - Logs in to the FTP service page. | | +| | | | - Logs in to the Yarn web UI. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | mr_zk/hadoop.\ ** | Randomly generated by the system | Used for MapReduce to access ZooKeeper. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hbase/hadoop.\ ** | Randomly generated by the system | User for the authentication between internal components during the HBase system startup. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hbase/zkclient.\ ** | Randomly generated by the system | User for HBase to perform ZooKeeper authentication in a security mode cluster. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | thrift/hadoop.\ ** | Randomly generated by the system | ThriftServer system startup user. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | thrift/** | Randomly generated by the system | User for the ThriftServer system to access HBase. This user has the read, write, execution, creation, and administration permission on all NameSpaces and tables of HBase. ** indicates the name of the host where the ThriftServer node is installed in the cluster. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive/hadoop.\ ** | Randomly generated by the system | User for the authentication between internal components during the Hive system startup. The user permissions are as follows: | | +| | | | | | +| | | | #. Hive administrator permissions: | | +| | | | | | +| | | | - Creates, deletes, and modifies a database. | | +| | | | - Creates, queries, modifies, and deletes a table. | | +| | | | - Queries, inserts, and uploads data. | | +| | | | | | +| | | | #. HDFS file operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. | | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | | | +| | | | #. Submits and stops the MapReduce tasks. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive1/hadoop.\ ** | Randomly generated by the system | User for the authentication between internal components during the Hive1 system startup. The user permissions are as follows: | | +| | | | | | +| | | | #. 
Hive1 administrator permissions: | | +| | | | | | +| | | | - Creates, deletes, and modifies a database. | | +| | | | - Creates, queries, modifies, and deletes a table. | | +| | | | - Queries, inserts, and uploads data. | | +| | | | | | +| | | | #. HDFS file operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. | | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | | | +| | | | #. Submits and stops the MapReduce tasks. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive2/hadoop.\ ** | Randomly generated by the system | User for the authentication between internal components during the Hive2 system startup. The user permissions are as follows: | | +| | | | | | +| | | | #. Hive2 administrator permissions: | | +| | | | | | +| | | | - Creates, deletes, and modifies a database. | | +| | | | - Creates, queries, modifies, and deletes a table. | | +| | | | - Queries, inserts, and uploads data. | | +| | | | | | +| | | | #. HDFS file operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. | | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | | | +| | | | #. Submits and stops the MapReduce tasks. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive3/hadoop.\ ** | Randomly generated by the system | User for the authentication between internal components during the Hive3 system startup. The user permissions are as follows: | | +| | | | | | +| | | | #. Hive3 administrator permissions: | | +| | | | | | +| | | | - Creates, deletes, and modifies a database. | | +| | | | - Creates, queries, modifies, and deletes a table. | | +| | | | - Queries, inserts, and uploads data. | | +| | | | | | +| | | | #. HDFS file operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. | | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | | | +| | | | #. Submits and stops the MapReduce tasks. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive4/hadoop.\ ** | Randomly generated by the system | User for the authentication between internal components during the Hive4 system startup. The user permissions are as follows: | | +| | | | | | +| | | | #. Hive4 administrator permissions: | | +| | | | | | +| | | | - Creates, deletes, and modifies a database. | | +| | | | - Creates, queries, modifies, and deletes a table. | | +| | | | - Queries, inserts, and uploads data. | | +| | | | | | +| | | | #. HDFS file operation permissions: | | +| | | | | | +| | | | - Views, modifies, and creates files. | | +| | | | - Views and creates directories. | | +| | | | - Views and modifies the groups where files belong. | | +| | | | | | +| | | | #. Submits and stops the MapReduce tasks. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | loader/hadoop.\ ** | Randomly generated by the system | User for Loader system startup and Kerberos authentication | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | HTTP/** | Randomly generated by the system | Used to connect to the HTTP interface of each component. ** indicates the host name of a node in the cluster. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hue | Randomly generated by the system | User for Hue system startup, Kerberos authentication, and HDFS and Hive access | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | flume | Randomly generated by the system | User for Flume system startup and HDFS and Kafka access. The user has read and write permission of the HDFS directory **/flume**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | flume_server | Randomly generated by the system | User for Flume system startup and HDFS and Kafka access. The user has read and write permission of the HDFS directory **/flume**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x/hadoop.\ ** | Randomly generated by the system | This user is the Spark2x system administrator and has the following user permissions: | | +| | | | | | +| | | | 1. Starts the Spark2x service. | | +| | | | | | +| | | | 2. Submits Spark2x tasks. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark_zk/hadoop.\ ** | Randomly generated by the system | Used for Spark2x to access ZooKeeper. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x1/hadoop.\ ** | Randomly generated by the system | This user is the Spark2x1 system administrator and has the following user permissions: | | +| | | | | | +| | | | #. Starts the Spark2x1 service. | | +| | | | #. Submits Spark2x tasks. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x2/hadoop.\ ** | Randomly generated by the system | This user is the Spark2x2 system administrator and has the following user permissions: | | +| | | | | | +| | | | #. Starts the Spark2x2 service. | | +| | | | #. Submits Spark2x tasks. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x3/hadoop.\ ** | Randomly generated by the system | This user is the Spark2x3 system administrator and has the following user permissions: | | +| | | | | | +| | | | #. Starts the Spark2x3 service. | | +| | | | #. Submits Spark2x tasks. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x4/hadoop.\ ** | Randomly generated by the system | This user is the Spark2x4 system administrator and has the following user permissions: | | +| | | | | | +| | | | #. Starts the Spark2x4 service. | | +| | | | #. Submits Spark2x tasks. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | zookeeper/hadoop.\ ** | Randomly generated by the system | ZooKeeper system startup user. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | zkcli/hadoop.\ ** | Randomly generated by the system | ZooKeeper server login user. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | oozie | Randomly generated by the system | User for Oozie system startup and Kerberos authentication. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | kafka/hadoop.\ ** | Randomly generated by the system | Used for security authentication of Kafka. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | storm/hadoop.\ ** | Randomly generated by the system | Storm system startup user. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | storm_zk/hadoop.\ ** | Randomly generated by the system | Used for the Worker process to access ZooKeeper. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | flink/hadoop.\ ** | Randomly generated by the system | Internal user of the Flink service. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | check_ker_M | Randomly generated by the system | User who performs a system internal test about whether the Kerberos service is normal. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | tez | Randomly generated by the system | User for TezUI system startup, Kerberos authentication, and access to Yarn | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | K/M | Randomly generated by the system | Kerberos internal functional user. This user cannot be deleted, and its password cannot be changed. This internal account can only be used on nodes where Kerberos service is installed. 
| None | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | kadmin/changepw | Randomly generated by the system | | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | kadmin/history | Randomly generated by the system | | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | krbtgt\ ** | Randomly generated by the system | | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| LDAP user | admin | None | FusionInsight Manager administrator. | The LDAP user cannot log in to the system, and the password cannot be changed. | +| | | | | | +| | | | The primary group is **compcommon**, which does not have the group permission but has the permission of the **Manager_administrator** role. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | backup | | The primary group is **compcommon**. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | backup/manager | | The primary group is **compcommon**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | oms | | The primary group is **compcommon**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | oms/manager | | The primary group is **compcommon**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | clientregister | | The primary group is **compcommon**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | zookeeper | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | zookeeper/hadoop.\ ** | | The primary group is **hadoop**. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | zkcli | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | zkcli/hadoop.<*System domain name*> | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | flume | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | flume_server | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hdfs | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hdfs/hadoop.\ ** | | The primary group is **hadoop**. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | mapred | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | mapred/hadoop.\ ** | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | mr_zk | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | mr_zk/hadoop.\ ** | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hue | | The primary group is **supergroup**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive | | The primary group is **hive**. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive/hadoop.\ ** | | The primary group is **hive**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive1 | | The primary group is **hive1**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive1/hadoop.\ ** | | The primary group is **hive1**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive2 | | The primary group is **hive2**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive2/hadoop.\ ** | | The primary group is **hive2**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive3 | | The primary group is **hive3**. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive3/hadoop.\ ** | | The primary group is **hive3**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive4 | | The primary group is **hive4**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hive4/hadoop.\ ** | | The primary group is **hive4**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hbase | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hbase/hadoop.\ ** | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | thrift | | The primary group is **hadoop**. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | thrift/hadoop.\ ** | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | oozie | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | hbase/zkclient.\ ** | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | loader | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | loader/hadoop.\ ** | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x | | The primary group is **hadoop**. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x/hadoop.\ ** | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark_zk | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x1 | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x1/hadoop.\ ** | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x2 | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x2/hadoop.\ ** | | The primary group is **hadoop**. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x4 | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | spark2x4/hadoop.\ ** | | The primary group is **hadoop**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | kafka | | The primary group is **kafkaadmin**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | kafka/hadoop.\ ** | | The primary group is **kafkaadmin**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | storm | | The primary group is **stormadmin**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | storm/hadoop.\ ** | | The primary group is **stormadmin**. 
| | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | storm_zk | | The primary group is **storm**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | storm_zk/hadoop.\ ** | | The primary group is **storm**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | kms/hadoop | | The primary group is **kmsadmin**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | knox | | The primary group is **compcommon**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ +| | executor | | The primary group is **compcommon**. | | ++----------------------------+-------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ + +.. 
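The machine-machine usernames in the preceding table that contain a slash take the form <component user>/hadoop.<system domain name>, and the system domain name part is always written in lowercase. As a minimal illustration only (the helper function below is hypothetical, not a product interface), the following sketch derives such a username from the cluster's **Local Domain** value; the domain value reuses the example given in the note that follows:

.. code-block:: python

   # Minimal sketch: builds the username of an internal machine-machine user
   # from the cluster's Local Domain, which appears in lowercase in usernames.
   def internal_username(component_user, local_domain):
       return "{}/hadoop.{}".format(component_user, local_domain.lower())

   # Example Local Domain value taken from the note below.
   print(internal_username("hdfs", "9427068F-6EFA-4833-B43E-60CB641E5B6C.COM"))
   # Output: hdfs/hadoop.9427068f-6efa-4833-b43e-60cb641e5b6c.com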
note:: + + Log in to FusionInsight Manager, choose **System** > **Permission** > **Domain and Mutual Trust**, and check the value of **Local Domain**. In the preceding table, all letters in the system domain name contained in the username of the system internal user are lowercase letters. + + For example, if **Local Domain** is set to **9427068F-6EFA-4833-B43E-60CB641E5B6C.COM**, the username of default HDFS startup user is **hdfs/hadoop.9427068f-6efa-4833-b43e-60cb641e5b6c.com**. + +Database Users +-------------- + +The system database users include OMS database users and DBService database users. + ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| Database Type | Default User | Initial Password | Description | Password Change Method | ++====================+==============+===================+============================================================================================================================+==================================================================================================================+ +| OMS database | ommdba | dbChangeMe@123456 | OMS database administrator who performs maintenance operations, such as creating, starting, and stopping. | For details, see :ref:`Changing the Password of the OMS Database Administrator `. | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | omm | ChangeMe@123456 | User for accessing OMS database data | For details, see :ref:`Changing the Password for the Data Access User of the OMS Database `. | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| DBService database | omm | dbserverAdmin@123 | Administrator of the GaussDB database in the DBService component | For details, see :ref:`Changing the Password for a Component Database User `. | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | hive | HiveUser@ | User for Hive to connect to the DBService database **hivemeta**. | | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | hive1 | HiveUser@ | User for Hive1 to connect to the DBService database **hivemeta1**. 
| | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | hive2 | HiveUser@ | User for Hive2 to connect to the DBService database **hivemeta2**. | | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | hive3 | HiveUser@ | User for Hive3 to connect to the DBService database **hivemeta3**. | | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | hive4 | HiveUser@ | User for Hive4 to connect to the DBService database **hivemeta4**. | | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | hive\ *NN* | HiveUser@ | User for **Hive-**\ *N* to connect to the DBService database **hive**\ *N*\ **meta** when multiple services are installed. | | +| | | | | | +| | | | For example, the user for **Hive-1** to connect to the DBService database **hive1meta** is **hive11**. | | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | hue | HueUser@123 | User for Hue to connect to the DBService database **hue**. | | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | sqoop | SqoopUser@ | User for Loader to connect to the DBService database **sqoop**. | | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | sqoop\ *N* | SqoopUser@ | User for **Loader-**\ *N* to connect to the DBService database **sqoop**\ *N* when multiple services are installed. | | +| | | | | | +| | | | For example, the user for **Loader-1** to connect to the DBService database **sqoop1** is **sqoop1**. | | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | oozie | OozieUser@ | User for Oozie to connect to the DBService database **oozie**. 
| | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | oozie\ *N* | OozieUser@ | User for **Oozie-**\ *N* to connect to the DBService database **oozie**\ *N* when multiple services are installed. | | +| | | | | | +| | | | For example, the user for **Oozie-1** to connect to the DBService database **oozie1** is **oozie1**. | | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ +| | rangeradmin | Admin12! | User for Ranger to connect to the DBService database **ranger**. | | ++--------------------+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_statement.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_statement.rst new file mode 100644 index 0000000..6c6fa0c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/security_management/security_statement.rst @@ -0,0 +1,29 @@ +:original_name: admin_guide_000315.html + +.. _admin_guide_000315: + +Security Statement +================== + +JDK Usage Statement +------------------- + +An MRS cluster is a big data cluster that provides users with distributed data analysis and computing capabilities. The built-in JDK of MRS is OpenJDK, which is used in the following scenarios: + +- Platform service running and maintenance +- Linux client operations, including service submission and application O&M + +JDK Risk Description +-------------------- + +The system performs permission control on the built-in JDK. Only users in the related group of the FusionInsight platform can access the JDK. In addition, the platform is deployed on a customer's intranet. Therefore, the security risk is low. + +JDK Hardening +------------- + +For details about how to harden the JDK, see "Hardening JDK" in :ref:`Hardening Policies `. + +Public IP Addresses in Hue +-------------------------- + +Hue uses the test cases of third-party packages, such as **ipaddress**, **requests**, and **Django**, and these test cases reference public IP addresses in their comments. However, these public IP addresses are not involved when Hue provides services, and they do not appear in the Hue configuration file. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/component_management/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/component_management/index.rst new file mode 100644 index 0000000..a2b6914 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/component_management/index.rst @@ -0,0 +1,14 @@ +:original_name: admin_guide_000164.html + +..
_admin_guide_000164: + +Component Management +==================== + +- :ref:`Viewing Component Packages ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + viewing_component_packages diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/component_management/viewing_component_packages.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/component_management/viewing_component_packages.rst new file mode 100644 index 0000000..48c73c9 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/component_management/viewing_component_packages.rst @@ -0,0 +1,25 @@ +:original_name: admin_guide_000165.html + +.. _admin_guide_000165: + +Viewing Component Packages +========================== + +Scenario +-------- + +A complete MRS cluster consists of multiple component packages. Before installing some services on FusionInsight Manager, check whether the component packages of those services have been installed. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **System** > **Component**. +#. On the **Installed Component** page, view all components. + + .. note:: + + In the **Platform Type** column, you can view the registered OS and platform type of the component. + +#. Click |image1| on the left of a component name to view the services and version numbers contained in the component. + +.. |image1| image:: /_static/images/en-us_image_0263899291.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/configuring_monitoring_metric_dumping.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/configuring_monitoring_metric_dumping.rst new file mode 100644 index 0000000..c8a7402 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/configuring_monitoring_metric_dumping.rst @@ -0,0 +1,142 @@ +:original_name: admin_guide_000156.html + +.. _admin_guide_000156: + +Configuring Monitoring Metric Dumping +===================================== + +Scenario +-------- + +The monitoring data reporting function writes the monitoring data collected in the system into a text file and uploads the file to a specified server in FTP or SFTP mode. + +Before using this function, you need to perform related configurations on FusionInsight Manager. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **System** > **Interconnection** > **Upload Performance Data**. + +#. Toggle on **Upload Performance Data**. + + The performance data upload service is disabled by default. |image1| indicates that the service is enabled. + +#. Set the upload parameters according to :ref:`Table 1 `. + + .. _admin_guide_000156__table36700465: + + .. 
table:: **Table 1** Upload parameters + + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=========================+===============================================================================================================================================================================================================================+ + | FTP IP Address Mode | Specifies the server IP address mode. This parameter is mandatory. The value can be **IPV4** or **IPV6**. | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FTP IP Address | Specifies the IP address of the FTP server for storing monitoring files after the monitoring metric data is interconnected. This parameter is mandatory. | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FTP Port | Specifies the port for connecting to the FTP server. This parameter is mandatory. | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FTP Username | Specifies the username for logging in to the FTP server. This parameter is mandatory. | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FTP Password | Specifies the password for logging in to the FTP server. This parameter is mandatory. | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Save Path | Specifies the path for storing monitoring files on the FTP server. This parameter is mandatory. | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Dump Interval (second) | Specifies the interval at which monitoring files are periodically stored on the FTP server, in seconds. This parameter is mandatory. | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Dump Mode | Specifies the protocol used for sending monitoring files. This parameter is mandatory. The value can be **SFTP** or **FTP**. You are advised to use the SFTP mode based on SSH v2. Otherwise, security risks may be incurred. 
| + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SFTP Service Public Key | Specifies the public key of the FTP server. This parameter is optional. It is valid only when **Dump Mode** is set to **SFTP**. | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**. + + .. note:: + + If the dump mode is SFTP and the public key of the SFTP service is empty, the system displays a security risk warning. You need to evaluate the security risk and then save the configuration. + +Data Format +----------- + +After the configuration is complete, the monitoring data reporting function periodically writes monitoring data in the cluster to text files and reports the files to the corresponding FTP/SFTP service based on the configured reporting period. + +- Principles for generating monitoring files + + - The monitoring metrics are written to files generated every 30, 60, and 300 seconds based on the metric collection period. + + 30s: real-time metrics that are collected every 30s by default + + 60s: real-time metrics that are collected every 60s by default + + 300s: all metrics that are not collected every 30s or 60s + + - File name format: *metric_{Interval}_{File creation time YYYYMMDDHHMMSS}.log* + + Example: **metric_60_20160908085915.log** + + **metric_300_20160908085613.log** + +- Monitoring file content + + - Format of monitoring files: + + "Cluster ID|Cluster name|Displayed name|Service name|Metric ID|Collection time|Collection host@m@Sub-metric|Unit|Metric value", where fields are separated using vertical bars (|). For example: + + .. code-block:: + + 1|xx1|Host|Host|10000413|2019/06/18 10:05:00|189-66-254-146|KB/s|309.910 + 1|xx1|Host|Host|10000413|2019/06/18 10:05:00|189-66-254-152|KB/s|72.870 + 2|xx2|Host|Host|10000413|2019/06/18 10:05:00|189-66-254-163|KB/s|100.650 + + Note: The actual files are not in that format. + + - Interval for uploading monitoring files: + + The interval for uploading monitoring files can be set using the **Dump Interval (second)** parameter on the page. Currently, the interval can range from **30** to **300**. After the configuration is complete, the system periodically uploads files to the corresponding FTP/SFTP server at the specified interval. + +- Monitoring metric description file + + - Metric set file + + The metric set file **all-shown-metric-zh_CN** contains detailed information about all metrics. After obtaining the metric IDs from the files reported by the third-party system, you can query details about the metrics from the metric set file. + + Location of the metric set file: + + Active and standby OMS nodes: {*FusionInsight installation path*} **/om-server/om/etc/om/all-shown-metric-zh_CN** + + Content of the metric set file: + + .. 
code-block:: + + Real-Time Metric ID,5-Minute Metric ID,Metric Name,Metric Collection Period (s),Collected by Default,Service Belonged To,Role Belonged To + 00101,10000101,JobHistoryServer non-heap memory usage,30,false,Mapreduce,JobHistoryServer + 00102,10000102,JobHistoryServer non-heap memory allocation volume,30,false,Mapreduce,JobHistoryServer + 00103,10000103,JobHistoryServer heap memory usage,30,false,Mapreduce,JobHistoryServer + 00104,10000104,JobHistoryServer heap memory allocation volume,30,false,Mapreduce,JobHistoryServer + 00105,10000105,Number of blocked threads,30,false,Mapreduce,JobHistoryServer + 00106,10000106,Number of running threads,30,false,Mapreduce,JobHistoryServer + 00107,10000107,GC time,30,false,Mapreduce,JobHistoryServer + 00110,10000110,JobHistoryServer CPU usage,30,false,Mapreduce,JobHistoryServer + ... + + - Field description of critical metrics + + **Real-Time Metric ID**: indicates the ID of the metric whose collection period is 30s or 60s. + + **5-Minute Metric ID**: indicates the ID of a 5-minute (300s) metric. + + **Metric Collection Period (s)**: indicates the collection period of real-time metrics. The value can be **30** or **60**. + + **Service Belonged To**: indicates the name of the service to which a metric belongs, for example, HDFS and HBase. + + **Role Belonged To**: indicates the name of the role to which a metric belongs, for example, JobServer and RegionServer. + + - Description + + For metrics whose collection period is 30s/60s, you can find the corresponding metric description by referring to the first column, that is, **Real-Time Metric ID**. + + For metrics whose collection period is 300s, you can find the corresponding metric description by referring to the second column, that is, **5-Minute Metric ID**. + +.. |image1| image:: /_static/images/en-us_image_0263899496.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/configuring_snmp_northbound_parameters.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/configuring_snmp_northbound_parameters.rst new file mode 100644 index 0000000..6f5b2fe --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/configuring_snmp_northbound_parameters.rst @@ -0,0 +1,82 @@ +:original_name: admin_guide_000154.html + +.. _admin_guide_000154: + +Configuring SNMP Northbound Parameters +====================================== + +Scenario +-------- + +If users need to view alarms and monitoring data of a cluster on the O&M platform, you can use Simple Network Management Protocol (SNMP) on FusionInsight Manager to report related data to the network management system (NMS). + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **System** > **Interconnection** > **SNMP**. + +#. Toggle on **SNMP Service**. + + The SNMP service is disabled by default. |image1| indicates that the service is enabled. + +#. Set interconnection parameters according to :ref:`Table 1 `. + + .. _admin_guide_000154__en-us_topic_0046736864_tab01: + + .. 
table:: **Table 1** Interconnection parameters + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+================================================================================================================================+ + | Version | Specifies the version of SNMP, which can be: | + | | | + | | - **V2C**: This is an earlier version with low security. | + | | - **V3**: This is a later version with higher security than SNMP V2C. | + | | | + | | SNMP V3 is recommended. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + | Local Port | Specifies the local port. The default value is **20000**. The value ranges from **1025** to **65535**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + | Read Community Name | Specifies the read-only community name. This parameter is available only when **Version** is set to **V2C**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + | Write Community Name | Specifies the write community name. This parameter is available only when **Version** is set to **V2C**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + | Security Username | Specifies the SNMP security username. This parameter is available only when **Version** is set to **V3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + | Authentication Protocol | Specifies the authentication protocol. This parameter is available only when **Version** is set to **V3**. SHA is recommended. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + | Authentication Password | Specifies the authentication password. This parameter is available only when **Version** is set to **V3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + | Confirm Password | Used to confirm the authentication password. This parameter is available only when **Version** is set to **V3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + | Encryption Protocol | Specifies the encryption protocol. This parameter is available only when **Version** is set to **V3**. AES256 is recommended. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + | Encryption Password | Specifies the encryption password. This parameter is available only when **Version** is set to **V3**. 
| + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + | Confirm Password | Used to confirm the encryption password. This parameter is available only when **Version** is set to **V3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + - The value of **Security Username** cannot contain repeated strings with the unit length as a common factor of 64 (such as 1, 2, 4, and 8), for example, **abab** and **abcdabcd**. + - The **Authentication Password** and **Encryption Password** must contain 8 to 16 characters, including at least three types of the following characters: uppercase letters, lowercase letters, digits, and special characters. The two passwords must be different. The two passwords cannot be the same as the security username or the reverse of the security username. + - For security purposes, periodically change the authentication password and encryption password when the SNMP protocol is used. + - If SNMP v3 is used, a security user will be locked after five consecutive authentication failures within 5 minutes. The user will be automatically unlocked 5 minutes later. + +#. Click **Create Trap Target** in the **Trap Target** area. In the displayed dialog box, set the following parameters: + + - **Target Symbol**: specifies the trap target ID, which is the ID of the NMS or host that receives traps. The value consists of 1 to 255 characters, including letters or digits. + - **Target IP Address Mode**: specifies the mode of the target IP address. The value can be **IPv4** or **IPv6**. + - **Target IP Address**: specifies the target IP address, which can communicate with the management plane IP address of the management node. + - **Target Port**: specifies the port receiving traps. The port number must be consistent with the peer end and ranges from 0 to 65535. + - **Trap Community Name**: This parameter is available only when **Version** is set to **V2C** and is used to report the community name. + + Click **OK**. + + The **Create Trap Target** dialog box is closed. + +#. Click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0263899496.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/configuring_syslog_northbound_parameters.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/configuring_syslog_northbound_parameters.rst new file mode 100644 index 0000000..27f1e85 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/configuring_syslog_northbound_parameters.rst @@ -0,0 +1,187 @@ +:original_name: admin_guide_000155.html + +.. _admin_guide_000155: + +Configuring Syslog Northbound Parameters +======================================== + +Scenario +-------- + +If users need to view alarms and events of a cluster on the unified alarm reporting platform, you can use the Syslog protocol on FusionInsight Manager to report related data to the alarm platform. + +.. important:: + + If the Syslog protocol is not encrypted, data may be stolen. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **System** > **Interconnection** > **Syslog**. + +#. Toggle on **Syslog Service**. 
+ + The Syslog service is disabled by default. |image1| indicates that the service is enabled. + +#. Set northbound parameters according to :ref:`Table 1 `. + + .. _admin_guide_000155__tba2589a9e61145faacec7d81c6eb3235: + + .. table:: **Table 1** Syslog interconnection parameters + + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter Area | Parameter | Description | + +===========================+====================================+=================================================================================================================================================================================================================================================================================================================+ + | Syslog Protocol | Server IP Address Mode | Specifies the IP address mode of the interconnected server. The value can be **IPV4** or **IPV6**. | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Server IP Address | Specifies the IP address of the interconnected server. | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Server Port | Specifies the port number for interconnection. | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Protocol | Specifies the protocol type. The options are as follows: | + | | | | + | | | - **TCP** | + | | | - **UDP** | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Severity Level | Specifies the severity of the reported message. The options are as follows: | + | | | | + | | | - **Emergency** | + | | | - **Alert** | + | | | - **Critical** | + | | | - **Error** | + | | | - **Warning** | + | | | - **Notice** | + | | | - **Informational** (default value) | + | | | - **Debug** | + | | | | + | | | .. note:: | + | | | | + | | | **Severity Level** and **Facility** determine the priority of the sent message. 
| + | | | | + | | | **Priority** = **Facility** x 8 + **Severity Level** | + | | | | + | | | For details about the values of **Severity Level** and **Facility**, see :ref:`Table 2 `. | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Facility | Specifies the module where the log is generated. For details about the available values of this parameter, see :ref:`Table 2 `. Default value **local use 0 (local0)** is recommended. | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Identifier | Specifies the product ID. The default value is **FusionInsight Manager**. | + | | | | + | | | The identifier can contain a maximum of 256 characters, including letters, digits, underscores (_), periods (.), hyphens (-), spaces, and the following special characters: \| $ { } | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Report Message | Report Format | Specifies the message format of the alarm report. For details, see the help information on the page. | + | | | | + | | | The report format can contain a maximum of 1024 characters, including letters, digits, underscores (_), periods (.), hyphens (-), spaces, and the following special characters: \| $ { } | + | | | | + | | | .. note:: | + | | | | + | | | For details about each field in the report format, see :ref:`Table 3 `. | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Alarm Type | Specifies the type of the alarm to be reported. | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Alarm Severities | Specifies the level of the alarm to be reported. 
| + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Uncleared Alarm Reporting | Periodic Uncleared Alarm Reporting | Specifies whether to report uncleared alarms in a specified period. You can toggle on or off the function. The function is toggled off by default. | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Report Interval (min) | Specifies the interval for periodically reporting uncleared alarms. This parameter is valid only when **Periodic Uncleared Alarm Reporting** is enabled. The default value is **15**, in minutes. The value ranges from **5** to **1440** (one day). | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Heartbeat Settings | Heartbeat Reporting | Specifies whether to periodically report Syslog heartbeat messages. You can toggle on or off the function. The function is toggled off by default. | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Heartbeat Interval (minutes) | Specifies the interval for periodically reporting heartbeat messages. This parameter is valid only when **Heartbeat Reporting** is enabled. The default value is **15**, in minutes. The value ranges from **1** to **60**. | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Heartbeat Packet | Specifies the heartbeat message to be reported. This parameter is valid when **Heartbeat Reporting** is toggled on and cannot be left blank. The value can contain a maximum of 256 characters, including digits, letters, underscores (_), vertical bars (|), colons (:), spaces, commas (,), and periods (.). | + +---------------------------+------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. 
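The priority of each reported message is determined by the configured **Facility** and **Severity Level**, as noted in Table 1 (**Priority** = **Facility** x 8 + **Severity Level**). The following minimal sketch (for illustration only, not part of the product) computes this value from the numeric codes listed in Table 2; in standard Syslog, the result is what appears as the <PRI> field at the start of a message:

.. code-block:: python

   # Minimal sketch: computes the Syslog priority from the numeric codes in Table 2,
   # using the formula from Table 1: Priority = Facility x 8 + Severity Level.
   def syslog_priority(facility_code, severity_code):
       return facility_code * 8 + severity_code

   # Recommended defaults: Facility "local use 0 (local0)" = 16, Severity Level "Informational" = 6.
   print(syslog_priority(16, 6))
   # Output: 134 (carried as <134> in a standard Syslog message header)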
note:: + + After the periodic heartbeat packet function is enabled, packet reporting may be interrupted while the cluster performs automatic fault recovery (for example, during an active/standby OMS switchover). In this case, wait for the recovery to complete. + +#. Click **OK**. + +Related Information +------------------- + +.. _admin_guide_000155__t05bc8524f4804c44982ceceae14e8388: + +.. table:: **Table 2** Numeric codes of **Severity Level** and **Facility** + + ============== ======================================== ============ + Severity Level Facility Numeric Code + ============== ======================================== ============ + **Emergency** kernel messages 0 + **Alert** user-level messages 1 + **Critical** mail system 2 + **Error** system daemons 3 + **Warning** security/authorization messages (note 1) 4 + **Notice** messages generated internally by syslog 5 + **Informational** line printer subsystem 6 + **Debug** network news subsystem 7 + ``-`` UUCP subsystem 8 + ``-`` clock daemon (note 2) 9 + ``-`` security/authorization messages (note 1) 10 + ``-`` FTP daemon 11 + ``-`` NTP subsystem 12 + ``-`` log audit (note 1) 13 + ``-`` log alert (note 1) 14 + ``-`` clock daemon (note 2) 15 + ``-`` local use 0~7 (local0 ~ local7) 16 to 23 + ============== ======================================== ============ + + For example, an alarm reported with **Facility** set to **local use 0 (local0)** (numeric code 16) and **Severity Level** set to **Warning** (numeric code 4) has **Priority** = 16 x 8 + 4 = 132. + +.. _admin_guide_000155__t614aa61f080f47f0ba68c57aa68e7dc1: + +.. table:: **Table 3** Report format information fields + + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | Information Field | Description | + +===================================+============================================================================================+ + | dn | Cluster name | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | id | Alarm ID | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | name | Alarm name | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | serialNo | Alarm serial number | + | | | + | | .. note:: | + | | | + | | The serial numbers of the fault alarms and the corresponding clear alarms are the same. | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | category | Alarm type. The options are as follows: | + | | | + | | - **0**: fault alarm | + | | - **1**: clear alarm | + | | - **2**: event | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | occurTime | Time when the alarm was generated | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | clearTime | Time when this alarm was cleared | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | isAutoClear | Whether an alarm is automatically cleared.
The options are as follows: | + | | | + | | - **1**: yes | + | | - **0**: no | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | locationInfo | Location where the alarm was generated | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | clearType | Alarm clearance type. The options are as follows: | + | | | + | | - **-1**: not cleared | + | | - **0**: automatically cleared | + | | - **2**: manually cleared | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | level | Severity. The options are as follows: | + | | | + | | - **1**: critical alarm | + | | - **2**: major alarm | + | | - **3**: minor alarm | + | | - **4**: warning alarm | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | cause | Alarm cause | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | additionalInfo | Additional information | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + | object | Alarm object | + +-----------------------------------+--------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0263899496.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/index.rst new file mode 100644 index 0000000..b6c30e8 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_interconnections/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000153.html + +.. _admin_guide_000153: + +Configuring Interconnections +============================ + +- :ref:`Configuring SNMP Northbound Parameters ` +- :ref:`Configuring Syslog Northbound Parameters ` +- :ref:`Configuring Monitoring Metric Dumping ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + configuring_snmp_northbound_parameters + configuring_syslog_northbound_parameters + configuring_monitoring_metric_dumping diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/index.rst new file mode 100644 index 0000000..ce63371 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/index.rst @@ -0,0 +1,20 @@ +:original_name: admin_guide_000135.html + +.. _admin_guide_000135: + +Configuring Permissions +======================= + +- :ref:`Managing Users ` +- :ref:`Managing User Groups ` +- :ref:`Managing Roles ` +- :ref:`Security Policies ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + managing_users/index + managing_user_groups + managing_roles + security_policies/index diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_roles.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_roles.rst new file mode 100644 index 0000000..5e2ffe3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_roles.rst @@ -0,0 +1,94 @@ +:original_name: admin_guide_000148.html + +.. _admin_guide_000148: + +Managing Roles +============== + +Scenario +-------- + +FusionInsight Manager supports a maximum of 5000 roles (including system built-in roles but excluding roles automatically created by tenants). Based on different service requirements, you need to create and manage different roles on FusionInsight Manager and perform authorization management for FusionInsight Manager and components using roles. + +Prerequisites +------------- + +- You have learned service requirements. +- You have logged in to FusionInsight Manager. + +.. _admin_guide_000148__section2095713912713: + +Creating a Role +--------------- + +#. Choose **System** > **Permission** > **Role**. + +#. On the displayed page, click **Create Role** and fill in **Role Name** and **Description**. + + The role name consists of 3 to 50 characters, including digits, letters, and underscores (_). It cannot be the same as an existing role name in the system. + +#. In the **Configure Resource Permission** area, click the cluster whose permissions are to be added and select service permissions for the role. + + When setting permissions for a component, enter a resource name in the search text box in the upper right corner and click the search icon to view the search result. + + The search result contains only directories, but not subdirectories. Search by keyword supports fuzzy match and is case-insensitive. + + .. note:: + + - For components (except HDFS and Yarn) for which Ranger authorization has been enabled, the permissions of non-default roles on Manager do not take effect. You need to configure Ranger policies to assign permissions to user groups. + - If the resource requests of HDFS and Yarn are beyond the Ranger policies, the ACL rules of the components still take effect. + - A maximum of 1000 permissions can be set for a component at a time. + +#. Click **OK**. + +Modifying Role Information +-------------------------- + +Locate the row that contains the target role and click **Modify**. + +Exporting Role Information +-------------------------- + +Click **Export All** to export all role information at a time in **TXT** or **CSV** format. + +The exported role information contains the role name, description, and whether the role is the default role. + +Deleting a Role +--------------- + +Locate the row that contains the target role and click **Delete**. To delete multiple roles in batches, select the target roles and click **Delete** above the role list. A role bound to a user cannot be deleted. To delete such a role, disassociate the role from the user by modifying the user first. + +Task Example (Creating a Manager Role) +-------------------------------------- + +#. Choose **System** > **Permission** > **Role**. + +#. On the displayed page, click **Create Role** and fill in **Role Name** and **Description**. + +#. 
In the **Configure Resource Permission** area, click **Manager** and set permissions for the role. + + Manager permissions: + + - Cluster + + - **view** permission: permission to view information on the **Cluster** page and view alarms and events under **O&M** > **Alarm**. + - **management** permission: permission for management on the **Cluster** and **O&M** pages. + + - User + + - **view** permission: permission to view information on pages under **System** > **Permission**. + - **management** permission: permission for management on pages under **System** > **Permission**. + + - Audit + + **management** permission: permission for management on the **Audit** page. + + - Tenant + + **management** permission: permission for management on the **Tenant** page and permission to view alarms and events under **O&M** > **Alarm**. + + - System + + **management** permission: permission for management on all pages except those under **Permission** on the **System** page and permission to view alarms and events under **O&M** > **Alarm**. + +#. Click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_user_groups.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_user_groups.rst new file mode 100644 index 0000000..5bba0f0 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_user_groups.rst @@ -0,0 +1,65 @@ +:original_name: admin_guide_000147.html + +.. _admin_guide_000147: + +Managing User Groups +==================== + +Scenario +-------- + +FusionInsight Manager supports a maximum of 5000 user groups (including built-in user groups). You can create and manage different user groups based on service scenarios on FusionInsight Manager. A user group is bound to a role to obtain operation permissions. After a user is added to a user group, the user can obtain the operation permissions of the user group. A user group can be used to classify users and manage multiple users. + +Prerequisites +------------- + +- You have learned service requirements and created roles required by service scenarios. +- You have logged in to FusionInsight Manager. + +.. _admin_guide_000147__section205453863818: + +Creating a User Group +--------------------- + +#. Choose **System** > **Permission** > **User Group**. + +#. Above the user group list, click **Create User Group**. + +#. Set **Group Name** and **Description**. + + The group name contains 1 to 64 characters, including case-insensitive letters, digits, underscores (_), hyphens (-), and spaces. It cannot be the same as an existing user group name in the system. + +#. In the **Role** area, click **Add** to select a role and add it. + + .. note:: + + - For components (except HDFS and Yarn) for which Ranger authorization has been enabled, the permissions of non-default roles on Manager do not take effect. You need to configure Ranger policies to assign permissions to user groups. + - If the resource requests of HDFS and Yarn are beyond the Ranger policies, the ACL rules of the components still take effect. + +#. In the **User** area, click **Add** to select a user and add it. + +#. Click **OK**. + + The user group is created. + +Viewing User Group Information +------------------------------ + +By default, all user groups are displayed in the user group list. 
You can click the arrow on the left of a user group name to view details about the user group, including the user quantity, specific users, and bound roles of the user group. + +Modifying Information About a User Group +---------------------------------------- + +Locate the row that contains the target user group, and click **Modify** to modify its information. + +Exporting Information About a User Group +---------------------------------------- + +Click **Export All** to export all user group information at a time in **TXT** or **CSV** format. + +The exported user group information contains the user group name, description, user list, and role list. + +Deleting a User Group +--------------------- + +Locate the row that contains the target user group, and click **Delete**. To delete multiple user groups in batches, select the target user groups and click **Delete** above the user group list. A user group that contains users cannot be deleted. To delete such a user group, delete all its users by modifying the user group first. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/changing_a_user_password.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/changing_a_user_password.rst new file mode 100644 index 0000000..d5060b2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/changing_a_user_password.rst @@ -0,0 +1,74 @@ +:original_name: admin_guide_000143.html + +.. _admin_guide_000143: + +Changing a User Password +======================== + +Scenario +-------- + +For security purposes, the password of a human-machine user must be changed periodically. + +If users have the permission to use FusionInsight Manager, they can change their passwords on FusionInsight Manager. + +If users do not have the permission to use FusionInsight Manager, they can change their passwords on the client. + +Prerequisites +------------- + +- You have obtained the current password policy. +- The user has installed the client on any node in the cluster and obtained the IP address of the node. The password of the client installation user can be obtained from the administrator. + +Changing the Password on FusionInsight Manager +---------------------------------------------- + +#. Log in to FusionInsight Manager. + +#. Move the cursor to the username in the upper right corner of the page. + + On the user account drop-down menu, choose **Change Password**. + +#. On the displayed page, set **Current Password**, **New Password**, and **Confirm Password**, and click **OK**. + + By default, the password must meet the following complexity requirements: + + - The password contains at least 8 characters. + - The password must contain at least four types of the following characters: uppercase letters, lowercase letters, digits, spaces, and special characters (:literal:`\`~!@#$%^&*()-_=+|[{}];',<.>/\\?`). + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be a common easily-cracked password. + - The password cannot be the same as the password used in the latest *N* times. *N* indicates the value of **Repetition Rule** configured in :ref:`Configuring Password Policies `. + +Changing the Password on the Client +----------------------------------- + +#. 
Log in to the node where the client is installed as the client installation user. + +#. Run the following command to switch to the client directory, for example, **/opt/client**: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Change the user password. This operation takes effect for all servers. + + **kpasswd** *System username* + + For example, if you want to change the password of system user **test1**, run the **kpasswd test1** command. + + By default, the password must meet the following complexity requirements: + + - The password contains at least 8 characters. + - The password must contain at least four types of the following characters: uppercase letters, lowercase letters, digits, spaces, and special characters (:literal:`\`~!@#$%^&*()-_=+|[{}];',<.>/\\?`). + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be a common easily-cracked password. + - The password cannot be the same as the password used in the latest *N* times. *N* indicates the value of **Repetition Rule** configured in :ref:`Configuring Password Policies `. + + .. note:: + + If an error occurs during the running of the **kpasswd** command, try the following operations: + + - Stop the SSH session and start it again. + - Run the **kdestroy** command and then run the **kpasswd** command again. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/creating_a_user.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/creating_a_user.rst new file mode 100644 index 0000000..e9830cd --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/creating_a_user.rst @@ -0,0 +1,63 @@ +:original_name: admin_guide_000137.html + +.. _admin_guide_000137: + +Creating a User +=============== + +Scenario +-------- + +FusionInsight Manager supports a maximum of 50,000 users (including built-in users). By default, only user **admin** has the highest operation permissions of FusionInsight Manager. You need to create users on FusionInsight Manager and assign operation permissions to the users based on service requirements. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **System** > **Permission** > **User**. + +#. On the **User** page, click **Create**. + +#. Set **Username**. The username can contain digits, letters, underscores (_), hyphens (-), and spaces. It is case-insensitive and cannot be the same as any existing username in the system or OS. + +#. Set **User Type** to **Human-Machine** or **Machine-Machine**. + + - **Human-Machine** user: used for FusionInsight Manager O&M and component client operations. If you select this option, you also need to set **Password** and **Confirm Password**. + - **Machine-Machine** user: used for component application development. If you select this option, the password is randomly generated. + +#. In the **User Group** area, click **Add** to add one or more user groups to the list. + + .. note:: + + - If the selected user group has been bound to a role or a permission policy has been configured in Ranger, the user can obtain the corresponding permissions. + - After FusionInsight Manager is installed, some user groups generated by default have special permissions. 
Select desired user groups based on the descriptions on the UI. + - If existing user groups cannot meet your requirements, click **Create User Group** to create a user group. For details, see :ref:`Creating a User Group `. + +#. Select a group from the **Primary Group** drop-down list to create directories and files. + + The drop-down list contains all groups selected in **User Group**. + + .. note:: + + A user can belong to multiple groups (including the primary group and secondary groups). The primary group is set to facilitate maintenance and comply with the permission mechanism of the Hadoop community. The primary group has the same permission control functionality as other groups. + +#. In the **Role** area, click **Add** to bind roles to the user. + + .. note:: + + - Binding a role when creating a user specifies the user's permissions. + + - If the permissions granted to the user from the user group cannot meet service requirements, you can bind other created roles to the user. You can also click **Create Role** to create a role first. For details, see :ref:`Creating a Role `. + + It takes about three minutes for the role permissions assigned to the user to take effect. If the permissions obtained from the user group are sufficient, you do not need to add a role. + + - After Ranger authentication is enabled for a component, you need to configure Ranger policies to assign permissions to the user, except for the permissions provided by the default user groups or roles. + + - If a user is not added to a user group or assigned a role, the user cannot view information or perform operations after logging in to FusionInsight Manager. + +#. Enter information in **Description**. + +#. Click **OK**. + + After a human-machine user is created, you need to change the initial password as prompted after logging in to FusionInsight Manager. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/deleting_a_user.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/deleting_a_user.rst new file mode 100644 index 0000000..229bbed --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/deleting_a_user.rst @@ -0,0 +1,30 @@ +:original_name: admin_guide_000142.html + +.. _admin_guide_000142: + +Deleting a User +=============== + +Scenario +-------- + +Based on service requirements, you can delete system users that are no longer used on FusionInsight Manager. + +.. note:: + + - After a user is deleted, the ticket granting ticket (TGT) issued to the user remains valid for up to 24 hours. The user can still use the TGT for security authentication and access the system during this period. + - If a new user has the same name as the deleted user, the new user will inherit all owner permissions of the deleted user. You are advised to determine whether to delete the resources owned by the deleted user based on service requirements, for example, files in HDFS. + - The default user **admin** cannot be deleted. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **System** > **Permission** > **User**. +#. Locate the row that contains the target user, click **More**, and select **Delete**. + + .. note:: + + To delete users in batches, select the target users and click **Delete**. + +#. In the displayed dialog box, click **OK**.
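As noted above, a deleted user's TGT can remain valid for up to 24 hours. If you also want to make sure that a ticket already cached in a client session can no longer be reused, you can inspect and destroy the local ticket cache. The following is a minimal sketch, assuming a cluster client installed in **/opt/client** (an example path):

.. code-block:: bash

   # Switch to the client directory and load the client environment variables.
   cd /opt/client
   source bigdata_env

   # Show the Kerberos tickets cached for the current session.
   klist

   # Destroy the cached tickets so that a deleted user's TGT can no longer
   # be used from this session.
   kdestroy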
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/exporting_an_authentication_credential_file.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/exporting_an_authentication_credential_file.rst new file mode 100644 index 0000000..61a8adb --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/exporting_an_authentication_credential_file.rst @@ -0,0 +1,36 @@ +:original_name: admin_guide_000145.html + +.. _admin_guide_000145: + +Exporting an Authentication Credential File +=========================================== + +Scenario +-------- + +If a user uses a security mode cluster to develop applications, the keytab file of the user needs to be obtained for security authentication. You can export keytab files on FusionInsight Manager. + +.. note:: + + After a user password is changed, the exported keytab file becomes invalid, and you need to export a keytab file again. + +Prerequisites +------------- + +Before downloading the keytab file of a Human-Machine user, the password of the user must be changed at least once on the Manager portal or a client; otherwise, the downloaded keytab file cannot be used. For details, see :ref:`Changing a User Password `. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **System** > **Permission** > **User**. + +#. Locate the row that contains the user whose keytab file needs to be exported, choose **More** > **Download Authentication Credential**, specify the save path after the file is automatically generated, and keep the file secure. + + The authentication credential includes the **krb5.conf** file of the Kerberos service. + + After the authentication credential file is decompressed, you can obtain the following two files: + + - The **krb5.conf** file contains the authentication service connection information. + - The **user.keytab** file contains user authentication information. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/exporting_user_information.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/exporting_user_information.rst new file mode 100644 index 0000000..dc58afe --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/exporting_user_information.rst @@ -0,0 +1,24 @@ +:original_name: admin_guide_000139.html + +.. _admin_guide_000139: + +Exporting User Information +========================== + +Scenario +-------- + +You can export information about all created users on FusionInsight Manager. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **System** > **Permission** > **User**. + +#. Click **Export All** to export all user information at a time. + + The exported user information contains the username, creation time, description, user type (**0** indicates a human-machine account, **1** indicates a machine-machine account), primary group, user group list, and roles bound to the user. + +#. Set **Save As** to **TXT** or **CSV**. Click **OK**.
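An exported authentication credential package (described earlier on this page) is typically used by applications or scripts to authenticate without entering a password. The following is a minimal sketch; the decompression directory **/opt/keytabs/test1** and the principal name **test1** are example values only:

.. code-block:: bash

   # Point the Kerberos client tools at the exported configuration file.
   export KRB5_CONFIG=/opt/keytabs/test1/krb5.conf

   # Obtain a ticket using the exported keytab file instead of a password.
   kinit -kt /opt/keytabs/test1/user.keytab test1

   # Confirm that a ticket-granting ticket was obtained.
   klist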
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/index.rst new file mode 100644 index 0000000..20a842c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/index.rst @@ -0,0 +1,30 @@ +:original_name: admin_guide_000136.html + +.. _admin_guide_000136: + +Managing Users +============== + +- :ref:`Creating a User ` +- :ref:`Modifying User Information ` +- :ref:`Exporting User Information ` +- :ref:`Locking a User ` +- :ref:`Unlocking a User ` +- :ref:`Deleting a User ` +- :ref:`Changing a User Password ` +- :ref:`Initializing a Password ` +- :ref:`Exporting an Authentication Credential File ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + creating_a_user + modifying_user_information + exporting_user_information + locking_a_user + unlocking_a_user + deleting_a_user + changing_a_user_password + initializing_a_password + exporting_an_authentication_credential_file diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/initializing_a_password.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/initializing_a_password.rst new file mode 100644 index 0000000..1b8cc41 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/initializing_a_password.rst @@ -0,0 +1,34 @@ +:original_name: admin_guide_000144.html + +.. _admin_guide_000144: + +Initializing a Password +======================= + +Scenario +-------- + +If a user forgets the password or the public account password needs to be changed periodically, you can initialize the password on FusionInsight Manager. After the password is initialized, the system user needs to change the password upon first login. + +.. note:: + + This operation applies only to human-machine users. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **System** > **Permission** > **User**. + +#. Locate the row that contains the target user, click **More**, and select **Initialize Password**. In the displayed dialog box, enter the password of the current login user and click **OK**. In the **Initialize Password** dialog box, click **OK**. + +#. Set **New Password** and **Confirm Password**, and click **OK**. + + The password must meet the following complexity requirements by default: + + - The password contains at least 8 characters. + - The password must contain at least four types of the following characters: uppercase letters, lowercase letters, digits, spaces, and special characters (:literal:`\`~!@#$%^&*()-_=+|[{}];',<.>/\\?`). + - The password cannot be the same as the username or the username spelled backwards. + - The password cannot be a common easily-cracked password. + - The password cannot be the same as the password used in the latest *N* times. *N* indicates the value of **Repetition Rule** configured in :ref:`Configuring Password Policies `. 
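If you provision accounts with scripts, it can be convenient to pre-check a candidate password against the basic rules listed above before submitting it. The following is a minimal sketch in bash that checks only the length and character-type rules; the username, reuse, and common-password rules can only be enforced by the server, and **[[:punct:]]** is used as an approximation of the special-character set listed above:

.. code-block:: bash

   #!/bin/bash
   # Usage: ./check_password.sh '<candidate password>'
   pw="$1"

   # Count how many of the five character types are present.
   types=0
   printf '%s' "$pw" | grep -q '[A-Z]'       && types=$((types + 1))
   printf '%s' "$pw" | grep -q '[a-z]'       && types=$((types + 1))
   printf '%s' "$pw" | grep -q '[0-9]'       && types=$((types + 1))
   printf '%s' "$pw" | grep -q '[[:space:]]' && types=$((types + 1))
   printf '%s' "$pw" | grep -q '[[:punct:]]' && types=$((types + 1))

   # At least 8 characters and at least 4 character types are required by default.
   if [ "${#pw}" -ge 8 ] && [ "$types" -ge 4 ]; then
       echo "OK: length ${#pw}, ${types} character types"
   else
       echo "NOT OK: length ${#pw}, ${types} character types"
   fi

For example, running the script with the argument **'Example Pass 1!'** reports that the basic rules are met, whereas **'password'** does not.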
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/locking_a_user.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/locking_a_user.rst new file mode 100644 index 0000000..1a25e34 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/locking_a_user.rst @@ -0,0 +1,31 @@ +:original_name: admin_guide_000140.html + +.. _admin_guide_000140: + +Locking a User +============== + +Scenario +-------- + +A user account may remain unused for a long period of time due to service changes. For security purposes, you can lock such a user. + +You can lock a user using either of the following methods: + +- Automatic locking: You can set **Password Retries** in the password policy to automatically lock a user whose consecutive failed login attempts exceed this value. For details, see :ref:`Configuring Password Policies `. +- Manual locking: You manually lock a user. + +This section describes how to lock a user manually. Machine-machine users cannot be locked. + +Impact on the System +-------------------- + +A locked user cannot log in to FusionInsight Manager or perform identity authentication in the cluster. A locked user can be used again only after it is manually unlocked or the lock duration expires. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **System** > **Permission** > **User**. +#. Locate the row that contains the target user and click **Lock** in the **Operation** column. +#. In the window that is displayed, select **I have read the information and understand the impact**. Click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/modifying_user_information.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/modifying_user_information.rst new file mode 100644 index 0000000..1a1ff62 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/modifying_user_information.rst @@ -0,0 +1,28 @@ +:original_name: admin_guide_000138.html + +.. _admin_guide_000138: + +Modifying User Information +========================== + +Scenario +-------- + +You can modify user information on FusionInsight Manager, including the user group, primary group, role permission assignment, and user description. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **System** > **Permission** > **User**. + +#. Locate the row that contains the target user and click **Modify** in the **Operation** column. + + Modify the parameters based on service requirements. + + .. note:: + + It takes at most three minutes for changes to the user's user groups or role permissions to take effect. + +#. Click **OK**.
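After changing a user's user groups or roles, you may want to confirm from a cluster client that the new permissions are in effect. The following is a minimal sketch, assuming a client installed in **/opt/client**, a human-machine user **test1**, and that the user has been granted HDFS read permission (all example values):

.. code-block:: bash

   # Switch to the client directory and load the client environment variables.
   cd /opt/client
   source bigdata_env

   # Authenticate as the modified user (you are prompted for the password).
   kinit test1

   # Run a simple read operation; it should succeed once the new
   # permissions have taken effect (allow up to three minutes).
   hdfs dfs -ls /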
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/unlocking_a_user.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/unlocking_a_user.rst new file mode 100644 index 0000000..620e615 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/managing_users/unlocking_a_user.rst @@ -0,0 +1,19 @@ +:original_name: admin_guide_000141.html + +.. _admin_guide_000141: + +Unlocking a User +================ + +Scenario +-------- + +You can unlock a user on FusionInsight Manager if the user has been locked because the number of login attempts exceeds the threshold. Only users created on FusionInsight Manager can be unlocked. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **System** > **Permission** > **User**. +#. Locate the row that contains the target user and click **Unlock** in the **Operation** column. +#. In the window that is displayed, select **I have read the information and understand the impact**. Click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/security_policies/configuring_password_policies.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/security_policies/configuring_password_policies.rst new file mode 100644 index 0000000..da8acdb --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/security_policies/configuring_password_policies.rst @@ -0,0 +1,53 @@ +:original_name: admin_guide_000150.html + +.. _admin_guide_000150: + +Configuring Password Policies +============================= + +Scenario +-------- + +To keep up with service security requirements, you can set password security rules, user login security rules, and user locking rules on FusionInsight Manager. + +.. important:: + + - Modify password policies based on service security requirements, because they involve user management security. Otherwise, security risks may be incurred. + - Change the user password after modifying the password policy, and then the new password policy can take effect. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **System** > **Permission** > **Security Policy** > **Password Policy**. + +#. Click **Modify** in the **Operation** column and modify the password policy as prompted. + + For details about the parameters, see :ref:`Table 1 `. + + .. _admin_guide_000150__table_1: + + .. 
table:: **Table 1** Password policy parameters + + +------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +============================================================+==========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Minimum Password Length | Indicates the minimum number of characters a password contains. The value ranges from **8** to **32**. The default value is **8**. | + +------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Character Types | Indicates how many character types in the following types a password can contain at least: uppercase letters, lowercase letters, digits, spaces, and special characters (:literal:`~`!?,.:;-_'(){}[]/<>@#$%^&*+|\\=`). The value can be **4** or **5**. The default value is **4**, which means that a password can contain uppercase letters, lowercase letters, digits, and special characters. If you set the parameter to **5**, a password can contain all the five character types mentioned above. 
| + +------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Password Retries | Indicates the number of consecutive wrong password attempts allowed before the system locks the user. The value ranges from **3** to **30**. The default value is **5**. | + +------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | User Lock Duration (Min) | Indicates the time period in which a user is locked when the user lockout conditions are met. The value ranges from **5** to **120**. The default value is **5**. | + +------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Password Validity Period (Day) | Indicates the validity period of a password. The value ranges from **0** to **90**. **0** indicates that the password is permanently valid. The default value is **90**. 
| + +------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Repetition Rule | Indicates the number of previous passwords that cannot be reused when you change the password. The value ranges from **1** to **5**. The default value is **1**. This policy applies to only human-machine accounts. | + +------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Password Expiration Notification (Days) | Indicates the number of days in advance users are notified that their passwords are about to expire. After the value is set, if the difference between the cluster time and the password expiration time is smaller than this value, the user receives password expiration notifications. When logging in to FusionInsight Manager, the user will be notified that the password is about to expire and a message is displayed asking the user to change the password. The value ranges from **0** to *X* (*X* must be set to the half of the password validity period and rounded down). Value **0** indicates that no notification is sent. The default value is **5**. | + +------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Interval for Deleting Authentication Failure Records (Min) | Indicates the interval of retaining incorrect password attempts. The value ranges from **0** to **1440**. **0** indicates that incorrect password attempts are permanently retained, and **1440** indicates that incorrect password attempts are retained for one day. The default value is **5**. 
| + +------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK** to save the configurations. Change the user password after modifying the password policy, and then the new password policy can take effect. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/security_policies/configuring_the_independent_attribute.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/security_policies/configuring_the_independent_attribute.rst new file mode 100644 index 0000000..b873445 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/security_policies/configuring_the_independent_attribute.rst @@ -0,0 +1,64 @@ +:original_name: admin_guide_000151.html + +.. _admin_guide_000151: + +Configuring the Independent Attribute +===================================== + +Scenario +-------- + +User **admin** or administrators who are bound to the Manager_administrator role can configure the independent attribute on FusionInsight Manager so that common users (all service users in the cluster) can set or cancel their own independent attributes. + +After the independent attribute option is toggled on, service users need to log in to the system and set the independent attribute. + +Constraints +----------- + +- Administrators cannot set or cancel the independent attribute of a user. +- Administrators cannot obtain the authentication credentials of independent users. + +Prerequisites +------------- + +You have obtained the required administrator username and password. + +Procedure +--------- + +**Toggling On or Off the Independent Attribute** + +#. Log in to FusionInsight Manager as user **admin** or a user bound to the Manager_administrator role. +#. Choose **System** > **Permission** > **Security Policy** > **Independent Configurations**. +#. Toggle on or off **Independent Attribute**, enter the password as prompted, and click **OK**. +#. After the identity is authenticated, wait until the OMS configuration is modified and click **Finish**. + + .. note:: + + After the independent attribute is disabled: + + - A user who has the attribute can cancel it from the drop-down list of the username in the upper right corner of the page. The user cannot set the independent attribute again once it is cancelled. After the attribute is cancelled, existing independent tables will retain the attribute. However, the user cannot create independent tables again. + - Users without this attribute cannot set or cancel the attribute. + +**Configuring the Independent Attribute** + +5. Log in to FusionInsight Manager as a service user. + + .. important:: + + Administrators cannot initialize the password of the user after the independent attribute is set. 
If the user password is forgotten, the password cannot be retrieved. + + User **admin** cannot set the independent attribute. + +6. Move the cursor to the username in the upper right corner of the page. +7. Select **Set Independent** or **Cancel Independent**. + + .. note:: + + - If the independent attribute is toggled on and has been set for the service user, **Cancel Independent** is displayed. + - If the independent attribute is toggled on but has been cancelled for the service user, **Set Independent** is displayed. + - If the independent attribute is toggled off but has been set for the service user, **Cancel Independent** is displayed. + - If the independent attribute is toggled off and has been cancelled for the service user, no option related to the independent attribute is displayed. + +8. Enter the password as prompted and click **OK**. +9. After the identity is authenticated, click **OK** in the dialog box. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/security_policies/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/security_policies/index.rst new file mode 100644 index 0000000..fea080c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/configuring_permissions/security_policies/index.rst @@ -0,0 +1,16 @@ +:original_name: admin_guide_000149.html + +.. _admin_guide_000149: + +Security Policies +================= + +- :ref:`Configuring Password Policies ` +- :ref:`Configuring the Independent Attribute ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + configuring_password_policies + configuring_the_independent_attribute diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/importing_a_certificate.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/importing_a_certificate.rst new file mode 100644 index 0000000..7e7a299 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/importing_a_certificate.rst @@ -0,0 +1,62 @@ +:original_name: admin_guide_000157.html + +.. _admin_guide_000157: + +Importing a Certificate +======================= + +Scenario +-------- + +CA certificates are used to encrypt data during communication between FusionInsight Manager modules and between cluster component clients and servers to ensure security. CA certificates can be quickly imported to FusionInsight Manager for product security. Import CA certificates in following scenarios: + +- When the cluster is installed for the first time, you need to replace the enterprise certificate. +- If the enterprise certificate has expired or security hardening is required, you need to replace it with a new certificate. + +Impact on the System +-------------------- + +- During certificate replacement, the cluster needs to be restarted. In this case, the system becomes inaccessible and cannot provide services. +- After the certificate is replaced, the certificates used by all components and FusionInsight Manager modules are automatically updated. +- After the certificate is replaced, you need to reinstall the certificate in the local environment where the certificate is not trusted. + +Prerequisites +------------- + +- You have generated the certificate file and key file or obtained them from the enterprise certificate administrator. 
+ +- You have obtained the files to be imported to the cluster, including the CA certificate file (**\*.crt**), key file (**\*.key**), and file that saves the key file password (**password.property**). The certificate name and key name can contain uppercase letters, lowercase letters, and digits. After the preceding files are generated, compress them into a TAR package. + +- You have obtained a password for accessing the key file, for example, **Userpwd@123**. + + To avoid potential security risks, the password must meet the following complexity requirements: + + - It must contain at least eight characters. + - It must contain at least four of the following character types: uppercase letters, lowercase letters, digits, and special characters :literal:`~`!?,.:;-_'(){}[]/<>@#$%^&*+|\\=.` + +- When applying for certificates from the certificate administrator, you have provided the password for accessing the key file and applied for the certificate files in CRT, CER, CERT, and PEM formats and the key files in KEY and PEM formats. The requested certificates must have the issuing function. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **System** > **Certificate**. + +#. Click |image1| on the right of **Upload Certificate**. In the file selection window, browse to select the obtained TAR package of the certificate files. + +#. Click **Upload**. + + Manager uploads the compressed package and automatically imports the package. + +#. After the certificate is imported, the system displays a message asking you to synchronize the cluster configuration and restart the web service for the new certificate to take effect. Click **OK**. + +#. In the displayed dialog box, enter the password of the current login user and click **OK**. The cluster configuration is automatically synchronized and the web service is restarted. + +#. After the cluster is restarted, enter the URL for accessing FusionInsight Manager in the address box of the browser and check whether the FusionInsight Manager web page can be successfully displayed. + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster**, click the name of the target cluster, choose **Dashboard**, click **More**, and select **Restart**. + +#. In the displayed dialog box, enter the password of the current login user and click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0263899546.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/index.rst new file mode 100644 index 0000000..105e9f3 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/index.rst @@ -0,0 +1,22 @@ +:original_name: admin_guide_000134.html + +.. _admin_guide_000134: + +System Configuration +==================== + +- :ref:`Configuring Permissions ` +- :ref:`Configuring Interconnections ` +- :ref:`Importing a Certificate ` +- :ref:`OMS Management ` +- :ref:`Component Management ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + configuring_permissions/index + configuring_interconnections/index + importing_a_certificate + oms_management/index + component_management/index diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/oms_management/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/oms_management/index.rst new file mode 100644 index 0000000..989fb51 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/oms_management/index.rst @@ -0,0 +1,16 @@ +:original_name: admin_guide_000159.html + +.. _admin_guide_000159: + +OMS Management +============== + +- :ref:`Overview of the OMS Page ` +- :ref:`Modifying OMS Service Configuration Parameters ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview_of_the_oms_page + modifying_oms_service_configuration_parameters diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/oms_management/modifying_oms_service_configuration_parameters.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/oms_management/modifying_oms_service_configuration_parameters.rst new file mode 100644 index 0000000..066cf4e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/oms_management/modifying_oms_service_configuration_parameters.rst @@ -0,0 +1,95 @@ +:original_name: admin_guide_000162.html + +.. _admin_guide_000162: + +Modifying OMS Service Configuration Parameters +============================================== + +Scenario +-------- + +Based on the security requirements of the user environment, you can modify the Kerberos and LDAP configurations in the OMS on FusionInsight Manager. + +Impact on the System +-------------------- + +After the OMS service configuration parameters are modified, the corresponding OMS module needs to be restarted. In this case, FusionInsight Manager cannot be used. + +Procedure +--------- + +**Modifying the okerberos configuration** + +#. Log in to FusionInsight Manager and choose **System** > **OMS**. + +2. Locate the row that contains okerberos and click **Modify Configuration**. + +3. Modify the parameters according to :ref:`Table 1 `. + + .. _admin_guide_000162__table19796438111412: + + .. table:: **Table 1** okerberos parameters + + +--------------------------+----------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +==========================+================================================================================================================+ + | KDC Timeout (ms) | Timeout duration for an application to connect to Kerberos, in milliseconds. The value must be an integer. | + +--------------------------+----------------------------------------------------------------------------------------------------------------+ + | Max Retries | Maximum number of retries for an application to connect to Kerberos, in seconds. The value must be an integer. | + +--------------------------+----------------------------------------------------------------------------------------------------------------+ + | LDAP Timeout (ms) | Timeout duration for Kerberos to connect to LDAP, in milliseconds. 
| + +--------------------------+----------------------------------------------------------------------------------------------------------------+ + | LDAP Search Timeout (ms) | Timeout duration for Kerberos to query user information in LDAP, in milliseconds. | + +--------------------------+----------------------------------------------------------------------------------------------------------------+ + | Kadmin Listening Port | Port number of the Kadmin service. | + +--------------------------+----------------------------------------------------------------------------------------------------------------+ + | KDC Listening Port | Port number of the kinit service. | + +--------------------------+----------------------------------------------------------------------------------------------------------------+ + | Kpasswd Listening Port | Port number of the Kpasswd service. | + +--------------------------+----------------------------------------------------------------------------------------------------------------+ + +4. Click **OK**. + + In the displayed dialog box, enter the password of the current login user and click **OK**. In the displayed confirmation dialog box, click **OK**. + +**Modifying the oldap configuration** + +5. Locate the row that contains the oldap and click **Modify Configuration**. + +6. Modify the parameters according to :ref:`Table 2 `. + + .. _admin_guide_000162__table1696932817285: + + .. table:: **Table 2** OLDAP parameters + + =================== ================================ + Parameter Description + =================== ================================ + LDAP Listening Port Port number of the LDAP service. + =================== ================================ + +7. Click **OK**. + + In the displayed dialog box, enter the password of the current login user and click **OK**. In the displayed confirmation dialog box, click **OK**. + + .. note:: + + To reset the password of the LDAP account, you need to restart ACS. The procedure is as follows: + + a. Log in to the active management node as user **omm** using PuTTY, and run the following command to update the domain configuration: + + **sh ${BIGDATA_HOME}/om-server/om/sbin/restart-RealmConfig.sh** + + The command is run successfully if the following information is displayed: + + .. code-block:: + + Modify realm successfully. Use the new password to log into FusionInsight again. + + b. Run the **sh $CONTROLLER_HOME/sbin/acs_cmd.sh stop** command to stop ACS. + + c. Run the **sh $CONTROLLER_HOME/sbin/acs_cmd.sh start** command to start ACS. + +**Restarting the cluster** + +8. Log in to FusionInsight Manager and restart the cluster by referring to :ref:`Performing a Rolling Restart of a Cluster `. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/oms_management/overview_of_the_oms_page.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/oms_management/overview_of_the_oms_page.rst new file mode 100644 index 0000000..4ecf93f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/system_configuration/oms_management/overview_of_the_oms_page.rst @@ -0,0 +1,62 @@ +:original_name: admin_guide_000160.html + +.. _admin_guide_000160: + +Overview of the OMS Page +======================== + +Overview +-------- + +Log in to FusionInsight Manager and choose **System** > **OMS**. 
You can perform maintenance operations on the OMS page, including viewing basic information, viewing the service status of OMS service modules, and manually triggering health checks. + +.. note:: + + OMS is the management node of the O&M system. Generally, there are two OMS nodes that work in active/standby mode. + +Basic Information +----------------- + +OMS-associated information is displayed on FusionInsight Manager, as listed in :ref:`Table 1 `. + +.. _admin_guide_000160__table14579151510169: + +.. table:: **Table 1** OMS information + + +-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Item | Description | + +=================+===========================================================================================================================================================+ + | Version | Indicates the OMS version, which is consistent with the FusionInsight Manager version. | + +-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | IP Mode | Indicates the IP address mode of the current cluster network. | + +-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | HA Mode | Indicates the OMS working mode, which is specified by the configuration file during FusionInsight Manager installation. | + +-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Current Active | Indicates the host name of the active OMS node, that is, the host name of the active management node. Click a host name to go to the host details page. | + +-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Current Standby | Indicates the host name of the standby OMS node, that is, the host name of the standby management node. Click a host name to go to the host details page. | + +-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Duration | Indicates the duration for starting the OMS process. | + +-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + +OMS Service Status +------------------ + +FusionInsight Manager displays the running status of all OMS service modules. If the status of each service module is displayed as |image1|, the OMS is running properly. + +Health Check +------------ + +You can click **Health Check** on the OMS page to check the OMS status. If some check items are faulty, you can view the check description for troubleshooting. + +Entering or Exiting Maintenance Mode +------------------------------------ + +Configure OMS to enter or exit the maintenance mode. + +System Parameters +----------------- + +Connect to the DMPS cluster in large-scale cluster scenarios. + +.. 
|image1| image:: /_static/images/en-us_image_0263899675.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/index.rst new file mode 100644 index 0000000..e786cf5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/index.rst @@ -0,0 +1,20 @@ +:original_name: admin_guide_000087.html + +.. _admin_guide_000087: + +Tenant Resources +================ + +- :ref:`Multi-Tenancy ` +- :ref:`Using the Superior Scheduler ` +- :ref:`Using the Capacity Scheduler ` +- :ref:`Switching the Scheduler ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + multi-tenancy/index + using_the_superior_scheduler/index + using_the_capacity_scheduler/index + switching_the_scheduler diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/index.rst new file mode 100644 index 0000000..c711871 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000088.html + +.. _admin_guide_000088: + +Multi-Tenancy +============= + +- :ref:`Overview ` +- :ref:`Technical Principles ` +- :ref:`Multi-Tenancy Usage ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview + technical_principles/index + multi-tenancy_usage/index diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/multi-tenancy_usage/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/multi-tenancy_usage/index.rst new file mode 100644 index 0000000..78c51d6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/multi-tenancy_usage/index.rst @@ -0,0 +1,16 @@ +:original_name: admin_guide_000096.html + +.. _admin_guide_000096: + +Multi-Tenancy Usage +=================== + +- :ref:`Overview ` +- :ref:`Process Overview ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview + process_overview diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/multi-tenancy_usage/overview.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/multi-tenancy_usage/overview.rst new file mode 100644 index 0000000..11b2927 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/multi-tenancy_usage/overview.rst @@ -0,0 +1,41 @@ +:original_name: admin_guide_000097.html + +.. _admin_guide_000097: + +Overview +======== + +Tenants are used in resource control and service isolation scenarios. Administrators need to determine the service scenarios of cluster resources and then plan tenants. + +.. note:: + + - Yarn in a new cluster uses the Superior scheduler by default. For details, see :ref:`Using the Superior Scheduler `. + +Multi-tenancy involves three types of operations: creating a tenant, managing tenants, and managing resources. :ref:`Table 1 ` describes these operations. + +.. _admin_guide_000097__tfa862b93a42b4565924842697866fde8: + +.. 
table:: **Table 1** Multi-tenant operations + + +-----------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | Action | Description | + +=======================+=============================================================+=============================================================================================================================================================================================================================================+ + | Creating a tenant | - Add a tenant. | During the creation of a tenant, you can configure its computing resources, storage resources, and associated services based on service requirements. In addition, you can add users to the tenant and bind necessary roles to these users. | + | | - Add a sub-tenant. | | + | | - Create a user and bind the user to the role of a tenant. | A user to create a level-1 tenant needs to be bound to the **Manager_administrator** or **System_administrator** role. | + | | | | + | | | A user to create a sub-tenant needs to be bound to the role of the parent tenant at least. | + +-----------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Managing tenants | - Manage the tenant directory. | You can edit tenants as services change. | + | | - Restore tenant data. | | + | | - Clear non-associated queues of a tenant. | A user to manage or delete a level-1 tenant or restore tenant data needs to be bound to the **Manager_administrator** or **System_administrator** role. | + | | - Delete a tenant. | | + | | | A user to manage or delete a sub-tenant needs to be bound to the role of the parent tenant at least. | + +-----------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Managing resources | - Create a resource pool. | You can reconfigure resources for tenants as the services change. | + | | - Modify a resource pool. | | + | | - Delete a resource pool. | A user to manage resources needs to be bound to the **Manager_administrator** or **System_administrator** role. | + | | - Configure a queue. | | + | | - Configure the queue capacity policy of a resource pool. | | + | | - Clear configurations of a queue. 
| | + +-----------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/multi-tenancy_usage/process_overview.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/multi-tenancy_usage/process_overview.rst new file mode 100644 index 0000000..e4eb339 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/multi-tenancy_usage/process_overview.rst @@ -0,0 +1,32 @@ +:original_name: admin_guide_000098.html + +.. _admin_guide_000098: + +Process Overview +================ + +Administrators need to determine the service scenarios of cluster resources and then plan tenants. After that, administrators add tenants and configure dynamic resources, storage resources, and associated services for the tenants on FusionInsight Manager. + +:ref:`Process Overview ` shows the process for creating a tenant. + + +.. figure:: /_static/images/en-us_image_0263899222.png + :alt: **Figure 1** Creating a tenant + + **Figure 1** Creating a tenant + +:ref:`Table 1 ` describes the operations for creating a tenant. + +.. _admin_guide_000098__t986256797c9741f79fb27fed0559d8cb: + +.. table:: **Table 1** Operations for creating a tenant + + +--------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | Description | + +==================================================+=======================================================================================================================================================================================================+ + | Add a tenant. | You can configure the computing resources, storage resources, and associated services of the tenant. | + +--------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Add a sub-tenant. | You can configure the computing resources, storage resources, and associated services of the sub-tenant. | + +--------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Add a user and bind the user to the tenant role. | If a user wants to use the resources of tenant **tenant1** or add or delete sub-tenants for **tenant1**, the user must be bound to both the **Manager_tenant** and **tenant1\_**\ *Cluster ID* roles. 
| + +--------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/overview.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/overview.rst new file mode 100644 index 0000000..36a6969 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/overview.rst @@ -0,0 +1,43 @@ +:original_name: admin_guide_000089.html + +.. _admin_guide_000089: + +Overview +======== + +Definition +---------- + +Multi-tenancy refers to multiple resource sets (a resource set is a tenant) in the MRS big data cluster and is able to allocate and schedule resources. The resources include computing resources and storage resources. + +Context +------- + +Modern enterprises' data clusters are becoming more and more centralized and cloud-based. Enterprise-class big data clusters must meet the following requirements: + +- Carry data of different types and formats and run jobs and applications of different types (such analysis, query, and stream processing). +- Isolate data of a user from that of another user who has demanding requirements on data security, such as a bank or government institute. + +The preceding requirements bring the following challenges to the big data clusters: + +- Proper allocation and scheduling of resources to ensure stable operating of applications and jobs. +- Strict access control to ensure data and service security. + +Multi-tenancy isolates the resources of a big data cluster into resource sets. Users can lease desired resource sets to run applications and jobs and store data. In a big data cluster, multiple resource sets can be deployed to meet diverse requirements of multiple users. + +The MRS big data cluster provides a complete enterprise-class big data multi-tenant solution. + +Highlights +---------- + +- Proper resource configuration and isolation + + The resources of a tenant are isolated from those of another tenant. The resource use of a tenant does not affect other tenants. This mechanism ensures that each tenant can configure resources based on service requirements, improving resource utilization. + +- Resource consumption measurement and statistics + + Tenants are system resource applicants and consumers. System resources are planned and allocated based on tenants. Resource consumption by tenants can be measured and collected. + +- Assured data security and access security + + In multi-tenant scenarios, the data of each tenant is stored separately to ensure data security. The access to tenants' resources is controlled to ensure access security. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/dynamic_resources.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/dynamic_resources.rst new file mode 100644 index 0000000..c1042d6 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/dynamic_resources.rst @@ -0,0 +1,92 @@ +:original_name: admin_guide_000094.html + +.. 
_admin_guide_000094: + +Dynamic Resources +================= + +Overview +-------- + +Yarn provides distributed resource management for a big data cluster. The total volume of resources allocated to Yarn can be configured. Then Yarn allocates and schedules computing resources for job queues. The computing resources of MapReduce, Spark, Flink, and Hive job queues are allocated and scheduled by Yarn. + +Yarn queues are fundamental units of scheduling computing resources. + +The resources obtained by tenants using Yarn queues are dynamic resources. Users can dynamically create and modify the queue quotas and view the status and statistics of the queues. + +Resource Pools +-------------- + +Nowadays, enterprise IT systems often face complex cluster environments and diverse upper-layer requirements. For example: + +- Heterogeneous cluster: The computing speed, storage capacity, and network performance of each node in the cluster are different. All the tasks of complex applications need to be properly allocated to each compute node in the cluster based on service requirements. +- Computing isolation: Data must be shared among multiple departments but computing resources must be distributed onto different compute nodes. + +These require that the compute nodes be further partitioned. + +Resource pools are used to specify the configuration of dynamic resources. Yarn queues are associated with resource pools for resource allocation and scheduling. + +One tenant can have only one default resource pool. Users can be bound to the role of a tenant to use the resources in the resource pool of the tenant. To use resources in multiple resource pools, a user can be bound to roles of multiple tenants. + +Scheduling Mechanism +-------------------- + +Yarn dynamic resources support label-based scheduling. This policy creates labels for compute nodes (Yarn NodeManagers) and adds the compute nodes with the same label into the same resource pool. Then Yarn dynamically associates the queues with resource pools based on the resource requirements of the queues. + +For example, a cluster has more than 40 nodes which are labeled by **Normal**, **HighCPU**, **HighMEM**, or **HighIO** based on their hardware and network configurations and added into four resource pools, respectively. :ref:`Table 1 ` describes the performance of each node in the resource pool. + +.. _admin_guide_000094__t88a97f43aa8049388a46adeaaa2b19cc: + +.. 
table:: **Table 1** Performance of each node in a resource pool + + +---------+-----------------+------------------------------------+-----------------+---------------------------+ + | Label | Number of Nodes | Hardware and Network Configuration | Added To | Associated With | + +=========+=================+====================================+=================+===========================+ + | Normal | 10 | General | Resource pool A | Common queue | + +---------+-----------------+------------------------------------+-----------------+---------------------------+ + | HighCPU | 10 | High-performance CPU | Resource pool B | Computing-intensive queue | + +---------+-----------------+------------------------------------+-----------------+---------------------------+ + | HighMEM | 10 | Large memory | Resource pool C | Memory-intensive queue | + +---------+-----------------+------------------------------------+-----------------+---------------------------+ + | HighIO | 10 | High-performance network | Resource pool D | I/O-intensive queue | + +---------+-----------------+------------------------------------+-----------------+---------------------------+ + +A queue can use only the compute nodes in its associated resource pool. + +- A common queue is associated with resource pool A and uses **Normal** nodes with general hardware and network configurations. +- A computing-intensive queue is associated with resource pool B and uses **HighCPU** nodes with high-performance CPUs. +- A memory-intensive queue is associated with resource pool C and uses **HighMEM** nodes with large memory. +- An I/O-intensive queue is associated with resource pool D and uses **HighIO** nodes with a high-performance network. + +Yarn queues are associated with specified resource pools to efficiently utilize resources in resource pools and maximize node performance. + +FusionInsight Manager supports a maximum of 50 resource pools. The system has a default resource pool. + +Schedulers +---------- + +By default, the Superior scheduler is enabled for the MRS cluster. + +- The Superior scheduler is an enhanced scheduler named after Lake Superior, indicating that it can manage a large amount of data. + +To meet enterprise requirements and tackle scheduling challenges faced by the Yarn community, the Superior scheduler makes the following enhancements: + +- Enhanced resource sharing policy + + The Superior scheduler supports queue hierarchy. It integrates the functions of open-source schedulers and shares resources based on configurable policies. For instance, administrators can use the Superior scheduler to configure an absolute value or percentage policy for queue resources. The resource sharing policy of the Superior scheduler enhances label-based scheduling of Yarn as a resource pool feature. The nodes in the Yarn cluster can be grouped based on the capacity or service type to ensure that queues can more efficiently utilize resources. + +- Tenant-based resource reservation policy + + Some tenants may run critical tasks at some time, and their resource requirements must be preferentially addressed. The Superior scheduler builds a mechanism to support the resource reservation policy. Reserved resources can be allocated to the critical tasks running in the specified tenant queues in a timely manner to ensure proper task execution. + +- Fair sharing among tenants and resource pool users + + The Superior scheduler allows shared resources to be configured for users in a queue.
Each tenant may have users with different weights. Heavily weighted users may require more shared resources. + +- Ensured scheduling performance in a big cluster + + The Superior scheduler receives heartbeats from each NodeManager and saves resource information in memory, which enables the scheduler to control cluster resource usage globally. The Superior scheduler uses the push scheduling model, which makes the scheduling more precise and efficient and remarkably improves cluster resource utilization. Additionally, the Superior scheduler delivers excellent performance when the interval between NodeManager heartbeats is long and prevents heartbeat storms in big clusters. + +- Priority policy + + If the minimum resource requirement of a service cannot be met after the service obtains all available resources, a preemption occurs. The preemption function is disabled by default. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/index.rst new file mode 100644 index 0000000..f28225b --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/index.rst @@ -0,0 +1,22 @@ +:original_name: admin_guide_000090.html + +.. _admin_guide_000090: + +Technical Principles +==================== + +- :ref:`Multi-Tenant Management ` +- :ref:`Multi-Tenant Model ` +- :ref:`Resource Overview ` +- :ref:`Dynamic Resources ` +- :ref:`Storage Resources ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + multi-tenant_management + multi-tenant_model + resource_overview + dynamic_resources + storage_resources diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/multi-tenant_management.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/multi-tenant_management.rst new file mode 100644 index 0000000..105a0b0 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/multi-tenant_management.rst @@ -0,0 +1,108 @@ +:original_name: admin_guide_000091.html + +.. _admin_guide_000091: + +Multi-Tenant Management +======================= + +Unified Multi-Tenant Management +------------------------------- + +Log in to FusionInsight Manager and choose **Tenant Resources** > **Tenant Resources Management**. On the page that is displayed, you can find that FusionInsight Manager is a unified multi-tenant management platform that integrates multiple functions such as tenant lifecycle management, tenant resource configuration, tenant service association, and tenant resource usage statistics, delivering a mature multi-tenant management model and achieving centralized tenant and service management. + +**Graphical User Interface** + +FusionInsight Manager provides the graphical multi-tenant management interface and manages and operates multiple levels of tenants using the tree structure. Additionally, FusionInsight Manager integrates the basic information and resource quota of the current tenant in one interface to facilitate O&M and management, as shown in :ref:`Figure 1 `. + +.. _admin_guide_000091__fig2773161717323: + +.. 
figure:: /_static/images/en-us_image_0000001369960209.png + :alt: **Figure 1** Tenant management page of FusionInsight Manager + + **Figure 1** Tenant management page of FusionInsight Manager + +**Hierarchical Tenant Management** + +FusionInsight Manager supports a hierarchical tenant management model in which you can add sub-tenants to an existing tenant to re-configure resources. Sub-tenants of level-1 tenants are level-2 tenants. So on and so forth. FusionInsight Manager provides enterprises with a field-tested multi-tenant management model, enabling centralized tenant and service management. + +Simplified Permission Management +-------------------------------- + +FusionInsight Manager hides internal permission management details from common users and simplifies permission management operations for administrators, improving usability and user experience of tenant permission management. + +- FusionInsight Manager employs role-based access control (RBAC) to configure different permissions for users based on service scenarios during multi-tenant management. +- The administrator of tenants has tenant management permissions, including viewing resources and services of the current tenant, adding or deleting sub-tenants of the current tenant, and managing permissions of sub-tenants' resources. FusionInsight Manager supports setting of the administrator for a single tenant so that the management over this tenant can be delegated to a user who is not the system administrator. +- Roles of a tenant have all permissions on the computing resources and storage resources of the tenant. When a tenant is created, the system automatically creates roles for this tenant. You can add a user and bind the user to the tenant roles so that the user can use the resources of the tenant. + +Clear Resource Management +------------------------- + +- **Self-Service Resource Configuration** + + In FusionInsight Manager, you can configure the computing resources and storage resources during the creation of a tenant and add, modify, or delete the resources of the tenant. + + Permissions of the roles that are associated with a tenant are updated automatically when you modify the computing or storage resources of the tenant. + +- **Resource Usage Statistics** + + Resource usage statistics are critical for administrators to determine O&M activities based on the status of cluster applications and services, improving the cluster O&M efficiency. FusionInsight Manager displays the resource statistics of tenants in **Resource Quota**, including the vCores, memory, and HDFS storage resources. + + .. note:: + + - **Resource Quota** dynamically calculates the resource usage of tenants. + + |image1| + + The available resources of the Superior scheduler are calculated as follows: + + - Superior + + The available Yarn resources (memory and CPU) are allocated in proportion based on the queue weight. + + - When the tenant administrator is bound to a tenant role, the tenant administrator has the permissions to manage the tenant and use all resources of the tenant. + +- **Graphical Resource Monitoring** + + Graphical resource monitoring supports the graphical display of monitoring metrics listed in :ref:`Table 1 `, as shown in :ref:`Figure 2 `. + + .. _admin_guide_000091__fig136061232032: + + .. figure:: /_static/images/en-us_image_0263899641.png + :alt: **Figure 2** Refined monitoring + + **Figure 2** Refined monitoring + + By default, the real-time monitoring data is displayed. You can click |image2| to customize a time range. 
The default time ranges include 4 hours, 8 hours, 12 hours, 1 day, 1 week, and 1 month. Click |image3| and select **Export** to export the monitoring metric information. + + .. _admin_guide_000091__table3621114917574: + + .. table:: **Table 1** Monitoring metrics + + +-----------------------+-----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Metric Item | Description | + +=======================+=========================================+=================================================================================================================================================+ + | HDFS | HDFS Tenant Space Details | HDFS can monitor a specified storage directory. The storage directory is the same as the directory added by the current tenant in **Resource**. | + | | | | + | | - Allocated Space | | + | | - Used Space | | + +-----------------------+-----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | | HDFS Tenant File Object Details | | + | | | | + | | - Number of Used File Objects | | + +-----------------------+-----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Yarn | Yarn Allocated Cores | Monitoring information of the current tenant is displayed. If no sub-item is configured for a tenant, this information is not displayed. | + | | | | + | | - Maximum Number of CPU Cores in an AM | The monitoring data is obtained from **Scheduler** > **Application Queues** > **Queue:** *Tenant name* on the native web UI of Yarn. | + | | - Allocated Cores | | + | | - Number of Used CPU Cores in an AM | | + +-----------------------+-----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Yarn Allocated Memory | | + | | | | + | | - Allocated Maximum AM Memory | | + | | - Allocated Memory | | + | | - Used AM Memory | | + +-----------------------+-----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001169371695.png +.. |image2| image:: /_static/images/en-us_image_0000001370085637.png +.. |image3| image:: /_static/images/en-us_image_0263899288.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/multi-tenant_model.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/multi-tenant_model.rst new file mode 100644 index 0000000..6886334 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/multi-tenant_model.rst @@ -0,0 +1,109 @@ +:original_name: admin_guide_000092.html + +.. _admin_guide_000092: + +Multi-Tenant Model +================== + +Related Model +------------- + +The following figure shows a multi-tenant model. + +.. _admin_guide_000092__f486ae0dbdc8a4d6285e9d6e8ac5cbde0: + +.. 
figure:: /_static/images/en-us_image_0263899456.png + :alt: **Figure 1** Multi-tenant model + + **Figure 1** Multi-tenant model + +:ref:`Table 1 ` describes the concepts involved in :ref:`Figure 1 `. + +.. _admin_guide_000092__t8493e085f6bc470eb314c82866f86756: + +.. table:: **Table 1** Concepts in the model + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Concept | Description | + +===================================+======================================================================================================================================================================================================================================================================================+ + | User | A natural person who has a username and password and uses the big data cluster. | + | | | + | | There are three different users in :ref:`Figure 1 `: user A, user B, and user C. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Role | A role is a carrier of one or more permissions. Permissions are assigned to specific objects, for example, access permissions for the **/tenant** directory in HDFS. | + | | | + | | :ref:`Figure 1 ` shows four roles: **t1**, **t2**, **t3**, and **Manager_tenant**. | + | | | + | | - Roles **t1**, **t2**, and **t3** are automatically generated when tenants are created. The role names are the same as the tenant names. That is, roles **t1**, **t2**, and **t3** map to tenants **t1**, **t2**, and **t3**. Role names and tenant names need to be used in pair. | + | | - Role **Manager_tenant** is defaulted in the cluster and cannot be used separately. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant | A tenant is a resource set in a big data cluster. Multiple tenants are referred to as multi-tenancy. The resource sets further divided under a tenant are called sub-tenants. | + | | | + | | :ref:`Figure 1 ` shows three tenants: **t1**, **t2**, and **t3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Resource | - Computing resources include CPUs and memory. | + | | | + | | The computing resources of a tenant are allocated from the total computing resources in the cluster. One tenant cannot occupy the computing resources of another tenant. | + | | | + | | In :ref:`Figure 1 `, computing resources 1, 2, and 3 are allocated for tenants **t1**, **t2**, and **t3** respectively from the cluster's computing resources. 
| + | | | + | | - Storage resources include disks and third-party storage systems. | + | | | + | | The storage resources of a tenant are allocated from the total storage resources in the cluster. One tenant cannot occupy the storage resources of another tenant. | + | | | + | | In :ref:`Figure 1 `, storage resources 1, 2, and 3 are allocated for tenants **t1**, **t2**, and **t3** respectively from the cluster's storage resources. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +If a user wants to use a tenant's resources or add or delete a sub-tenant of a tenant, the user needs to be bound to both the tenant role and role **Manager_tenant**. :ref:`Table 2 ` lists the roles bound to each user in :ref:`Figure 1 `. + +.. _admin_guide_000092__tc4dc7a31593b48ab9ea2b09ea1bfc64d: + +.. table:: **Table 2** Roles bound to each user + + +-----------------------+----------------------------+--------------------------------------------------------------+ + | User | Role | Permission | + +=======================+============================+==============================================================+ + | User A | - Role **t1** | - Uses the resources of tenants **t1** and **t2**. | + | | - Role **t2** | - Adds or deletes sub-tenants of tenants **t1** and **t2**. | + | | - Role **Manager_tenant** | | + +-----------------------+----------------------------+--------------------------------------------------------------+ + | User B | - Role **t3** | - Uses the resources of tenant **t3**. | + | | - Role **Manager_tenant** | - Adds or deletes sub-tenants of tenant **t3**. | + +-----------------------+----------------------------+--------------------------------------------------------------+ + | User C | - Role **t1** | - Uses the resources of tenant **t1**. | + | | - Role **Manager_tenant** | - Adds or deletes sub-tenants of tenant **t1**. | + +-----------------------+----------------------------+--------------------------------------------------------------+ + +A user can be bound to multiple roles, and one role can also be bound to multiple users. Users are associated with tenants after being bound to the tenant roles. Therefore, tenants and users form a many-to-many relationship. One user can use the resources of multiple tenants, and multiple users can use the resources of the same tenant. For example, in :ref:`Figure 1 `, user A uses the resources of tenants **t1** and **t2**, and users A and C uses the resources of tenant **t1**. + +.. note:: + + The concepts of a parent tenant, sub-tenant, level-1 tenant, and level-2 tenant are all designed for the multi-tenant service scenarios. Pay attention to the differences these concepts and the concepts of a leaf tenant resource and non-leaf tenant resource on FusionInsight Manager. + + - Level-1 tenant: determined based on the tenant's level. For example, the first created tenant is a level-1 tenant and its sub-tenant is a level-2 tenant. + - Parent tenant and sub-tenant: indicates the hierarchical relationship between tenants. + - Non-leaf tenant resource: indicates the tenant type selected during tenant creation. This tenant type can be used to create sub-tenants. + - Leaf tenant resource: indicates the tenant type selected during tenant creation. 
This tenant type cannot be used to create sub-tenants. + +Multi-Tenant Platform +--------------------- + +Tenant is a core concept of the FusionInsight big data platform. It plays an important role in big data platforms' transformation from user-centered to multi-tenant to keep up with enterprises' multi-tenant application environments. :ref:`Figure 2 ` shows the transformation of big data platforms. + +.. _admin_guide_000092__f0b6aaf15c16f487fa23a1a04eb45754f: + +.. figure:: /_static/images/en-us_image_0263899420.png + :alt: **Figure 2** Platform transformation from user-centered to multi-tenant + + **Figure 2** Platform transformation from user-centered to multi-tenant + +On a user-centered big data platform, users can directly access and use all resources and services. + +- However, user applications may use only partial cluster resources, resulting in low resource utilization. +- The data of different users may be stored together, decreasing data security. + +On a multi-tenant big data platform, users use required resources and services by accessing the tenants. + +- Resources are allocated and scheduled based on application requirements and used based on tenants, increasing resource utilization. +- Users can access the resources of tenants only after being associated with tenant roles, enhancing access security. +- The data of tenants is isolated, ensuring data security. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/resource_overview.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/resource_overview.rst new file mode 100644 index 0000000..933bb03 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/resource_overview.rst @@ -0,0 +1,38 @@ +:original_name: admin_guide_000093.html + +.. _admin_guide_000093: + +Resource Overview +================= + +MRS cluster resources are classified into computing resources and storage resources. The multi-tenant architecture implements resource isolation. + +- **Computing resources** + + Computing resources include CPUs and memory. One tenant cannot occupy the computing resources of another tenant. + +- **Storage resources** + + Storage resources include disks and third-party storage systems. One tenant cannot access the data of another tenant. + +Computing Resources +------------------- + +Computing resources are divided into static service resources and dynamic resources. + +- **Static Service Resources** + + Static service resources are computing resources allocated to each service and are not shared between services. The total computing resources of each service are fixed. These services include Flume, HBase, HDFS, and Yarn. + +- **Dynamic Resources** + + Dynamic resources are computing resources dynamically scheduled to a job queue by the distributed resource management service Yarn. Yarn dynamically schedules resources for the job queues of MapReduce, Spark2x, Flink, and Hive. + +.. note:: + + The resources allocated to Yarn in a big data cluster are static service resources but can be dynamically allocated to job queues by Yarn. + +Storage Resources +----------------- + +Storage resources are data storage resources that can be allocated by the distributed file storage service HDFS. Directory is the basic unit of allocating HDFS storage resources. 
Tenants can obtain storage resources from the specified directories in the HDFS file system. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/storage_resources.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/storage_resources.rst new file mode 100644 index 0000000..faa72a0 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/multi-tenancy/technical_principles/storage_resources.rst @@ -0,0 +1,27 @@ +:original_name: admin_guide_000095.html + +.. _admin_guide_000095: + +Storage Resources +================= + +Overview +-------- + +As a distributed file storage service in a big data cluster, HDFS stores all the user data of the upper-layer applications in the big data cluster, including the data written to HBase tables or Hive tables. + +A directory is the basic unit of allocating HDFS storage resources. HDFS supports the conventional hierarchical file structure. Users or applications can create directories and create, delete, move, or rename files in directories. Tenants can obtain storage resources from specified directories in the HDFS file system. + +Scheduling Mechanism +-------------------- + +HDFS directories can be stored on nodes with specified labels or disks of specified hardware types. For example: + +- When both real-time query and data analysis tasks are running in the same cluster, the real-time query tasks need to be deployed only on certain nodes, and the task data must also be stored on these nodes. +- Based on actual service requirements, key data needs to be stored on highly reliable nodes. + +Administrators can flexibly configure HDFS data storage policies based on actual service requirements and data features to store data on specified nodes. + +For tenants, storage resources refer to the HDFS resources they use. Data of specified directories can be stored to the tenant-specified storage paths, thereby implementing storage resource scheduling and ensuring data isolation between tenants. + +Users can add or delete HDFS storage directories of tenants and set the file quantity quota and storage capacity quota of directories to manage storage resources. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/switching_the_scheduler.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/switching_the_scheduler.rst new file mode 100644 index 0000000..2abf082 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/switching_the_scheduler.rst @@ -0,0 +1,124 @@ +:original_name: admin_guide_000133.html + +.. _admin_guide_000133: + +Switching the Scheduler +======================= + +Scenario +-------- + +The newly installed MRS cluster uses the Superior scheduler by default. If the cluster is upgraded from an earlier version, you can switch the YARN scheduler from the Capacity scheduler to the Superior scheduler with a few clicks. + +Prerequisites +------------- + +- The network connectivity of the cluster is proper and secure, and the YARN service status is normal. +- During scheduler switching, tenants cannot be added, deleted, or modified. In addition, services cannot be started or stopped. 
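Before starting the switchover, it can be useful to confirm which scheduler the cluster is currently running. The following is a minimal sketch that queries the standard YARN ResourceManager REST API; the ResourceManager address, the web port (8088 is the open-source default and may differ in your cluster), the protocol, and the Kerberos options (**kinit** and **--negotiate**) are assumptions that you need to adapt to your environment.

.. code-block::

   # On a security-enabled cluster, obtain a Kerberos ticket first, for example: kinit <admin_user>
   # Query the scheduler information from the active ResourceManager and print the scheduler type.
   curl --negotiate -u : "http://<rm-host>:8088/ws/v1/cluster/scheduler" | grep -o '"type":"[^"]*"'
   # A cluster that still uses the Capacity scheduler typically reports "capacityScheduler" here.

If the output already indicates the Superior scheduler, the switchover described in this section is not required.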
+ +Impact on the System +-------------------- + +- Because the ResourceManager is restarted during scheduler switching, submitting jobs to YARN will fail at that time. +- During scheduler switching, tasks in a job being executed on YARN will continue, but new tasks cannot be started. +- After scheduler switching is complete, jobs executed on YARN may fail, causing service interruptions. +- After scheduler switching is complete, parameters of the Superior scheduler are used for tenant management. +- After scheduler switching is complete, tenant queues whose capacity is 0 in the Capacity scheduler cannot be allocated resources in the Superior scheduler. As a result, jobs submitted to these tenant queues fail to be executed. Therefore, you are advised not to set the capacity of a tenant queue to 0 in the Capacity scheduler. +- After scheduler switching is complete, you cannot add or delete resource pools, YARN node labels, or tenants during the observation period. If such an operation is performed, the scheduler cannot be rolled back to the Capacity scheduler. + + .. note:: + + - The recommended observation period for scheduler switching is one week. If resource pools, YARN node labels, or tenants are added or deleted during this period, the observation period ends immediately. + +- The scheduler rollback may cause the loss of partial or all YARN job information. + +Switching from the Capacity Scheduler to the Superior Scheduler +--------------------------------------------------------------- + +#. Modify YARN service parameters and ensure that the YARN service status is normal. + + a. Log in to FusionInsight Manager as an administrator. + + b. Log in to FusionInsight Manager and choose **Cluster** > **Services** > **Yarn**. Click **Configurations** then **All Configurations**, search for **yarn.resourcemanager.webapp.pagination.enable**, and check whether the value is **true**. + + - If yes, go to :ref:`1.c `. + - If no, set the parameter to **true** and click **Save** to save the configuration. On the **Dashboard** tab page of YARN, choose **More** > **Restart Service**, verify the identity, and click **OK**. After the service is restarted, go to :ref:`1.c `. + + c. .. _admin_guide_000133__l62a2981fbee6466387b51feb634bb77f: + + Choose **Cluster** > *Name of the desired cluster* > **Services**, and check whether the YARN service status is normal. + +#. Log in to the active management node as user **omm**. + +#. Switch the scheduler. + + The following switching modes are available: + + **0**: converts the Capacity scheduler configurations into the Superior scheduler configurations and then switches the Capacity scheduler to the Superior scheduler. + + **1**: converts the Capacity scheduler configurations into the Superior scheduler configurations only. + + **2**: switches the Capacity scheduler to the Superior scheduler only. + + - Mode **0** is recommended if the cluster environment is simple and the number of tenants is less than 20. + + Run the following command: + + **sh ${BIGDATA_HOME}/om-server/om/sbin/switchScheduler.sh** **-c** *Cluster ID* **-m 0** + + .. note:: + + You can choose **Cluster**, click the cluster name, and choose **Cluster Properties** on FusionInsight Manager to view the cluster ID. + + .. code-block:: + + Start to convert Capacity scheduler to Superior Scheduler, clusterId=1 + Start to convert Capacity scheduler configurations to Superior. Please wait... + Convert configurations successfully. + Start to switch the Yarn scheduler to Superior. Please wait... 
+ Switch the Yarn scheduler to Superior successfully. + + - If the cluster environment or tenant information is complex and you need to retain the queue configurations of the Capacity scheduler on the Superior scheduler, it is recommended that you use mode **1** first to convert the Capacity scheduler configurations, check the converted configurations, and then use mode **2** to switch the Capacity scheduler to the Superior scheduler. + + a. Run the following command to convert the Capacity scheduler configurations into the Superior scheduler configurations: + + **sh ${BIGDATA_HOME}/om-server/om/sbin/switchScheduler.sh -c** *Cluster ID* **-m 1** + + .. code-block:: + + Start to convert Capacity scheduler to Superior Scheduler, clusterId=1 + Start to convert Capacity scheduler configurations to Superior. Please wait... + Convert configurations successfully. + + b. Run the following command to switch the Capacity scheduler to the Superior scheduler: + + **sh ${BIGDATA_HOME}/om-server/om/sbin/switchScheduler.sh -c** *Cluster ID* **-m 2** + + .. code-block:: + + Start to convert Capacity scheduler to Superior Scheduler, clusterId=1 + Start to switch the Yarn scheduler to Superior. Please wait... + Switch the Yarn scheduler to Superior successfully. + + - If you do not need to retain the queue configurations of the Capacity scheduler, use mode **2**. + + a. Log in to FusionInsight Manager and delete all tenants except the default tenant. + + b. On FusionInsight Manager, delete all resource pools except the default resource pool. + + Run the following command to switch the Capacity scheduler to the Superior scheduler: + + **sh ${BIGDATA_HOME}/om-server/om/sbin/switchScheduler.sh -c** *Cluster ID* **-m 2** + + .. code-block:: + + Start to convert Capacity scheduler to Superior Scheduler, clusterId=1 + Start to switch the Yarn scheduler to Superior. Please wait... + Switch the Yarn scheduler to Superior successfully. + + .. note:: + + You can query the scheduler switching logs on the active management node. + + - ${BIGDATA_LOG_HOME}/controller/aos/switch_scheduler.log + - ${BIGDATA_LOG_HOME}/controller/aos/aos.log diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/adding_a_sub-tenant.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/adding_a_sub-tenant.rst new file mode 100644 index 0000000..f84e49d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/adding_a_sub-tenant.rst @@ -0,0 +1,116 @@ +:original_name: admin_guide_000119.html + +.. _admin_guide_000119: + +Adding a Sub-Tenant +=================== + +Scenario +-------- + +You can create sub-tenants on FusionInsight Manager and allocate resources of the current tenant to the sub-tenants based on the resource consumption and isolation planning and requirements of services. + +Prerequisites +------------- + +- A parent non-leaf tenant has been added. +- A tenant name has been planned based on service requirements. The name cannot be the same as that of a role, HDFS directory, or Yarn queue that exists in the current cluster. +- Resources to be allocated to the current tenant have been planned to ensure that the sum of resources of direct sub-tenants at each level does not exceed the resources of the current tenant. + +Procedure +--------- + +#. 
Log in to FusionInsight Manager and choose **Tenant Resources**. + +#. In the tenant list on the left, select a parent tenant and click |image1|. On the page for adding a sub-tenant, set attributes for the sub-tenant according to :ref:`Table 1 `. + + .. _admin_guide_000119__tc983b52ccd084798871c7fa2b49856dd: + + .. table:: **Table 1** Sub-tenant parameters + + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +========================================+==========================================================================================================================================================================================================================================================================================+ + | Cluster | Indicates the cluster to which the parent tenant belongs. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parent Tenant Resource | Indicates the name of the parent tenant. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | - Indicates the name of the current tenant. The value consists of 3 to 50 characters, including digits, letters, and underscores (_). | + | | - Plan a sub-tenant name based on service requirements. The name cannot be the same as that of a role, HDFS directory, or Yarn queue that exists in the current cluster. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant Type | Specifies whether the tenant is a leaf tenant. | + | | | + | | - When **Leaf Tenant** is selected, the current tenant is a leaf tenant and no sub-tenant can be added. | + | | - When **Non-leaf Tenant** is selected, the current tenant is not a leaf tenant and sub-tenants can be added to the current tenant. However, the tenant depth cannot exceed 5 levels. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Computing Resource | Specifies the dynamic computing resources for the current tenant. | + | | | + | | - When **Yarn** is selected, the system automatically creates a queue in Yarn and the queue is named the same as the sub-tenant name. | + | | | + | | - A leaf tenant can directly submit jobs to the queue. 
| + | | - A non-leaf tenant cannot directly submit jobs to the queue. However, Yarn adds an extra queue (hidden) named **default** for the non-leaf tenant to record the remaining resource capacity of the tenant. Actual jobs do not run in this queue. | + | | | + | | - If **Yarn** is not selected, the system does not automatically create a queue. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Capacity (%) | Indicates the percentage of computing resources used by the current tenant. The base value is the total resources of the parent tenant. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Max Capacity (%) | Indicates the maximum percentage of computing resources used by the current tenant. The base value is the total resources of the parent tenant. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Resource | Specifies storage resources for the current tenant. | + | | | + | | - When **HDFS** is selected, the system automatically creates a folder named after the sub-tenant in the HDFS parent tenant directory. | + | | - When **HDFS** is not selected, the system does not automatically allocate storage resources. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Quota | Indicates the quota for files and directories. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Space Quota | Indicates the quota for the HDFS storage space used by the current tenant. | + | | | + | | - If the unit is set to **MB**, the value ranges from **1** to **8796093022208**. If the unit is set to **GB**, the value ranges from **1** to **8589934592**. | + | | - This parameter indicates the maximum HDFS storage space that can be used by the tenant, but not the actual space used. | + | | - If its value is greater than the size of the HDFS physical disk, the maximum space available is the full space of the HDFS physical disk. | + | | - If this quota is greater than the quota of the parent tenant, the actual storage space does not exceed the quota of the parent tenant. 
| + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Path | Indicates the HDFS storage directory for the tenant. | + | | | + | | - The system automatically creates a folder named after the sub-tenant name in the directory of the parent tenant by default. For example, if the sub-tenant is **ta1s** and the parent directory is **/tenant/ta1**, the storage path for the sub-tenant is then **/tenant/ta1/ta1s**. | + | | - The storage path is customizable in the parent directory. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Indicates the description of the current tenant. | + +----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + Roles, computing resources, and storage resources are automatically created when tenants are created. + + - The new role has permissions on the computing and storage resources. This role and its permissions are automatically controlled by the system and cannot be manually managed by choosing **System** > **Permission** > **Role**. The role name is in the format of *Tenant name*\ \_\ *Cluster ID*. The ID of the first cluster is not displayed by default. + - When using this tenant, create a system user and bind the user to the role of the tenant. For details, see :ref:`Adding a User and Binding the User to a Tenant Role `. + - The sub-tenant can further allocate the resources of its parent tenant. The sum of the resource percentages of direct sub-tenants under a parent tenant at each level cannot exceed 100%. The sum of the computing resource percentages of all level-1 tenants cannot exceed 100%. + +#. Check whether the current tenant needs to be associated with resources of other services. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +#. .. _admin_guide_000119__lcdfcd36b99d84c3ba2f290f976ade15b: + + Click **Associate Service** to configure other service resources used by the current tenant. + + a. Set **Services** to **HBase**. + b. Set **Association Type** as follows: + + - **Exclusive** indicates that the service resources are used by the tenant exclusively and cannot be associated with other tenants. + - **Shared** indicates that the service resources can be shared with other tenants. + + .. note:: + + - Only HBase can be associated with a new tenant. However, HDFS, HBase, and Yarn can be associated with existing tenants. + - To associate an existing tenant with service resources, click the target tenant in the tenant list, switch to the **Service Associations** page, and click **Associate Service** to configure resources to be associated with the tenant. 
+ - To disassociate an existing tenant from service resources, click the target tenant in the tenant list, switch to the **Service Associations** page, and click **Delete** in the **Operation** column. In the displayed dialog box, select **I have read the information and understand the impact** and click **OK**. + + c. Click **OK**. + +#. .. _admin_guide_000119__l93b6a287f2a9444f9b34fcbcc1e595ac: + + Click **OK**. Wait until the system displays a message indicating that the tenant is successfully created. + +.. |image1| image:: /_static/images/en-us_image_0263899238.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/adding_a_tenant.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/adding_a_tenant.rst new file mode 100644 index 0000000..8c5fde2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/adding_a_tenant.rst @@ -0,0 +1,125 @@ +:original_name: admin_guide_000118.html + +.. _admin_guide_000118: + +Adding a Tenant +=============== + +Scenario +-------- + +You can create tenants on FusionInsight Manager based on the resource consumption and isolation planning and requirements of services. + +Prerequisites +------------- + +- A tenant name has been planned based on service requirements. The name cannot be the same as that of a role, HDFS directory, or Yarn queue that exists in the current cluster. +- Resources to be allocated to the current tenant have been planned to ensure that the sum of resources of direct sub-tenants at each level does not exceed the resources of the current tenant. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **Tenant Resources**. + +#. Click |image1|. On the page that is displayed, configure tenant attributes according to :ref:`Table 1 `. + + .. _admin_guide_000118__t41dbef6c05f84b128695138843bed278: + + .. table:: **Table 1** Tenant parameters + + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +====================================+==============================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Cluster | Indicates the cluster for which you want to create a tenant. 
| + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | - Indicates the name of the current tenant. The value consists of 3 to 50 characters, including digits, letters, and underscores (_). | + | | - Plan a tenant name based on service requirements. The name cannot be the same as that of a role, HDFS directory, or Yarn queue that exists in the current cluster. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant Resource Type | Specifies whether the tenant is a leaf tenant. | + | | | + | | - When **Leaf Tenant** **Resource** is selected, the current tenant is a leaf tenant and no sub-tenant can be added. | + | | - When **Non-leaf Tenant** **Resource** is selected, the current tenant is not a leaf tenant and sub-tenants can be added to the current tenant. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Computing Resource | Specifies the dynamic computing resources for the current tenant. | + | | | + | | - When **Yarn** is selected, the system automatically creates a queue in Yarn and the queue is named the same as the tenant name. | + | | | + | | - A leaf tenant can directly submit jobs to the queue. | + | | - A non-leaf tenant cannot directly submit jobs to the queue. However, Yarn adds an extra queue (hidden) named **default** for the non-leaf tenant to record the remaining resource capacity of the tenant. Actual jobs do not run in this queue. | + | | | + | | - If **Yarn** is not selected, the system does not automatically create a queue. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration Mode | Indicates the configuration mode of computing resource parameters. | + | | | + | | - If you select **Basic**, you only need to set **Default Resource Pool Capacity (%)**. 
| + | | - If you select **Advanced**, you can manually configure the resource allocation weight and the minimum, maximum, and reserved resources of the tenant. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Capacity (%) | Indicates the percentage of computing resources used by the current tenant in the default resource pool. The value ranges from **0** to **100%**. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Weight | Indicates the resource allocation weight. The value ranges from **0** to **100**. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Minimum Resource | Indicates the resources guaranteed for the tenant (preemption supported). The value can be a percentage or an absolute value of the parent tenant's resources. When a tenant has a light workload, the resources of the tenant are automatically allocated to other tenants. When the available tenant resources are less than the value of **Minimum Resource**, the tenant can preempt the resources that have been lent to other tenants. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Maximum Resource | Indicates the maximum resources that can be used by the tenant. The tenant cannot obtain more resources than the value configured. The value can be a percentage or an absolute value of the parent tenant's resources. 
| + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Reserved Resource | Indicates the resources reserved for the tenant. The reserved resources cannot be used by other tenants even if no job is running in the current tenant resources. The value can be a percentage or an absolute value of the parent tenant's resources. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Resource | Specifies storage resources for the current tenant. | + | | | + | | - When **HDFS** is selected, the system automatically allocates storage resources. | + | | - When **HDFS** is not selected, the system does not automatically allocate storage resources. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Quota | Indicates the quota for files and directories. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Space Quota | Indicates the quota for the HDFS storage space used by the current tenant. | + | | | + | | - If the unit is set to **MB**, the value ranges from **1** to **8796093022208**. If the unit is set to **GB**, the value ranges from **1** to **8589934592**. | + | | - This parameter indicates the maximum HDFS storage space that can be used by the tenant, but not the actual space used. | + | | - If its value is greater than the size of the HDFS physical disk, the maximum space available is the full space of the HDFS physical disk. 
| + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Path | Indicates the HDFS storage directory for the tenant. | + | | | + | | - The system automatically creates a folder named after the tenant name in the **/tenant** directory by default. For example, the default HDFS storage directory for tenant **ta1** is **/tenant/ta1**. | + | | - When a tenant is created for the first time, the system creates the **/tenant** directory in the HDFS root directory. The storage path is customizable. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Specifies whether to associate resources of other services. For details, see :ref:`4 `. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Indicates the description of the current tenant. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + Roles, computing resources, and storage resources are automatically created when tenants are created. + + - The new role has permissions on the computing and storage resources. This role and its permissions are automatically controlled by the system and cannot be manually managed by choosing **System** > **Permission** > **Role**. The role name is in the format of *Tenant name*\ \_\ *Cluster ID*. The ID of the first cluster is not displayed by default. + - When using this tenant, create a system user and bind the user to the role of the tenant. For details, see :ref:`Adding a User and Binding the User to a Tenant Role `. + - During the tenant creation, the system automatically creates a Yarn queue named after the tenant. If the queue name already exists, the new queue is named **Tenant name-**\ *N*. *N* indicates a natural number starting from **1**. When a same name exists, the value *N* increases automatically to differentiate the queue from others. 
For example, **saletenant**, **saletenant-1**, and **saletenant-2**. + +#. Check whether the current tenant needs to be associated with resources of other services. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +#. .. _admin_guide_000118__l95df8df02a794fd7adb2f27cfcb5c042: + + Click **Associate Service** to configure other service resources used by the current tenant. + + a. Set **Services** to **HBase**. + b. Set **Association Type** as follows: + + - **Exclusive** indicates that the service resources are used by the tenant exclusively and cannot be associated with other tenants. + - **Shared** indicates that the service resources can be shared with other tenants. + + .. note:: + + - Only HBase can be associated with a new tenant. However, HDFS, HBase, and Yarn can be associated with existing tenants. + - To associate an existing tenant with service resources, click the target tenant in the tenant list, switch to the **Service Associations** page, and click **Associate Service** to configure resources to be associated with the tenant. + - To disassociate an existing tenant from service resources, click the target tenant in the tenant list, switch to the **Service Associations** page, and click **Delete** in the **Operation** column. In the displayed dialog box, select **I have read the information and understand the impact** and click **OK**. + + c. Click **OK**. + +#. .. _admin_guide_000118__lea52c6efc12849b4aca946b1c510728d: + + Click **OK**. Wait until the system displays a message indicating that the tenant is successfully created. + +.. |image1| image:: /_static/images/en-us_image_0263899570.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/adding_a_user_and_binding_the_user_to_a_tenant_role.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/adding_a_user_and_binding_the_user_to_a_tenant_role.rst new file mode 100644 index 0000000..670076d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/adding_a_user_and_binding_the_user_to_a_tenant_role.rst @@ -0,0 +1,68 @@ +:original_name: admin_guide_000120.html + +.. _admin_guide_000120: + +Adding a User and Binding the User to a Tenant Role +=================================================== + +Scenario +-------- + +A newly created tenant cannot directly log in to the cluster to access resources. You need to add a user for the tenant on FusionInsight Manager and bind the user to the role of the tenant to assign operation permissions to the user. + +Prerequisites +------------- + +You have clarified service requirements and created a tenant. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **System** > **Permission** > **User**. + +#. If you want to add a user to the system, click **Create**. + + If you want to bind tenant roles to an existing user in the system, locate the row of the user and click **Modify** in the **Operation** column. + + Set user attributes according to :ref:`Table 1 `. + + .. _admin_guide_000120__t2b6451d372c44135bf8473b6b2dc0fd4: + + .. 
table:: **Table 1** User parameters + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+============================================================================================================================================================================================================================================+ + | Username | Specifies the current user name. The value can contain 3 to 32 characters, including digits, letters, underscores (_), hyphens (-), and spaces. | + | | | + | | - The username cannot be the same as the OS username of any node in the cluster. Otherwise, the user cannot be used. | + | | - A username that differs only in alphabetic case from an existing username is not allowed. For example, if **User1** has been created, you cannot create **user1**. Enter the correct username when using **User1**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | User Type | The options are **Human-Machine** and **Machine-Machine**. | + | | | + | | - **Human-Machine** user: used for FusionInsight Manager O&M and component client operations. If you select this option, set both **Password** and **Confirm Password** accordingly. | + | | - **Machine-Machine** user: used for application development. If you select this option, the password is randomly generated. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Password | This parameter is mandatory if **User Type** is set to **Human-Machine**. | + | | | + | | The password must contain 8 to 64 characters of at least four types of the following: uppercase letters, lowercase letters, digits, special characters, and spaces. The password cannot be the username or the username spelled backwards. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Confirm Password | Enter the password again. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | User Group | In the **User Group** area, click **Add** and select user groups to add the user to the groups. | + | | | + | | - If roles have been added to the user groups, the user can be granted the permissions of the roles. | + | | - For example, add the user to the Hive user group to assign Hive permissions to the user. 
| + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Primary Group | Select a group as the primary group for the user to create directories and files. The drop-down list contains all groups selected in **User Group**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Role | Click **Add** to bind a tenant role to the user. | + | | | + | | .. note:: | + | | | + | | If a user wants to use the resources of tenant **tenant1** and to add or delete sub-tenants for **tenant1**, the user must be bound to both the **Manager_tenant** and **tenant1\_**\ *Cluster ID* roles. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Indicates the description of the current user. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/index.rst new file mode 100644 index 0000000..0ad4363 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/creating_tenants/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000117.html + +.. _admin_guide_000117: + +Creating Tenants +================ + +- :ref:`Adding a Tenant ` +- :ref:`Adding a Sub-Tenant ` +- :ref:`Adding a User and Binding the User to a Tenant Role ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + adding_a_tenant + adding_a_sub-tenant + adding_a_user_and_binding_the_user_to_a_tenant_role diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/index.rst new file mode 100644 index 0000000..25b1961 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000116.html + +.. _admin_guide_000116: + +Using the Capacity Scheduler +============================ + +- :ref:`Creating Tenants ` +- :ref:`Managing Tenants ` +- :ref:`Managing Resources ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + creating_tenants/index + managing_tenants/index + managing_resources/index diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/adding_a_resource_pool.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/adding_a_resource_pool.rst new file mode 100644 index 0000000..b68344c --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/adding_a_resource_pool.rst @@ -0,0 +1,39 @@ +:original_name: admin_guide_000127.html + +.. _admin_guide_000127: + +Adding a Resource Pool +====================== + +Scenario +-------- + +In a cluster, you can logically group Yarn NodeManagers into Yarn resource pools. Each NodeManager belongs to only one resource pool. You can create a custom resource pool on FusionInsight Manager and add the hosts that have not been added to any custom resource pools to this resource pool so that specified queues can use the computing resources provided by these hosts. + +The system contains a **default** resource pool by default. All NodeManagers that are not added to custom resource pools belong to this resource pool. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Tenant Resources** > **Resource Pool**. + +#. Click **Add Resource Pool**. + +#. Set resource pool attributes. + + - **Cluster**: Select the cluster to which the resource pool is to be added. + - **Name**: Enter the name of the resource pool. The name contains 1 to 50 characters, including digits, letters, and underscores (_), and cannot start with an underscore (_). + - **Resource Label**: Enter the resource label of the resource pool. The value can contain 1 to 50 characters, including digits, letters, underscores (_), and hyphens (-), and must start with a digit or letter. + - **Resource**: In the **Available Hosts** area, select specified hosts and click |image1| to add the hosts to the **Selected Hosts** area. Only hosts in the cluster can be selected. The host list in the resource pool can be left blank. + + .. note:: + + You can filter hosts by host name, number of CPU cores, memory, operating system, or platform type based on service requirements. + +#. Click **OK**. + + After the resource pool is created, you can view its name, members, and mode in the resource pool list. Hosts that are added to the custom resource pool are no longer members of the **default** resource pool. + +.. |image1| image:: /_static/images/en-us_image_0263899376.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/clearing_queue_configurations.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/clearing_queue_configurations.rst new file mode 100644 index 0000000..ba26636 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/clearing_queue_configurations.rst @@ -0,0 +1,25 @@ +:original_name: admin_guide_000132.html + +.. 
_admin_guide_000132: + +Clearing Queue Configurations +============================= + +Scenario +-------- + +You can clear the configurations of a queue on FusionInsight Manager when the queue no longer needs the resources of a resource pool or when the resource pool needs to be disassociated from the queue. Clearing queue configurations cancels the resource capacity policy of the queue in the resource pool. + +Prerequisites +------------- + +You have changed the default resource pool of the queue to another one. If a queue is to be disassociated from a resource pool, this resource pool cannot serve as the default resource pool of the queue. For details, see :ref:`Configuring a Queue <admin_guide_000130>`. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **Tenant Resources** > **Dynamic Resource Plan**. +#. Select the name of the target cluster from **Cluster** and select a resource pool from **Resource Pool**. +#. Locate the row that contains the target resource name in the **Resource Allocation** area, and click **Clear** in the **Operation** column. +#. In the displayed dialog box, click **OK** to clear the queue configurations from the current resource pool. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/configuring_a_queue.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/configuring_a_queue.rst new file mode 100644 index 0000000..a3df97d --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/configuring_a_queue.rst @@ -0,0 +1,62 @@ +:original_name: admin_guide_000130.html + +.. _admin_guide_000130: + +Configuring a Queue +=================== + +Scenario +-------- + +You can modify the queue configurations for a specified tenant on FusionInsight Manager. + +Prerequisites +------------- + +A tenant that uses the Capacity scheduler has been added. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Tenant Resources** > **Dynamic Resource Plan**. + + The **Resource Distribution Policy** page is displayed by default. + +#. Click the **Queue Configurations** tab. + +#. Set **Cluster** to the name of the target cluster. In the **All tenants resources** area, locate the row that contains the target tenant resource and click **Modify** in the **Operation** column. + + .. note:: + + - You can also access the **Modify Queue Configuration** page as follows: In the tenant list on the **Tenant Resources Management** page, click the target tenant, click the **Resource** tab, and click |image1| next to **Queue Configurations (**\ *Queue name*\ **)**. + - A queue can be bound to only one non-default resource pool. That is, a newly added resource pool can be bound to only one queue to serve as the default resource pool of the queue. + + .. 
table:: **Table 1** Queue configuration parameters + + +-----------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===============================================+=========================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Tenant Resources Name (Queue) | Indicates the tenant name and queue name. | + +-----------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Maximum Applications | Indicates the maximum number of applications. | + +-----------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Maximum AM Resource Percent | Indicates the maximum percentage of resources that can be used to run the ApplicationMaster in a cluster. | + +-----------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Minimum User Resource Upper-Limit Percent (%) | Indicates the minimum resource guarantee (percentage) of a user. The resources for each user in a queue are limited at any time. If applications of multiple users are running at the same time in a queue, the resource usage of each user fluctuates between the minimum value and the maximum value. The minimum value is determined by the number of running applications, while the maximum value is determined by this parameter. | + | | | + | | For example, assume that this parameter is set to **25**. 
If two users submit applications to the queue, each user can use a maximum of 50% resources; if three users submit applications to the queue, each user can use a maximum of 33% resources; if four users submit applications to the queue, each user can use a maximum of 25% resources. | + +-----------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | User Resource Upper-Limit Factor | Indicates the limit factor of the maximum user resource usage. The maximum user resource usage percentage can be obtained by multiplying the limit factor with the percentage of the tenant's actual resource usage in the cluster. | + +-----------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Indicates the current status of a resource plan. The value can be **Running** or **Stopped**. | + +-----------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool | Indicates the resource pool used by the queue. The default value is **default**. | + | | | + | | If you want to change the resource pool, configure the queue capacity first. For details, see :ref:`Configuring the Queue Capacity Policy of a Resource Pool `. | + +-----------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**. + +.. 
|image1| image:: /_static/images/en-us_image_0000001369765657.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/configuring_the_queue_capacity_policy_of_a_resource_pool.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/configuring_the_queue_capacity_policy_of_a_resource_pool.rst new file mode 100644 index 0000000..f602f7f --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/configuring_the_queue_capacity_policy_of_a_resource_pool.rst @@ -0,0 +1,42 @@ +:original_name: admin_guide_000131.html + +.. _admin_guide_000131: + +Configuring the Queue Capacity Policy of a Resource Pool +======================================================== + +Scenario +-------- + +After a resource pool is added, you can configure the capacity policy of available resources for Yarn queues so that jobs in the queues can be properly executed in the resource pool. A queue can have the queue capacity policy of only one resource pool. + +You can view queues and configure queue capacity policies in any resource pool. After the queue policies are configured, Yarn queues are associated with resource pools. + +Prerequisites +------------- + +A queue has been added, that is, a tenant associated with computing resources has been created. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Tenant Resources** > **Dynamic Resource Plan**. + + The **Resource Distribution Policy** page is displayed by default. + +#. Select the name of the target cluster from **Cluster** and select a resource pool from **Resource Pool**. + +#. Locate the row that contains the target resource name in the **Resource Allocation** area, and click **Modify** in the **Operation** column. + +#. In the **Modify Resource Allocation** window, configure the resource capacity policy of the queue in the resource pool. + + - **Capacity (%)**: indicates the percentage of computing resources used by the current tenant. + - **Maximum Capacity (%)**: indicates the maximum percentage of computing resources used by the current tenant. + +#. Click **OK**. + + .. note:: + + After the resource capacity values of a queue are deleted and saved, the resource capacity policy of the queue in the resource pool is canceled, indicating that the queue is disassociated from the resource pool. To achieve this, you need to change the default resource pool of the queue to another one. For details, see :ref:`Configuring a Queue `. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/deleting_a_resource_pool.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/deleting_a_resource_pool.rst new file mode 100644 index 0000000..bc812c8 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/deleting_a_resource_pool.rst @@ -0,0 +1,25 @@ +:original_name: admin_guide_000129.html + +.. _admin_guide_000129: + +Deleting a Resource Pool +======================== + +Scenario +-------- + +If a resource pool is no longer used based on service requirements, you can delete it on FusionInsight Manager. 
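+ + .. note:: + + Before deleting a resource pool, you can optionally check from a YARN client whether the pool is still visible to YARN. The following commands are only an illustrative sketch: they assume that the pool's **Resource Label** is registered as a YARN node label and that the client is installed in **/opt/client**; replace these values with the actual ones for your cluster. + + .. code-block:: + + source /opt/client/bigdata_env + yarn cluster --list-node-labels + + The command lists the node labels currently known to YARN. If the label of the resource pool no longer appears in the output after the pool is deleted, the deletion has taken effect on the YARN side.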
+ +Prerequisites +------------- + +- Any queue in the cluster does not use the resource pool to be deleted as the default resource pool. Before deleting the resource pool, cancel the default resource pool. For details, see :ref:`Configuring a Queue `. +- Resource distribution policies of all queues have been cleared from the resource pool to be deleted. For details, see :ref:`Clearing Queue Configurations `. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **Tenant Resources** > **Resource Pool**. +#. Locate the row that contains the specified resource pool, and click **Delete** in the **Operation** column. +#. In the displayed dialog box, click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/index.rst new file mode 100644 index 0000000..b6a10ea --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/index.rst @@ -0,0 +1,24 @@ +:original_name: admin_guide_000126.html + +.. _admin_guide_000126: + +Managing Resources +================== + +- :ref:`Adding a Resource Pool ` +- :ref:`Modifying a Resource Pool ` +- :ref:`Deleting a Resource Pool ` +- :ref:`Configuring a Queue ` +- :ref:`Configuring the Queue Capacity Policy of a Resource Pool ` +- :ref:`Clearing Queue Configurations ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + adding_a_resource_pool + modifying_a_resource_pool + deleting_a_resource_pool + configuring_a_queue + configuring_the_queue_capacity_policy_of_a_resource_pool + clearing_queue_configurations diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/modifying_a_resource_pool.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/modifying_a_resource_pool.rst new file mode 100644 index 0000000..ad77747 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_resources/modifying_a_resource_pool.rst @@ -0,0 +1,27 @@ +:original_name: admin_guide_000128.html + +.. _admin_guide_000128: + +Modifying a Resource Pool +========================= + +Scenario +-------- + +When hosts in a resource pool need to be adjusted based on service requirements, you can modify members in the resource pool on FusionInsight Manager. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **Tenant Resources** > **Resource Pool**. +#. Locate the row that contains the specified resource pool, and click **Edit** in the **Operation** column. +#. In the **Resource** area, modify hosts. + + - Adding hosts: Select desired hosts in **Available Hosts** and click |image1| to add them to the resource pool. + - Deleting hosts: Select desired hosts in **Selected Hosts** and click |image2| to remove them from the resource pool. The host list in the resource pool can be left blank. + +#. Click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0263899635.png +.. 
|image2| image:: /_static/images/en-us_image_0263899280.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/clearing_non-associated_queues_of_a_tenant.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/clearing_non-associated_queues_of_a_tenant.rst new file mode 100644 index 0000000..0098778 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/clearing_non-associated_queues_of_a_tenant.rst @@ -0,0 +1,57 @@ +:original_name: admin_guide_000125.html + +.. _admin_guide_000125: + +Clearing Non-associated Queues of a Tenant +========================================== + +Scenario +-------- + +If Yarn uses the Capacity scheduler, deleting a tenant only sets the queue capacity of the tenant to **0** and the tenant status to **STOPPED** but does not clear the queues of the tenant in Yarn. Limited by the Yarn mechanism, queues cannot be dynamically deleted. You can run commands to manually delete residual queues. + +Impact on the System +-------------------- + +- During the script execution, the Controller service is restarted, Yarn configurations are synchronized, and the active and standby ResourceManagers are restarted. +- FusionInsight Manager becomes inaccessible during the restart of the Controller service. +- After the active and standby ResourceManagers are restarted, an alarm is generated indicating that Yarn and components that depend on Yarn are temporarily unavailable. + +Prerequisites +------------- + +Queues of a deleted tenant still exist. + +Procedure +--------- + +#. Check that queues of the deleted tenant still exist. + + a. On FusionInsight Manager, choose **Cluster**, click the name of the target cluster, and choose **Services** > **Yarn**. Click the link of the active ResourceManager in **ResourceManager WebUI** to go to the ResourceManager web UI. + b. Click **Scheduler** in the navigation tree on the left. In the right pane, you can view that queues of the tenant still exist in the **STOPPED** state and their **Configured Capacity** is **0**. + +#. Log in to the active management node as user **omm**. + +#. Switch the directory and execute the **cleanQueuesAndRestartRM.sh** script. + + **cd ${BIGDATA_HOME}/om-server/om/sbin** + + **./cleanQueuesAndRestartRM.sh** **-c** *Cluster ID* + + .. note:: + + You can choose **Cluster**, click the cluster name, and choose **Cluster Properties** on FusionInsight Manager to view the cluster ID. + + During the script execution, you need to enter **yes** and the password. + + .. code-block:: + + Running the script will restart Controller and restart ResourceManager. + Are you sure you want to continue connecting (yes/no)?yes + Please input admin password: + Begin to backup queues ... + ... + +#. After the script is executed successfully, log in to FusionInsight Manager, choose **Cluster**, click the cluster name, and choose **Services** > **Yarn**. Click the link of the active ResourceManager in **ResourceManager WebUI** to go to the ResourceManager web UI. + +#. Click **Scheduler** in the navigation tree on the left. In the right pane, you can view that queues of the tenant have been cleared. 
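+ + .. note:: + + If a YARN client is installed, you can also verify from the command line that the residual queue has been removed. The following commands are only an example: **/opt/client** is an assumed client installation directory and **ta1** is an assumed tenant queue name; replace them with the actual values. In a security-mode cluster, authenticate with **kinit** as a service user first. + + .. code-block:: + + source /opt/client/bigdata_env + kinit <Service user> + yarn queue -status ta1 + + If the queue has been cleared, the command reports that the queue does not exist; otherwise, the queue state and capacity information are displayed.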
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/deleting_a_tenant.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/deleting_a_tenant.rst new file mode 100644 index 0000000..91e0d00 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/deleting_a_tenant.rst @@ -0,0 +1,40 @@ +:original_name: admin_guide_000124.html + +.. _admin_guide_000124: + +Deleting a Tenant +================= + +Scenario +-------- + +You can delete tenants that are no longer used on FusionInsight Manager based on service requirements to release resources occupied by the tenants. + +Prerequisites +------------- + +- A tenant has been added. +- The tenant has no sub-tenants. If the tenant has sub-tenants, delete them; otherwise, the tenant cannot be deleted. +- The role of the tenant is not associated with any user or user group. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **Tenant Resources**. + +#. In the tenant list on the left, click the target tenant and click |image1|. + + .. note:: + + - If you want to retain the tenant data, select **Reserve the data of this tenant resource**. Otherwise, the storage space of the tenant will be deleted. + - To delete a tenant without retaining the tenant data as a user who does not belong to the supergroup, you should first log in to the HDFS client as a user who belongs to the supergroup and then manually clear the storage space of that tenant to avoid residual data. + +#. Click **OK**. + + It takes a few minutes to save the configuration. After the tenant is deleted, the role and storage space of the tenant are also deleted. + + .. note:: + + After the tenant is deleted, the queue of the tenant still exists in Yarn. The queue of the tenant is not displayed on the role management page in Yarn. + +.. |image1| image:: /_static/images/en-us_image_0000001376041769.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/index.rst new file mode 100644 index 0000000..7b84969 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/index.rst @@ -0,0 +1,20 @@ +:original_name: admin_guide_000121.html + +.. _admin_guide_000121: + +Managing Tenants +================ + +- :ref:`Managing Tenant Directories ` +- :ref:`Restoring Tenant Data ` +- :ref:`Deleting a Tenant ` +- :ref:`Clearing Non-associated Queues of a Tenant ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + managing_tenant_directories + restoring_tenant_data + deleting_a_tenant + clearing_non-associated_queues_of_a_tenant diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/managing_tenant_directories.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/managing_tenant_directories.rst new file mode 100644 index 0000000..9518ad8 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/managing_tenant_directories.rst @@ -0,0 +1,92 @@ +:original_name: admin_guide_000122.html + +.. _admin_guide_000122: + +Managing Tenant Directories +=========================== + +Scenario +-------- + +You can manage the HDFS storage directories used by specified tenants based on service requirements on FusionInsight Manager, such as adding tenant directories, changing the quotas for directories and files and for storage space, and deleting directories. + +Prerequisites +------------- + +A tenant with HDFS storage resources has been added. + +Viewing a Tenant Directory +-------------------------- + +#. On FusionInsight Manager, choose **Tenant Resources**. +#. In the tenant list on the left, click the target tenant. +#. Click the **Resource** tab. +#. View the **HDFS Storage** table. + + - The **File Number Threshold** column provides the quota for files and directories of the tenant directory. + - The **Space Quota** column provides the storage space size of the tenant directory. + +Adding a Tenant Directory +------------------------- + +#. On FusionInsight Manager, choose **Tenant Resources**. +#. In the tenant list on the left, click the target tenant. +#. Click the **Resource** tab. +#. In the **HDFS Storage** area, click **Create Directory**. + + - **Parent Directory**: indicates the storage directory used by the parent tenant of the current tenant. + + .. note:: + + This parameter is not displayed if the current tenant is not a sub-tenant. + + - Set **Path** to a tenant directory path. + + .. note:: + + If the current tenant is not a sub-tenant, the new path is created in the HDFS root directory. + + - Set **Quota** to the quota for files and directories. + - **File Number Threshold (%)** is valid only when **Quota** is set. If the ratio of the number of used files to the value of **Quota** exceeds the value of this parameter, an alarm is generated. If this parameter is not specified, no alarm is reported in this scenario. + + .. note:: + + The number of used files is collected every hour. Therefore, the alarm indicating that the ratio of used files exceeds the threshold is delayed. + + - Set **Space Quota** to the storage space size of the tenant directory. + - If the ratio of used storage space to the value of **Space Quota** exceeds the **Storage Space Threshold (%)** value, an alarm is generated. If this parameter is not specified, no alarm is reported in this scenario. + + .. note:: + + The used storage space is collected every hour. Therefore, the alarm indicating that the ratio of used storage space exceeds the threshold is delayed. + +#. Click **OK**. + +Modifying a Tenant Directory +---------------------------- + +#. On FusionInsight Manager, choose **Tenant Resources**. +#. In the tenant list on the left, click the target tenant. +#. Click the **Resource** tab. +#. 
In the **HDFS Storage** table, click **Modify** in the **Operation** column of the specified tenant directory. + + - Set **Quota** to the quota for files and directories. + - **File Number Threshold (%)** is valid only when **Quota** is set. If the ratio of the number of used files to the value of **Quota** exceeds the value of this parameter, an alarm is generated. If this parameter is not specified, no alarm is reported in this scenario. + - Set **Space Quota** to the storage space size of the tenant directory. + - If the ratio of used storage space to the value of **Space Quota** exceeds the **Storage Space Threshold (%)** value, an alarm is generated. If this parameter is not specified, no alarm is reported in this scenario. + +#. Click **OK**. + +Deleting a Tenant Directory +--------------------------- + +#. On FusionInsight Manager, choose **Tenant Resources**. +#. In the tenant list on the left, click the target tenant. +#. Click the **Resource** tab. +#. In the **HDFS Storage** table, click **Delete** in the **Operation** column of the specified tenant directory. + + .. note:: + + The tenant directory that is created by the system during tenant creation cannot be deleted. + +#. Click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/restoring_tenant_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/restoring_tenant_data.rst new file mode 100644 index 0000000..35023f5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_capacity_scheduler/managing_tenants/restoring_tenant_data.rst @@ -0,0 +1,33 @@ +:original_name: admin_guide_000123.html + +.. _admin_guide_000123: + +Restoring Tenant Data +===================== + +Scenario +-------- + +Tenant data is stored on FusionInsight Manager and cluster components. When components are recovered from failures or reinstalled, some configuration data of all tenants may become abnormal. In this case, you need to manually restore the configuration data on FusionInsight Manager. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **Tenant Resources**. + +#. In the tenant list on the left, click the target tenant. + +#. Check the tenant data status. + + a. On the **Summary** page, check **Tenant Status**. A green icon indicates that the tenant is available and gray indicates that the tenant is unavailable. + b. Click **Resource** and check the icons on the left of **Yarn** and **HDFS Storage**. A green icon indicates that the resource is available, and gray indicates that the resource is unavailable. + c. Click **Service Associations** and check the **Status** column of the associated services. **Normal** indicates that the component can provide services for the associated tenant. **Not Available** indicates that the component cannot provide services for the tenant. + d. If any of the preceding check items is abnormal, go to :ref:`4 ` to restore tenant data. + +#. .. _admin_guide_000123__l62f85b027a17495484c1162c5dd730f1: + + Click |image1|. In the displayed dialog box, enter the password of the current login user and click **OK**. + +#. In the **Restore Tenant Resource Data** window, select one or more components to restore data, and click **OK**. The system automatically restores the tenant data. + +.. 
|image1| image:: /_static/images/en-us_image_0263899446.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/adding_a_sub-tenant.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/adding_a_sub-tenant.rst new file mode 100644 index 0000000..a2ce2fc --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/adding_a_sub-tenant.rst @@ -0,0 +1,129 @@ +:original_name: admin_guide_000102.html + +.. _admin_guide_000102: + +Adding a Sub-Tenant +=================== + +Scenario +-------- + +You can create sub-tenants on FusionInsight Manager and allocate resources of the current tenant to the sub-tenants based on the resource consumption and isolation planning and requirements of services. + +Prerequisites +------------- + +- A parent non-leaf tenant has been added. +- A sub-tenant name has been planned based on service requirements. The name cannot be the same as that of a role, HDFS directory, or Yarn queue that exists in the current cluster. +- Resources to be allocated to the current tenant have been planned to ensure that the sum of resources of direct sub-tenants at each level does not exceed the resources of the current tenant. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **Tenant Resources**. + +#. In the tenant list on the left, select a parent tenant and click |image1|. On the page for adding a sub-tenant, set attributes for the sub-tenant according to :ref:`Table 1 `. + + .. _admin_guide_000102__en-us_topic_0165590323_tc983b52ccd084798871c7fa2b49856dd: + + .. table:: **Table 1** Sub-tenant parameters + + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +====================================+==============================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Cluster | Indicates the cluster to which the parent tenant belongs. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parent Tenant Resource | Indicates the name of the parent tenant. 
| + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | - Indicates the name of the current tenant. The value consists of 3 to 50 characters, including digits, letters, and underscores (_). | + | | - Plan a sub-tenant name based on service requirements. The name cannot be the same as that of a role, HDFS directory, or Yarn queue that exists in the current cluster. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant Resource Type | Specifies whether the tenant is a leaf tenant. | + | | | + | | - When **Leaf Tenant** **Resource** is selected, the current tenant is a leaf tenant and no sub-tenant can be added. | + | | - When **Non-leaf Tenant** **Resource** is selected, the current tenant is not a leaf tenant and sub-tenants can be added to the current tenant. However, the tenant depth cannot exceed 5 levels. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Computing Resource | Specifies the dynamic computing resources for the current tenant. | + | | | + | | - When **Yarn** is selected, the system automatically creates a queue in Yarn and the queue is named the same as the sub-tenant name. | + | | | + | | - A leaf tenant can directly submit jobs to the queue. | + | | - A non-leaf tenant cannot directly submit jobs to the queue. However, Yarn adds an extra queue (hidden) named **default** for the non-leaf tenant to record the remaining resource capacity of the tenant. Actual jobs do not run in this queue. | + | | | + | | - If **Yarn** is not selected, the system does not automatically create a queue. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration Mode | Indicates the configuration mode of computing resource parameters. | + | | | + | | - If you select **Basic**, you only need to set **Default Resource Pool Capacity (%)**. 
| + | | - If you select **Advanced**, you can manually configure the resource allocation weight and the minimum, maximum, and reserved resources of the tenant. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Capacity (%) | Indicates the percentage of computing resources used by the current tenant. The base value is the total resources of the parent tenant. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Weight | Indicates the resource allocation weight. The value ranges from **0** to **100**. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Minimum Resource | Indicates the resources guaranteed for the tenant (preemption supported). The value can be a percentage or an absolute value of the parent tenant's resources. When a tenant has a light workload, the resources of the tenant are automatically allocated to other tenants. When the available tenant resources are less than the value of **Minimum Resource**, the tenant can preempt the resources that have been lent to other tenants. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Maximum Resource | Indicates the maximum resources that can be used by the tenant. The tenant cannot obtain more resources than the value configured. The value can be a percentage or an absolute value of the parent tenant's resources. 
| + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Reserved Resource | Indicates the resources reserved for the tenant. The reserved resources cannot be used by other tenants even if no job is running in the current tenant resources. The value can be a percentage or an absolute value of the parent tenant's resources. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Resource | Specifies storage resources for the current tenant. | + | | | + | | - When **HDFS** is selected, the system automatically creates a folder named after the sub-tenant in the HDFS parent tenant directory. | + | | - When **HDFS** is not selected, the system does not automatically allocate storage resources. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Quota | Indicates the quota for files and directories. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Space Quota | Indicates the quota for the HDFS storage space used by the current tenant. | + | | | + | | - If the unit is set to **MB**, the value ranges from **1** to **8796093022208**. If the unit is set to **GB**, the value ranges from **1** to **8589934592**. | + | | - This parameter indicates the maximum HDFS storage space that can be used by the tenant, but not the actual space used. | + | | - If its value is greater than the size of the HDFS physical disk, the maximum space available is the full space of the HDFS physical disk. | + | | - If this quota is greater than the quota of the parent tenant, the actual storage space does not exceed the quota of the parent tenant. 
| + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Path | Indicates the HDFS storage directory for the tenant. | + | | | + | | - The system automatically creates a folder named after the sub-tenant name in the directory of the parent tenant by default. For example, if the sub-tenant is **ta1s** and the parent directory is **/tenant/ta1**, the storage path for the sub-tenant is then **/tenant/ta1/ta1s**. | + | | - The storage path is customizable in the parent directory. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Specifies whether to associate resources of other services. For details, see :ref:`4 `. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Indicates the description of the current tenant. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + Roles, computing resources, and storage resources are automatically created when tenants are created. + + - The new role has permissions on the computing and storage resources. This role and its permissions are automatically controlled by the system and cannot be manually managed by choosing **System** > **Permission** > **Role**. The role name is in the format of *Tenant name*\ \_\ *Cluster ID*. The ID of the first cluster is not displayed by default. + - When using this tenant, create a system user and bind the user to the role of the tenant. For details, see :ref:`Adding a User and Binding the User to a Tenant Role `. + - The sub-tenant can further allocate the resources of its parent tenant. The sum of the resource percentages of direct sub-tenants under a parent tenant at each level cannot exceed 100%. The sum of the computing resource percentages of all level-1 tenants cannot exceed 100%. + +#. Check whether the current tenant needs to be associated with resources of other services. + + - If yes, go to :ref:`4 `. 
+ - If no, go to :ref:`5 `. + +#. .. _admin_guide_000102__en-us_topic_0165590323_lcdfcd36b99d84c3ba2f290f976ade15b: + + Click **Associate Service** to configure other service resources used by the current tenant. + + a. Set **Services** to **HBase**. + b. Set **Association Type** as follows: + + - **Exclusive** indicates that the service resources are used by the tenant exclusively and cannot be associated with other tenants. + - **Shared** indicates that the service resources can be shared with other tenants. + + .. note:: + + - Only HBase can be associated with a new tenant. However, HDFS, HBase, and Yarn can be associated with existing tenants. + - To associate an existing tenant with service resources, click the target tenant in the tenant list, switch to the **Service Associations** page, and click **Associate Service** to configure resources to be associated with the tenant. + - To disassociate an existing tenant from service resources, click the target tenant in the tenant list, switch to the **Service Associations** page, and click **Delete** in the **Operation** column. In the displayed dialog box, select **I have read the information and understand the impact** and click **OK**. + + c. Click **OK**. + +#. .. _admin_guide_000102__en-us_topic_0165590323_l93b6a287f2a9444f9b34fcbcc1e595ac: + + Click **OK**. Wait until the system displays a message indicating that the tenant is successfully created. + +.. |image1| image:: /_static/images/en-us_image_0263899322.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/adding_a_tenant.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/adding_a_tenant.rst new file mode 100644 index 0000000..ae5cbc7 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/adding_a_tenant.rst @@ -0,0 +1,124 @@ +:original_name: admin_guide_000101.html + +.. _admin_guide_000101: + +Adding a Tenant +=============== + +Scenario +-------- + +You can create tenants on FusionInsight Manager based on the resource consumption and isolation planning and requirements of services. + +Prerequisites +------------- + +- A tenant name has been planned based on service requirements. The name cannot be the same as that of a role, HDFS directory, or Yarn queue that exists in the current cluster. +- Resources to be allocated to the current tenant have been planned to ensure that the sum of resources of direct sub-tenants at each level does not exceed the resources of the current tenant. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **Tenant Resources**. + +#. Click |image1|. On the page that is displayed, configure tenant attributes according to :ref:`Table 1 `. + + .. _admin_guide_000101__en-us_topic_0165590104_t41dbef6c05f84b128695138843bed278: + + .. 
table:: **Table 1** Tenant parameters + + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +====================================+==============================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Cluster | Indicates the cluster for which you want to create a tenant. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | - Indicates the name of the current tenant. The value consists of 3 to 50 characters, including digits, letters, and underscores (_). | + | | - Plan a tenant name based on service requirements. The name cannot be the same as that of a role, HDFS directory, or Yarn queue that exists in the current cluster. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant Resource Type | Specifies whether the tenant is a leaf tenant. | + | | | + | | - When **Leaf Tenant** **Resource** is selected, the current tenant is a leaf tenant and no sub-tenant can be added. | + | | - When **Non-leaf Tenant** **Resource** is selected, the current tenant is not a leaf tenant and sub-tenants can be added to the current tenant. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Computing Resource | Specifies the dynamic computing resources for the current tenant. | + | | | + | | - When **Yarn** is selected, the system automatically creates a queue in Yarn and the queue is named the same as the tenant name. | + | | | + | | - A leaf tenant can directly submit jobs to the queue. 
| + | | - A non-leaf tenant cannot directly submit jobs to the queue. However, Yarn adds an extra queue (hidden) named **default** for the non-leaf tenant to record the remaining resource capacity of the tenant. Actual jobs do not run in this queue. | + | | | + | | - If **Yarn** is not selected, the system does not automatically create a queue. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration Mode | Indicates the configuration mode of computing resource parameters. | + | | | + | | - If you select **Basic**, you only need to set **Default Resource Pool Capacity (%)**. | + | | - If you select **Advanced**, you can manually configure the resource allocation weight and the minimum, maximum, and reserved resources of the tenant. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Capacity (%) | Indicates the percentage of computing resources used by the current tenant in the default resource pool. The value ranges from **0** to **100%**. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Weight | Indicates the resource allocation weight. The value ranges from **0** to **100**. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Minimum Resource | Indicates the resources guaranteed for the tenant (preemption supported). The value can be a percentage or an absolute value of the parent tenant's resources. When a tenant has a light workload, the resources of the tenant are automatically allocated to other tenants. When the available tenant resources are less than the value of **Minimum Resource**, the tenant can preempt the resources that have been lent to other tenants. 
| + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Maximum Resource | Indicates the maximum resources that can be used by the tenant. The tenant cannot obtain more resources than the value configured. The value can be a percentage or an absolute value of the parent tenant's resources. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Reserved Resource | Indicates the resources reserved for the tenant. The reserved resources cannot be used by other tenants even if no job is running in the current tenant resources. The value can be a percentage or an absolute value of the parent tenant's resources. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Resource | Specifies storage resources for the current tenant. | + | | | + | | - When **HDFS** is selected, the system automatically allocates storage resources. | + | | - When **HDFS** is not selected, the system does not automatically allocate storage resources. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Quota | Indicates the quota for files and directories. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Space Quota | Indicates the quota for the HDFS storage space used by the current tenant. | + | | | + | | - If the unit is set to **MB**, the value ranges from **1** to **8796093022208**. If the unit is set to **GB**, the value ranges from **1** to **8589934592**. 
| + | | - This parameter indicates the maximum HDFS storage space that can be used by the tenant, but not the actual space used. | + | | - If its value is greater than the size of the HDFS physical disk, the maximum space available is the full space of the HDFS physical disk. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Path | Indicates the HDFS storage directory for the tenant. | + | | | + | | - The system automatically creates a folder named after the tenant name in the **/tenant** directory by default. For example, the default HDFS storage directory for tenant **ta1** is **/tenant/ta1**. | + | | - When a tenant is created for the first time, the system creates the **/tenant** directory in the HDFS root directory. The storage path is customizable. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Specifies whether to associate resources of other services. For details, see :ref:`4 `. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Indicates the description of the current tenant. | + +------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + Roles, computing resources, and storage resources are automatically created when tenants are created. + + - The new role has permissions on the computing and storage resources. This role and its permissions are automatically controlled by the system and cannot be manually managed by choosing **System** > **Permission** > **Role**. The role name is in the format of *Tenant name*\ \_\ *Cluster ID*. The ID of the first cluster is not displayed by default. + - When using this tenant, create a system user and bind the user to the role of the tenant. For details, see :ref:`Adding a User and Binding the User to a Tenant Role `. + - During the tenant creation, the system automatically creates a Yarn queue named after the tenant. 
If the queue name already exists, the new queue is named **Tenant name-**\ *N*. *N* indicates a natural number starting from **1**. When a same name exists, the value *N* increases automatically to differentiate the queue from others. For example, **saletenant**, **saletenant-1**, and **saletenant-2**. + +#. Check whether the current tenant needs to be associated with resources of other services. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`5 `. + +#. .. _admin_guide_000101__en-us_topic_0165590104_l95df8df02a794fd7adb2f27cfcb5c042: + + Click **Associate Service** to configure other service resources used by the current tenant, and click **OK**. + + - Set **Service** to **HBase** and **Association Type** to **Exclusive** or **Shared**. + + .. note:: + + - **Exclusive** indicates that the service resources are used by the tenant exclusively and cannot be associated with other tenants. + - **Shared** indicates that the service resources can be shared with other tenants. + + .. note:: + + - Only HBase can be associated with a new tenant. However, HDFS, HBase, and Yarn can be associated with existing tenants. + - To associate an existing tenant with service resources, click the target tenant in the tenant list, switch to the **Service Associations** page, and click **Associate Service** to configure resources to be associated with the tenant. + - To disassociate an existing tenant from service resources, click the target tenant in the tenant list, switch to the **Service Associations** page, and click **Delete** in the **Operation** column. In the displayed dialog box, select **I have read the information and understand the impact** and click **OK**. + +#. .. _admin_guide_000101__en-us_topic_0165590104_lea52c6efc12849b4aca946b1c510728d: + + Click **OK**. Wait until the system displays a message indicating that the tenant is successfully created. + +.. |image1| image:: /_static/images/en-us_image_0263899257.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/adding_a_user_and_binding_the_user_to_a_tenant_role.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/adding_a_user_and_binding_the_user_to_a_tenant_role.rst new file mode 100644 index 0000000..31265f2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/adding_a_user_and_binding_the_user_to_a_tenant_role.rst @@ -0,0 +1,69 @@ +:original_name: admin_guide_000103.html + +.. _admin_guide_000103: + +Adding a User and Binding the User to a Tenant Role +=================================================== + +Scenario +-------- + +A newly created tenant cannot directly log in to the cluster to access resources. You need to add a user for the tenant on FusionInsight Manager and bind the user to the role of the tenant to assign operation permissions to the user. + +Prerequisites +------------- + +You have clarified service requirements and created a tenant. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **System** > **Permission** > **User**. + +#. If you want to add a user to the system, click **Create**. + + If you want to bind tenant roles to an existing user in the system, locate the row of the user and click **Modify** in the **Operation** column. + + Set user attributes according to :ref:`Table 1 `. + + .. 
_admin_guide_000103__en-us_topic_0193195962_t2b6451d372c44135bf8473b6b2dc0fd4: + + .. table:: **Table 1** User parameters + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+============================================================================================================================================================================================================================================+ + | Username | Indicates the current username. The value contains 3 to 32 characters, including digits, letters, underscores (_), hyphens (-), and spaces. | + | | | + | | - The username cannot be the same as the OS username of any node in the cluster. Otherwise, the user cannot be used. | + | | - A username that differs only in alphabetic case from an existing username is not allowed. For example, if **User1** has been created, you cannot create **user1**. Enter the correct username when using **User1**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | User Type | The options are **Human-Machine** and **Machine-Machine**. | + | | | + | | - **Human-Machine** user: used for FusionInsight Manager O&M and component client operations. If you select this option, set both **Password** and **Confirm Password** accordingly. | + | | - **Machine-Machine** user: used for application development. If you select this option, the password is randomly generated. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Password | This parameter is mandatory if **User Type** is set to **Human-Machine**. | + | | | + | | The password must contain 8 to 64 characters of at least four types of the following: uppercase letters, lowercase letters, digits, special characters, and spaces. The password cannot be the username or the username spelled backwards. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Confirm Password | Enter the password again. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | User Group | In the **User Group** area, click **Add** and select user groups to add the user to the groups. | + | | | + | | - If roles have been added to the user groups, the user can be granted the permissions of the roles. | + | | - For example, add the user to the Hive user group to assign Hive permissions to the user. 
| + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Primary Group | Select a group as the primary group for the user to create directories and files. The drop-down list contains all groups selected in **User Group**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Role | Click **Add** to bind a tenant role to the user. | + | | | + | | .. note:: | + | | | + | | - If a user wants to use the resources of tenant **tenant1** and to add or delete sub-tenants for **tenant1**, the user must be bound to both the **Manager_tenant** and **tenant1\_**\ *Cluster ID* roles. | + | | - If the tenant has been associated with the HBase service and Ranger authentication is enabled for the cluster, you need to configure the HBase execution permissions on the Ranger page. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Indicates the description of the current user. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/index.rst new file mode 100644 index 0000000..1954b50 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/creating_tenants/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000100.html + +.. _admin_guide_000100: + +Creating Tenants +================ + +- :ref:`Adding a Tenant ` +- :ref:`Adding a Sub-Tenant ` +- :ref:`Adding a User and Binding the User to a Tenant Role ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + adding_a_tenant + adding_a_sub-tenant + adding_a_user_and_binding_the_user_to_a_tenant_role diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/index.rst new file mode 100644 index 0000000..c165567 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/index.rst @@ -0,0 +1,20 @@ +:original_name: admin_guide_000099.html + +.. _admin_guide_000099: + +Using the Superior Scheduler +============================ + +- :ref:`Creating Tenants ` +- :ref:`Managing Tenants ` +- :ref:`Managing Resources ` +- :ref:`Managing Global User Policies ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + creating_tenants/index + managing_tenants/index + managing_resources/index + managing_global_user_policies diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_global_user_policies.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_global_user_policies.rst new file mode 100644 index 0000000..a0ee314 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_global_user_policies.rst @@ -0,0 +1,56 @@ +:original_name: admin_guide_000115.html + +.. _admin_guide_000115: + +Managing Global User Policies +============================= + +Scenario +-------- + +If a tenant uses a Superior scheduler, you can configure the global policy for users to use the resource scheduler, including: + +- Maximum running apps +- Maximum pending apps +- Default queue + +Procedure +--------- + +- Add a policy. + + #. On FusionInsight Manager, choose **Tenant Resources**. + #. Choose **Dynamic Resource Plan**. + #. Click the **Global User Policy** tab. + + .. note:: + + **defaults(default setting)** indicates that the policy specified for **defaults** is used if a user does not have a global policy. The default policy cannot be deleted. + + #. Click **Create Global User Policy**. In the displayed dialog box, set the following parameters: + + - **Cluster**: Select the target cluster. + - **Username**: indicates the user for whom resource scheduling is controlled. Enter an existing username in the current cluster. + - **Max Running Apps**: indicates the maximum number of tasks that the user can run in the current cluster. + - **Max Pending Apps**: indicates the maximum number of tasks that the user can suspend in the current cluster. + - **Default Queue**: indicates the queue of the user. Enter the name of an existing queue in the current cluster. + +- Modify a policy. + + #. On FusionInsight Manager, choose **Tenant Resources**. + #. Choose **Dynamic Resource Plan**. + #. Click the **Global User Policy** tab. + #. In the row that contains the desired user policy, click **Modify** in the **Operation** column. + #. In the displayed dialog box, modify parameters and click **OK**. + +- Delete a policy. + + #. On FusionInsight Manager, choose **Tenant Resources**. + + #. Choose **Dynamic Resource Plan**. + + #. Click the **Global User Policy** tab. + + #. In the row that contains the desired user policy, click **Delete** in the **Operation** column. + + In the displayed dialog box, click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/adding_a_resource_pool.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/adding_a_resource_pool.rst new file mode 100644 index 0000000..075e5aa --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/adding_a_resource_pool.rst @@ -0,0 +1,39 @@ +:original_name: admin_guide_000109.html + +.. _admin_guide_000109: + +Adding a Resource Pool +====================== + +Scenario +-------- + +In a cluster, you can logically group Yarn NodeManagers into Yarn resource pools. Each NodeManager belongs to only one resource pool. 
You can create a custom resource pool on FusionInsight Manager and add the hosts that have not been added to any custom resource pools to this resource pool so that specified queues can use the computing resources provided by these hosts. + +The system contains a **default** resource pool by default. All NodeManagers that are not added to custom resource pools belong to this resource pool. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Tenant Resources** > **Resource Pool**. + +#. Click **Add Resource Pool**. + +#. Set resource pool attributes. + + - **Cluster**: Select the cluster to which the resource pool is to be added. + - **Name**: Enter the name of the resource pool. The name contains 1 to 50 characters, including digits, letters, and underscores (_), and cannot start with an underscore (_). + - **Resource Label**: Enter the resource label of the resource pool. The value can contain 1 to 50 characters, including digits, letters, underscores (_), and hyphens (-), and must start with a digit or letter. + - **Resource**: In the **Available Hosts** area, select specified hosts and click |image1| to add the hosts to the **Selected Hosts** area. Only hosts in the cluster can be selected. The host list in the resource pool can be left blank. + + .. note:: + + You can filter hosts by host name, number of CPU cores, memory, operating system, or platform type based on service requirements. + +#. Click **OK**. + + After the resource pool is created, you can view its name, members, and mode in the resource pool list. Hosts that are added to the custom resource pool are no longer members of the **default** resource pool. + +.. |image1| image:: /_static/images/en-us_image_0263899621.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/clearing_queue_configurations.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/clearing_queue_configurations.rst new file mode 100644 index 0000000..82694c2 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/clearing_queue_configurations.rst @@ -0,0 +1,25 @@ +:original_name: admin_guide_000114.html + +.. _admin_guide_000114: + +Clearing Queue Configurations +============================= + +Scenario +-------- + +You can clear the configurations of a queue on FusionInsight MRS Manager when the queue does not need resources of a resource pool or the resource pool needs to be disassociated from the queue. Clearing queue configurations cancels the resource capacity policy of the queue in the resource pool. + +Prerequisites +------------- + +You have changed the default resource pool of the queue to another one. If a queue is to be disassociated from a resource pool, this resource pool cannot serve as the default resource pool of the queue. For details, see :ref:`Configuring a Queue `. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **Tenant Resources** > **Dynamic Resource Plan**. +#. Select the name of the target cluster from **Cluster** and select a resource pool from **Resource Pool**. +#. Locate the row that contains the target resource name in the **Resource Allocation** area, and click **Clear** in the **Operation** column. +#. 
In the displayed dialog box, click **OK** to clear the queue configurations from the current resource pool. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/configuring_a_queue.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/configuring_a_queue.rst new file mode 100644 index 0000000..d982784 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/configuring_a_queue.rst @@ -0,0 +1,67 @@ +:original_name: admin_guide_000112.html + +.. _admin_guide_000112: + +Configuring a Queue +=================== + +Scenario +-------- + +You can modify the queue configurations for a specified tenant on FusionInsight Manager. + +Prerequisites +------------- + +A tenant who uses the Superior scheduler has been added. + +Procedure +--------- + +#. On FusionInsight Manager, choose **Tenant Resources**. +#. Choose **Dynamic Resource Plan**. +#. Click the **Queue Configurations** tab. +#. Set **Cluster** to the name of the target cluster. In **All tenants resources** area, locate the row that contains the target tenant resource and click **Modify** in the **Operation** column. + + .. note:: + + - You can also access the **Modify Queue Configuration** page as follows: In the tenant list on the **Tenant Resources Management** page, click the target tenant, click the **Resource** tab, and click |image1| next to **Queue Configurations (Queue name)**. + - A queue can be bound to only one non-default resource pool. + - For parameters such as **Max Allocated vCores**, **Max Allocated Memory(MB)**, **Max Running Apps**, **Max Running Apps per User**, and **Max Pending Apps**, if the value of a sub-tenant is **-1**, the value of the parent tenant can be set to a specific limit. If the parent tenant value is a specific limit, the sub-tenant value can be set to **-1**. + - **Max Allocated vCores** and **Max Allocated Memory(MB)** must be both changed to values other than **-1**. + + .. table:: **Table 1** Queue configuration parameters + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+====================================================================================================================================================================================================================================================================================================================================================================+ + | Max Master Shares(%) | Indicates the maximum percentage of resources occupied by all ApplicationMasters in the current queue. 
| + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Max Allocated vCores | Indicates the maximum number of cores that can be allocated to a single Yarn container in the current queue. The default value is **-1**, indicating that the number of cores is not limited within the value range. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Max Allocated Memory(MB) | Indicates the maximum memory that can be allocated to a single Yarn container in the current queue. The default value is **-1**, indicating that the memory is not limited within the value range. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Max Running Apps | Indicates the maximum number of tasks that can be executed at the same time in the current queue. The default value is **-1**, indicating that the number is not limited within the value range (the meaning is the same if the value is empty). Value **0** indicates that tasks cannot be executed. The value ranges from **-1** to **2147483647**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Max Running Apps per User | Indicates the maximum number of tasks that can be executed by each user in the current queue at the same time. The default value is **-1**, indicating that the number is not limited within the value range (the meaning is the same if the value is empty). Value **0** indicates that tasks cannot be executed. The value ranges from **-1** to **2147483647**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Max Pending Apps | Indicates the maximum number of tasks that can be suspended at the same time in the current queue. The default value is **-1**, indicating that the number is not limited within the value range (the meaning is the same if the value is empty). Value **0** indicates that tasks cannot be suspended. 
The value ranges from **-1** to **2147483647**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Resource Allocation Rule | Indicates the rule for allocating resources to different tasks of a user. The rule can be **FIFO** or **FAIR**. | + | | | + | | If a user submits multiple tasks in the current queue and the rule is **FIFO**, the tasks are executed one by one in sequential order; If the rule is **FAIR**, resources are evenly allocated to all tasks. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Label | Indicates that tasks are executed on a node with a specified resource label. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Active | - **ACTIVE**: indicates that the current queue can receive and execute tasks. | + | | - **INACTIVE**: indicates that the current queue can receive but cannot execute tasks. Tasks submitted to the queue are suspended. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Open | - **OPEN**: indicates that the current queue is opened. | + | | - **CLOSED**: indicates that the current queue is closed. Tasks submitted to the queue are rejected. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Migrate Queue Upon Fault | If cross-AZ HA is enabled for a cluster and an AZ is faulty, set **Migrate Queue Upon Fault** to **TRUE** to migrate running queues of the tenant to other AZs. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**. + +.. 
|image1| image:: /_static/images/en-us_image_0263899409.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/configuring_the_queue_capacity_policy_of_a_resource_pool.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/configuring_the_queue_capacity_policy_of_a_resource_pool.rst new file mode 100644 index 0000000..8748411 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/configuring_the_queue_capacity_policy_of_a_resource_pool.rst @@ -0,0 +1,54 @@ +:original_name: admin_guide_000113.html + +.. _admin_guide_000113: + +Configuring the Queue Capacity Policy of a Resource Pool +======================================================== + +Scenario +-------- + +After a resource pool is added, you can configure the capacity policy of available resources for Yarn queues so that jobs in the queues can be properly executed in the resource pool. + +This section describes how to configure the queue policy on FusionInsight Manager. Tenant queues equipped with the Superior scheduler can use resources in different resource pools. + +Prerequisites +------------- + +- You have logged in to FusionInsight Manager. + +- A resource pool has been added. +- The target queue is not associated with the resource pools of other queues except the default resource pool. + +Procedure +--------- + +#. On FusionInsight Manager, choose **Tenant Resources**. +#. Choose **Dynamic Resource Plan**. +#. Click the **Resource Distribution Policy** tab. +#. Select the name of the target cluster from **Cluster** and select a resource pool from **Resource Pool**. +#. Locate the row that contains the target queue in the **Resource Allocation** area, and click **Modify** in the **Operation** column. +#. On the **Resource Configuration Policy** tab of the **Modify Resource Allocation** window, set the resource configuration policy of the queue in the resource pool. + + - **Weight**: indicates the resources that a tenant can obtain. Its initial value is the same as the minimum resource percentage. + - **Minimum Resource**: indicates the minimum resources that a tenant can obtain. + - **Maximum Resource**: indicates the maximum resources that a tenant can obtain. + - **Reserved Resource**: indicates the resources that are reserved for the tenant's queues and cannot be lent to other tenants' queues. + +#. Click the **User Policy** tab in the **Modify Resource Allocation** window and set the user policy. + + .. note:: + + **defaultUser(built-in)** indicates that the policy specified for **defaultUser** is used if a user does not have a policy. The default policy cannot be deleted. + + - Click **Add User Policy** to add a user policy. + + - **Username**: indicates the name of a user. + - **Weight**: indicates the resources that the user can obtain. + - **Max vCores**: indicates the maximum number of virtual cores that the user can obtain. + - **Max Memory(MB)**: indicates the maximum memory that the user can obtain. + + - Click **Modify** in the **Operation** column to modify an existing user policy. + - Click **Clear** in the **Operation** column to delete an existing user policy. + +#. Click **OK**. 
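+The queue limits described above follow a few consistency rules: **-1** (or an empty value) means "not limited", the limits range from **-1** to **2147483647** with **0** meaning that no tasks run, and **Max Allocated vCores** and **Max Allocated Memory(MB)** must be changed together. The short Python sketch below simply restates those documented rules as a validation helper; the names ``QueueConfig`` and ``validate`` are hypothetical and are not part of the FusionInsight Manager or Superior scheduler APIs.
+
+.. code-block:: python
+
+   # Minimal sketch of the documented queue-limit rules. Illustrative only;
+   # these names are not part of any FusionInsight Manager interface.
+   from dataclasses import dataclass
+
+   INT_MAX = 2147483647   # upper bound stated for the queue limits
+   UNLIMITED = -1         # -1 (or an empty value) means "not limited"
+
+   @dataclass
+   class QueueConfig:
+       max_master_shares_pct: int                # Max Master Shares(%)
+       max_allocated_vcores: int = UNLIMITED     # per-container vCore limit
+       max_allocated_memory_mb: int = UNLIMITED  # per-container memory limit
+       max_running_apps: int = UNLIMITED
+       max_running_apps_per_user: int = UNLIMITED
+       max_pending_apps: int = UNLIMITED
+       resource_allocation_rule: str = "FAIR"    # FIFO or FAIR
+
+   def validate(cfg: QueueConfig) -> list:
+       """Return a list of violations of the rules stated in the documentation."""
+       problems = []
+       for name in ("max_running_apps", "max_running_apps_per_user", "max_pending_apps"):
+           value = getattr(cfg, name)
+           if not (UNLIMITED <= value <= INT_MAX):
+               problems.append(name + " must be between -1 and 2147483647")
+       # Max Allocated vCores and Max Allocated Memory(MB) must both stay -1
+       # or both be set to concrete limits.
+       if (cfg.max_allocated_vcores == UNLIMITED) != (cfg.max_allocated_memory_mb == UNLIMITED):
+           problems.append("Max Allocated vCores and Max Allocated Memory(MB) must be changed together")
+       if cfg.resource_allocation_rule not in ("FIFO", "FAIR"):
+           problems.append("Resource Allocation Rule must be FIFO or FAIR")
+       return problems
+
+   # Example: capping vCores while leaving memory unlimited violates the
+   # "change both together" rule from the note above.
+   print(validate(QueueConfig(max_master_shares_pct=50, max_allocated_vcores=8)))
+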
diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/deleting_a_resource_pool.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/deleting_a_resource_pool.rst new file mode 100644 index 0000000..62a1bb9 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/deleting_a_resource_pool.rst @@ -0,0 +1,25 @@ +:original_name: admin_guide_000111.html + +.. _admin_guide_000111: + +Deleting a Resource Pool +======================== + +Scenario +-------- + +If a resource pool is no longer used based on service requirements, you can delete it on FusionInsight Manager. + +Prerequisites +------------- + +- No queue in the cluster uses the resource pool to be deleted as its default resource pool. Before deleting the resource pool, change the default resource pool of any queue that uses it to another resource pool. For details, see :ref:`Configuring a Queue `. +- Resource distribution policies of all queues have been cleared from the resource pool to be deleted. For details, see :ref:`Clearing Queue Configurations `. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **Tenant Resources** > **Resource Pool**. +#. Locate the row that contains the specified resource pool, and click **Delete** in the **Operation** column. +#. In the displayed dialog box, click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/index.rst new file mode 100644 index 0000000..1c9286a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/index.rst @@ -0,0 +1,24 @@ +:original_name: admin_guide_000108.html + +.. _admin_guide_000108: + +Managing Resources +================== + +- :ref:`Adding a Resource Pool ` +- :ref:`Modifying a Resource Pool ` +- :ref:`Deleting a Resource Pool ` +- :ref:`Configuring a Queue ` +- :ref:`Configuring the Queue Capacity Policy of a Resource Pool ` +- :ref:`Clearing Queue Configurations ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + adding_a_resource_pool + modifying_a_resource_pool + deleting_a_resource_pool + configuring_a_queue + configuring_the_queue_capacity_policy_of_a_resource_pool + clearing_queue_configurations diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/modifying_a_resource_pool.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/modifying_a_resource_pool.rst new file mode 100644 index 0000000..c464d9e --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_resources/modifying_a_resource_pool.rst @@ -0,0 +1,27 @@ +:original_name: admin_guide_000110.html + +.. _admin_guide_000110: + +Modifying a Resource Pool +========================= + +Scenario +-------- + +When hosts in a resource pool need to be adjusted based on service requirements, you can modify members in the resource pool on FusionInsight Manager. + +Procedure +--------- + +#. 
Log in to FusionInsight Manager. +#. Choose **Tenant Resources** > **Resource Pool**. +#. Locate the row that contains the specified resource pool, and click **Edit** in the **Operation** column. +#. In the **Resource** area, modify hosts. + + - Adding hosts: Select desired hosts in **Available Hosts** and click |image1| to add them to the resource pool. + - Deleting hosts: Select desired hosts in **Selected Hosts** and click |image2| to remove them from the resource pool. The host list in the resource pool can be left blank. + +#. Click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0263899458.png +.. |image2| image:: /_static/images/en-us_image_0263899454.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/deleting_a_tenant.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/deleting_a_tenant.rst new file mode 100644 index 0000000..bbcd211 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/deleting_a_tenant.rst @@ -0,0 +1,39 @@ +:original_name: admin_guide_000107.html + +.. _admin_guide_000107: + +Deleting a Tenant +================= + +Scenario +-------- + +You can delete tenants that are no longer used on FusionInsight Manager based on service requirements to release resources occupied by the tenants. + +Prerequisites +------------- + +- A tenant has been added. +- The tenant has no sub-tenants. If the tenant has sub-tenants, delete them; otherwise, the tenant cannot be deleted. +- The role of the tenant is not associated with any user or user group. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **Tenant Resources**. + +#. In the tenant list on the left, click the target tenant and click |image1|. + + .. note:: + + - If you want to retain the tenant data, select **Reserve the data of this tenant resource**. Otherwise, the storage space of the tenant will be deleted. + +#. Click **OK**. + + It takes a few minutes to save the configuration. After the tenant is deleted, the role and storage space of the tenant are also deleted. + + .. note:: + + After the tenant is deleted, the queue of the tenant still exists in Yarn. The queue of the tenant is not displayed on the role management page in Yarn. + +.. |image1| image:: /_static/images/en-us_image_0000001375852797.png diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/index.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/index.rst new file mode 100644 index 0000000..2fb0a2a --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/index.rst @@ -0,0 +1,18 @@ +:original_name: admin_guide_000104.html + +.. _admin_guide_000104: + +Managing Tenants +================ + +- :ref:`Managing Tenant Directories ` +- :ref:`Restoring Tenant Data ` +- :ref:`Deleting a Tenant ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + managing_tenant_directories + restoring_tenant_data + deleting_a_tenant diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/managing_tenant_directories.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/managing_tenant_directories.rst new file mode 100644 index 0000000..e13c882 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/managing_tenant_directories.rst @@ -0,0 +1,92 @@ +:original_name: admin_guide_000105.html + +.. _admin_guide_000105: + +Managing Tenant Directories +=========================== + +Scenario +-------- + +You can manage the HDFS storage directories used by specified tenants based on service requirements on FusionInsight Manager, such as adding tenant directories, changing the quotas for directories and files and for storage space, and deleting directories. + +Prerequisites +------------- + +A tenant with HDFS storage resources has been added. + +Viewing a Tenant Directory +-------------------------- + +#. Log in to FusionInsight Manager and choose **Tenant Resources**. +#. In the tenant list on the left, click the target tenant. +#. Click the **Resource** tab. +#. View the **HDFS Storage** table. + + - The **File Number Threshold** column provides the quota for files and directories of the tenant directory. + - The **Space Quota** column provides the storage space size of the tenant directory. + +Adding a Tenant Directory +------------------------- + +#. On FusionInsight Manager, choose **Tenant Resources**. +#. In the tenant list on the left, click the target tenant. +#. Click the **Resource** tab. +#. In the **HDFS Storage** area, click **Create Directory**. + + - **Parent Directory**: indicates the storage directory used by the parent tenant of the current tenant. + + .. note:: + + This parameter is not displayed if the current tenant is not a sub-tenant. + + - Set **Path** to a tenant directory path. + + .. note:: + + If the current tenant is not a sub-tenant, the new path is created in the HDFS root directory. + + - Set **Quota** to the quota for files and directories. + - **File Number Threshold (%)** is valid only when **Quota** is set. If the ratio of the number of used files to the value of **Quota** exceeds the value of this parameter, an alarm is generated. If this parameter is not specified, no alarm is reported in this scenario. + + .. note:: + + The number of used files is collected every hour. Therefore, the alarm indicating that the ratio of used files exceeds the threshold is delayed. + + - Set **Space Quota** to the storage space size of the tenant directory. + - If the ratio of used storage space to the value of **Space Quota** exceeds the **Storage Space Threshold (%)** value, an alarm is generated. If this parameter is not specified, no alarm is reported in this scenario. + + .. note:: + + The used storage space is collected every hour. Therefore, the alarm indicating that the ratio of used storage space exceeds the threshold is delayed. + +#. Click **OK**. + +Modifying a Tenant Directory +---------------------------- + +#. On FusionInsight Manager, choose **Tenant Resources**. +#. In the tenant list on the left, click the target tenant. +#. Click the **Resource** tab. +#. 
In the **HDFS Storage** table, click **Modify** in the **Operation** column of the specified tenant directory. + + - Set **Quota** to the quota for files and directories. + - **File Number Threshold (%)** is valid only when **Quota** is set. If the ratio of the number of used files to the value of **Quota** exceeds the value of this parameter, an alarm is generated. If this parameter is not specified, no alarm is reported in this scenario. + - Set **Space Quota** to the storage space size of the tenant directory. + - If the ratio of used storage space to the value of **Space Quota** exceeds the **Storage Space Threshold (%)** value, an alarm is generated. If this parameter is not specified, no alarm is reported in this scenario. + +#. Click **OK**. + +Deleting a Tenant Directory +--------------------------- + +#. On FusionInsight Manager, choose **Tenant Resources**. +#. In the tenant list on the left, click the target tenant. +#. Click the **Resource** tab. +#. In the **HDFS Storage** table, click **Delete** in the **Operation** column of the specified tenant directory. + + .. note:: + + The tenant directory that is created by the system during tenant creation cannot be deleted. + +#. Click **OK**. diff --git a/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/restoring_tenant_data.rst b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/restoring_tenant_data.rst new file mode 100644 index 0000000..df493c5 --- /dev/null +++ b/umn/source/fusioninsight_manager_operation_guide_applicable_to_3.x/tenant_resources/using_the_superior_scheduler/managing_tenants/restoring_tenant_data.rst @@ -0,0 +1,33 @@ +:original_name: admin_guide_000106.html + +.. _admin_guide_000106: + +Restoring Tenant Data +===================== + +Scenario +-------- + +Tenant data is stored on FusionInsight Manager and cluster components. When components are recovered from failures or reinstalled, some configuration data of all tenants may become abnormal. In this case, you need to manually restore the configuration data on FusionInsight Manager. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **Tenant Resources**. + +#. In the tenant list on the left, click the target tenant. + +#. Check the tenant data status. + + a. On the **Summary** page, check **Tenant Status**. A green icon indicates that the tenant is available and gray indicates that the tenant is unavailable. + b. Click **Resource** and check the icons on the left of **Yarn** and **HDFS Storage**. A green icon indicates that the resource is available, and gray indicates that the resource is unavailable. + c. Click **Service Associations** and check the **Status** column of the associated services. **Normal** indicates that the component can provide services for the associated tenant. **Not Available** indicates that the component cannot provide services for the tenant. + d. If any of the preceding check items is abnormal, go to :ref:`4 ` to restore tenant data. + +#. .. _admin_guide_000106__en-us_topic_0193195958_l62f85b027a17495484c1162c5dd730f1: + + Click |image1|. In the displayed dialog box, enter the password of the current login user and click **OK**. + +#. In the **Restore Tenant Resource Data** window, select one or more components to restore data, and click **OK**. The system automatically restores the tenant data. + +.. 
|image1| image:: /_static/images/en-us_image_0263899446.png diff --git a/umn/source/glossary.rst b/umn/source/glossary.rst new file mode 100644 index 0000000..bce6465 --- /dev/null +++ b/umn/source/glossary.rst @@ -0,0 +1,8 @@ +:original_name: en-us_topic_0000001296775220.html + +.. _en-us_topic_0000001296775220: + +Glossary +======== + +For details about the terms involved in this document, see `Glossary `__. diff --git a/umn/source/high-risk_operations.rst b/umn/source/high-risk_operations.rst new file mode 100644 index 0000000..0c32720 --- /dev/null +++ b/umn/source/high-risk_operations.rst @@ -0,0 +1,427 @@ +:original_name: mrs_01_0785.html + +.. _mrs_01_0785: + +High-Risk Operations +==================== + +Forbidden Operations +-------------------- + +:ref:`Table 1 ` lists forbidden operations during the routine cluster operation and maintenance process. + +.. _mrs_01_0785__admin_guide_000295_en-us_topic_0046737034_table44115416: + +.. table:: **Table 1** Forbidden operations + + +--------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Item | Risk | + +==============================================================================================================+==================================================================================================================================================================+ + | Delete ZooKeeper data directories. | ClickHouse, HDFS, Yarn, HBase, and Hive depend on ZooKeeper, which stores metadata. This operation has adverse impact on normal operating of related components. | + +--------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Performing switchover frequently between active and standby JDBCServer nodes | This operation may interrupt services. | + +--------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Delete Phoenix system tables and data (SYSTEM.CATALOG, SYSTEM.STATS, SYSTEM.SEQUENCE, and SYSTEM. FUNCTION). | This operation will cause service operation failures. | + +--------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Manually modify data in the Hive metabase (hivemeta database). | This operation may cause Hive data parse errors. As a result, Hive cannot provide services. | + +--------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Do not manually perform **INSERT** or **UPDATE** operations on Hive metadata tables. | This operation may cause Hive data parse errors. As a result, Hive cannot provide services. 
| + +--------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Change permission on the Hive private file directory **hdfs:///tmp/hive-scratch**. | This operation may cause unavailable Hive services. | + +--------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Modify **broker.id** in the Kafka configuration file. | This operation may cause invalid node data. | + +--------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Modify the host names of nodes. | Instances and upper-layer components on the host cannot provide services properly. The fault cannot be rectified. | + +--------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Reinstall the OS of a node. | This operation will cause MRS cluster exceptions, leaving MRS clusters in abnormal status. | + +--------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Use private images. | This operation will cause MRS cluster exceptions, leaving MRS clusters in abnormal status. | + +--------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +The following tables list the high-risk operations during the operation and maintenance of each component. + +High-Risk Operations on a Cluster +--------------------------------- + +.. 
table:: **Table 2** High-risk operations on a cluster + + +-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +===================================================================================+============================================================================================================================================================+==========+===========================================================================================================================================================================================+=====================================================+ + | Modify the file directory or file permissions of user **omm** without permission. | This operation will lead to MRS service unavailability. | ▲▲▲▲▲ | Do not perform this operation. | Check whether the MRS cluster service is available. | + +-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------+ + | Bind an EIP. | This operation exposes the Master node hosting MRS Manager of the cluster to the public network, increasing the risk of network attacks from the Internet. | ▲▲▲▲▲ | Ensure that the bound EIP is a trusted public IP address. | None | + +-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------+ + | Enable security group rules for port 22 of a cluster. | This operation increases the risk of exploiting vulnerability of port 22. | ▲▲▲▲▲ | Configure a security group rule for port 22 to allow only trusted IP addresses to access the port. You are not advised to configure the inbound rule to allow 0.0.0.0 to access the port. | None | + +-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------+ + | Delete a cluster or cluster data. | Data will get lost. 
| ▲▲▲▲▲ | Before deleting the data, confirm the necessity of the operation and ensure that the data has been backed up. | None | + +-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------+ + | Scale in a cluster. | Data will get lost. | ▲▲▲▲▲ | Before scaling in the cluster, confirm the necessity of the operation and ensure that the data has been backed up. | None | + +-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------+ + | Detach or format a data disk. | Data will get lost. | ▲▲▲▲▲ | Before performing this operation, confirm the necessity of the operation and ensure that the data has been backed up. | None | + +-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------+ + +Manager High-Risk Operations +---------------------------- + +.. 
table:: **Table 3** Manager high-risk operations + + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +===============================================================================================+===================================================================================================================================================================================================================================================================================================+=============+=======================================================================================================================================================================================================+===============================================================================================================+ + | Change the OMS password. | This operation will restart all processes of OMSServer, which has adverse impact on cluster maintenance and management. | ▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | Check whether there are uncleared alarms and whether the cluster management and maintenance are normal. | + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Import the certificate. | This operation will restart OMS processes and the entire cluster, which has adverse impact on cluster maintenance and management and services. | ▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal. 
| + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Perform an upgrade. | This operation will restart Manager and the entire cluster, affecting management, maintenance, and services of the cluster. | ▲▲▲ | Ensure that there is no other maintenance and management operations when the operation is performed. | Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal. | + | | | | | | + | | Strictly manage the user who is eligible to assign the cluster management permission to prevent security risks. | | | | + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Restore the OMS. | This operation will restart Manager and the entire cluster, affecting management, maintenance, and services of the cluster. | ▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal. | + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Change an IP address. | This operation will restart Manager and the entire cluster, affecting management, maintenance, and services of the cluster. | ▲▲▲ | Ensure that there is no other maintenance and management operations when the operation is performed and that the new IP address is correct. | Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal. 
| + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Change log levels. | If the log level is changed to **DEBUG**, Manager responds slowly. | ▲▲ | Before the modification, confirm the necessity of the operation and change it back to the default log level in time. | None | + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Replace a control node. | This operation will interrupt services deployed on the node. If the node is a management node, the operation will restart all OMS processes, affecting the cluster management and maintenance. | ▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal. | + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Replace a management node. | This operation will interrupt services deployed on the node. As a result, OMS processes will be restarted, affecting the cluster management and maintenance. | ▲▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal. 
| + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Restart the upper-layer service at the same time during the restart of a lower-layer service. | This operation will interrupt the upper-layer service, affecting the management, maintenance, and services of the cluster. | ▲▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal. | + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Modify the OLDAP port. | This operation will restart the LdapServer and Kerberos services and all associated services, affecting service running. | ▲▲▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | None | + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Delete the **supergroup** group. | Deleting the **supergroup** group decreases user rights, affecting service access. | ▲▲▲▲▲ | Before the change, confirm the rights to be added. Ensure that the required rights have been added before deleting the **supergroup** rights to which the user is bound, ensuring service continuity. 
| None | + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Restart a service. | Services will be interrupted during the restart. If you select and restart the upper-layer service, the upper-layer services that depend on the service will be interrupted. | ▲▲▲ | Confirm the necessity of restarting the system before the operation. | Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal. | + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Change the default SSH port No. | After the default port (22) is changed, functions such as cluster creation, service/instance adding, host adding, and host reinstallation cannot be used, and results of cluster health check items for node mutual trust, **omm**/**ommdba** user password expiration, and others are incorrect. | ▲▲▲ | Before performing this operation, restore the SSH port to the default value. | None | + +-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + +ClickHouse High-Risk Operations +------------------------------- + +.. 
table:: **Table 4** ClickHouse high-risk operations + + +-----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +===========================================================+================================================================================================================================================================================+==========+=============================================================================================================================+=======================================================================================================================+ + | Delete data directories. | This operation may cause service information loss. | ▲▲▲ | Do not delete data directories manually. | Check whether data directories are normal. | + +-----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | Remove ClickHouseServer instances. | The ClickHouseServer instance nodes in the same shard must be removed at the same time. Otherwise, the topology information of the logical cluster is incorrect. Before performing this operation, check the database and data table information of each node in the logical cluster and perform scale-in pre-analysis to ensure that data is successfully migrated during the scale-in process to prevent data loss. | ▲▲▲▲▲ | Before scale-in, collect information in advance to learn the status of the ClickHouse logical cluster and instance nodes. | Check the ClickHouse logical cluster topology information, database and data table information in each ClickHouseServer instance, and data volume.
| + +-----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | Add ClickHouseServer instances. | When performing this operation, you must check whether a database or data table with the same name as that on the old node needs to be created on the new node. Otherwise, subsequent data migration, data balancing, scale-in, and decommissioning will fail. | ▲▲▲▲▲ | Before scale-out, confirm the function and purpose of new ClickHouseServer instances and determine whether to create related databases and data tables. | Check the ClickHouse logical cluster topology information, database and data table information in each ClickHouseServer instance, and data volume. | + +-----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | Decommission ClickHouseServer instances. | The ClickHouseServer instance nodes in the same shard must be decommissioned at the same time. Otherwise, the topology information of the logical cluster is incorrect. Before performing this operation, check the database and data table information of each node in the logical cluster and perform decommissioning pre-analysis to ensure that data is successfully migrated during the decommissioning process to prevent data loss. | ▲▲▲▲▲ | Before decommissioning, collect information in advance to learn the status of the ClickHouse logical cluster and instance nodes. | Check the ClickHouse logical cluster topology information, database and data table information in each ClickHouseServer instance, and data volume.
| + +-----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | Recommission ClickHouseServer instances. | When performing this operation, you must select all nodes in the original shard. Otherwise, the topology information of the logical cluster is incorrect. | ▲▲▲▲▲ | Before recommissioning, confirm the shard to which the node to be recommissioned belongs. | Check the ClickHouse logical cluster topology information. | + +-----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | Modify data directory content (file and folder creation). | This operation may cause faults on the ClickHouse instance of the node. | ▲▲▲ | Do not create or modify files or folders in the data directories manually. | Check whether data directories are normal. | + +-----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | Start or stop basic components independently. | This operation has an adverse impact on the basic functions of some services. As a result, service failures occur. | ▲▲▲ | Do not start or stop the ZooKeeper, Kerberos, and LDAP basic components independently. Select related services when performing this operation. | Check whether the service status is normal.
| + +-----------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Restart or stop services. | This operation may interrupt services. | ▲▲ | Restart or stop services when necessary. | Check whether the service is running properly. | + +-----------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+ + +DBService High-Risk Operations +------------------------------ + +.. table:: **Table 5** DBService high-risk operations + + +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +==============================================+================================================================================================================================================+=============+==================================================================================================================================================================+=========================================================================================================+ + | Change the DBService password. | The services need to be restarted for the password change to take effect. The services are unavailable during the restart. | ▲▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | Check whether there are uncleared alarms and whether the cluster management and maintenance are normal. 
| + +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+ + | Restore DBService data. | After the data is restored, the data generated after the data backup and before the data restoration is lost. | ▲▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | Check whether there are uncleared alarms and whether the cluster management and maintenance are normal. | + | | | | | | + | | After the data is restored, the configuration of the components that depend on DBService may expire and these components need to be restarted. | | | | + +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+ + | Perform active/standby DBService switchover. | During the DBServer switchover, DBService is unavailable. | ▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | None | + +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+ + | Change the DBService floating IP address. | The DBService needs to be restarted for the change to take effect. The DBService is unavailable during the restart. | ▲▲▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether services can be started properly. | + | | | | | | + | | If the floating IP address has been used, the configuration will fail, and the DBService will fail to be started. | | | | + +----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+ + +Flink High-Risk Operations +-------------------------- + +.. 
table:: **Table 6** Flink high-risk operations + + +--------------------------+----------------------------------------------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +==========================+==================================================================================+==========+======================================================================================================================+======================================================+ + | Change log levels. | If the log level is modified to DEBUG, the task running performance is affected. | ▲▲ | Before the modification, confirm the necessity of the operation and change it back to the default log level in time. | None | + +--------------------------+----------------------------------------------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+ + | Modify file permissions. | Tasks may fail. | ▲▲▲ | Confirm the necessity of the operation before the modification. | Check whether related service operations are normal. | + +--------------------------+----------------------------------------------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+ + +Flume High-Risk Operations +-------------------------- + +.. table:: **Table 7** Flume high-risk operations + + +----------------------------------------------------------------------+-----------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +======================================================================+=========================================================================================+=============+=======================================================================================================================================================+===========================================================================================+ + | Modify the Flume instance start parameter **GC_OPTS**. | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether services can be started properly. | + +----------------------------------------------------------------------+-----------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+ + | Change the default value of **dfs.replication** from **3** to **1**. 
| This operation will have the following impacts: | ▲▲▲▲ | When modifying related configuration items, check the parameter description carefully. Ensure that there are more than two replicas for data storage. | Check whether the default replica number is not 1 and whether the HDFS service is normal. | + | | | | | | + | | #. The storage reliability deteriorates. If the disk becomes faulty, data will be lost. | | | | + | | #. NameNode fails to be restarted, and the HDFS service is unavailable. | | | | + +----------------------------------------------------------------------+-----------------------------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+ + +HBase High-Risk Operations +-------------------------- + +.. table:: **Table 8** HBase high-risk operations + + +-------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +=========================================================================================================================+========================================================+=============+=====================================================================================================================================================================+=======================================================+ + | Modify encryption configuration. | Services cannot start properly. | ▲▲▲▲ | Strictly follow the prompt information when modifying related configuration items, which are associated. Ensure that new values are valid. | Check whether services can be started properly. | + | | | | | | + | - hbase.regionserver.wal.encryption | | | | | + | - hbase.crypto.keyprovider.parameters.uri | | | | | + | - hbase.crypto.keyprovider.parameters.encryptedtext | | | | | + +-------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------+ + | Change the value of **hbase.regionserver.wal.encryption** to **false** or switch encryption algorithm from AES to SMS4. | This operation may cause start failures and data loss. | ▲▲▲▲ | When HFile and WAL are encrypted using an encryption algorithm and a table is created, do not close or switch the encryption algorithm randomly. | None | + | | | | | | + | | | | If an encryption table (ENCRYPTION=>AES/SMS4) is not created, you can only switch the encryption algorithm. 
| | + +-------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------+ + | Modify the HBase instance start parameters **GC_OPTS** and **HBASE_HEAPSIZE**. | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid and that GC_OPTS does not conflict with HBASE_HEAPSIZE. | Check whether services can be started properly. | + +-------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------+ + | Use the **OfflineMetaRepair** tool. | Services cannot start properly. | ▲▲▲▲ | This tool can be used only when HBase is offline and cannot be used in data migration scenarios. | Check whether HBase services can be started properly. | + +-------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------+ + +HDFS High-Risk Operations + ------------------------- + + .. table:: **Table 9** HDFS high-risk operations + + +-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +===================================================================================================================================+=====================================================================================================================================================+=============+======================================================================================================================================================================+=================================================================================================================================================+ + | Change the HDFS NameNode data storage directory **dfs.namenode.name.dir** and DataNode data storage directory **dfs.datanode.data.dir**. | Services cannot start properly. | ▲▲▲▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid.
| Check whether services can be started properly. | + +-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Use the **-delete** parameter when you run the **hadoop distcp** command. | During DistCP copying, files that do not exist in the source cluster but exist in the destination cluster are deleted from the destination cluster. | ▲▲ | When using DistCP, determine whether to retain the redundant files in the destination cluster. Exercise caution when using the **-delete** parameter. | After DistCP copying is complete, check whether the data in the destination cluster is retained or deleted according to the parameter settings. | + +-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Modify the HDFS instance start parameter **GC_OPTS**, **HADOOP_HEAPSIZE**, and **GC_PROFILE**. | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. GC_OPTS does not conflict with HADOOP_HEAPSIZE. | Check whether services can be started properly. | + +-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Change the default value of **dfs.replication** from **3** to **1**. | This operation will have the following impacts: | ▲▲▲▲ | When modifying related configuration items, check the parameter description carefully. Ensure that there are more than two replicas for data storage. | Check whether the default replica number is not 1 and whether the HDFS service is normal. | + | | | | | | + | | #. The storage reliability deteriorates. If the disk becomes faulty, data will be lost. | | | | + | | #. NameNode fails to be restarted, and the HDFS service is unavailable. 
| | | | + +-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Change the remote procedure call (RPC) channel encryption mode (**hadoop.rpc.protection**) of each module in Hadoop. | This operation causes service faults and service exceptions. | ▲▲▲▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether HDFS and other services that depend on HDFS can properly start and provide services. | + +-----------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + +Hive High-Risk Operations +------------------------- + +.. table:: **Table 10** Hive high-risk operations + + +----------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +========================================================================================================================================+=====================================================================================================================================+=============+================================================================================================================================+======================================================+ + | Modify the Hive instance start parameter **GC_OPTS**. | This operation may cause Hive instance start failures. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether services can be started properly. 
| + +----------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+ + | Delete all MetaStore instances. | This operation may cause Hive metadata loss. As a result, Hive cannot provide services. | ▲▲▲ | Do not perform this operation unless you are sure that the Hive table information can be discarded. | Check whether services can be started properly. | + +----------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+ + | Delete or modify files corresponding to Hive tables over HDFS interfaces or HBase interfaces. | This operation may cause Hive service data loss or tampering. | ▲▲ | Do not perform this operation unless you are sure that the data can be discarded or that the operation meets service requirements. | Check whether Hive data is complete. | + +----------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+ + | Delete or modify files corresponding to Hive tables, or modify directory access permissions, over HDFS interfaces or HBase interfaces. | This operation may cause related service scenarios to be unavailable. | ▲▲▲ | Do not perform this operation. | Check whether related service operations are normal. | + +----------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+ + | Delete or modify **hdfs:///apps/templeton/hive-3.1.0.tar.gz** over HDFS interfaces. | WebHCat cannot provide services after this operation. | ▲▲ | Do not perform this operation. | Check whether related service operations are normal.
| + +----------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+ + | Export table data to overwrite local data. For example, export the data of **t1** to **/opt/dir**. | This operation will delete target directories. Incorrect setting may cause software or OS startup failures. | ▲▲▲▲▲ | Ensure that the path where the data is written does not contain any files, or do not use the keyword **overwrite** in the command. | Check whether files in the target path are lost. | + | | | | | | + | **insert overwrite local directory '/opt/dir' select \* from t1;** | | | | | + +----------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+ + | Direct different databases, tables, or partition files to the same path, for example, default warehouse path **/user/hive/warehouse**. | The creation operation may cause disordered data. After a database, table, or partition is deleted, other object data will be lost. | ▲▲▲▲▲ | Do not perform this operation. | Check whether files in the target path are lost. | + +----------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+ + +Kafka High-Risk Operations + -------------------------- + +
.. table:: **Table 11** Kafka high-risk operations + + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +=======================================================================================+===================================================================================================================================================================================+==========+============================================================================================================================================+================================================================================================+ + | Delete Topic | This operation may delete existing topics and data. | ▲▲▲ | Kerberos authentication is used to ensure that authenticated users have operation permissions. Ensure that topic names are correct. | Check whether topics are processed properly. | + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Delete data directories. | This operation may cause service information loss. | ▲▲▲ | Do not delete data directories manually. | Check whether data directories are normal. | + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Modify data directory content (file and folder creation). | This operation may cause faults on the Broker instance of the node. | ▲▲▲ | Do not create or modify files or folders in the data directories manually. | Check whether data directories are normal. | + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Modify the disk auto-adaptation function using the **disk.adapter.enable** parameter.
| This operation adjusts the topic data retention period when the disk usage reaches the threshold. Historical data that does not fall within the retention period may be deleted. | ▲▲▲ | If the retention period of some topics cannot be adjusted, add these topics to the value of **disk.adapter.topic.blacklist**. | Observe the data storage period on the Kafka topic monitoring page. | + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Modify the data directory (**log.dirs**) configuration. | Incorrect operation may cause process faults. | ▲▲▲ | Ensure that the added or modified data directories are empty and that the directory permissions are correct. | Check whether data directories are normal. | + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Reduce the capacity of the Kafka cluster. | This operation may reduce the number of replicas of some topic partitions. As a result, some topics may become inaccessible. | ▲▲ | Back up data before reducing the capacity of the Kafka cluster. | Check whether the nodes where partition replicas are located are active to ensure data security. | + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Start or stop basic components independently. | This operation has an adverse impact on the basic functions of some services. As a result, service failures occur. | ▲▲▲ | Do not start or stop the ZooKeeper, Kerberos, and LDAP basic components independently. Select related services when performing this operation. | Check whether the service status is normal. | + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Restart or stop services. | This operation may interrupt services.
| ▲▲ | Restart or stop services when necessary. | Check whether the service is running properly. | + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Modify configuration parameters. | This operation requires service restart for configuration to take effect. | ▲▲ | Modify configuration when necessary. | Check whether the service is running properly. | + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Delete or modify metadata. | Modifying or deleting Kafka metadata on ZooKeeper may cause the Kafka topic or service unavailability. | ▲▲▲ | Do not delete or modify Kafka metadata stored on ZooKeeper. | Check whether the Kafka topics or Kafka service is available. | + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Delete metadata backup files. | After Kafka metadata backup files are modified and used to restore Kafka metadata, Kafka topics or the Kafka service may be unavailable. | ▲▲▲ | Do not delete Kafka metadata backup files. | Check whether the Kafka topics or Kafka service is available. | + +---------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + +KrbServer High-Risk Operations +------------------------------ + +.. 
table:: **Table 12** KrbServer high-risk operations + + +-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +=====================================================+==============================================================================================================================================================================================================================+==========+==================================================================================================================================================================+===============================================================================================================+ + | Modify the **KADMIN_PORT** parameter of KrbServer. | After this parameter is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected. | ▲▲▲▲▲ | After this parameter is modified, restart the KrbServer service and all its associated services. | None | + +-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Modify the **kdc_ports** parameter of KrbServer. | After this parameter is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected. | ▲▲▲▲▲ | After this parameter is modified, restart the KrbServer service and all its associated services. | None | + +-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Modify the **KPASSWD_PORT** parameter of KrbServer. | After this parameter is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected. | ▲▲▲▲▲ | After this parameter is modified, restart the KrbServer service and all its associated services. 
| None | + +-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Modify the domain name of Manager system. | After the domain name is modified, if the KrbServer service and its associated services are not restarted in a timely manner, the configuration of KrbClient in the cluster is abnormal and the service running is affected. | ▲▲▲▲▲ | After this parameter is modified, restart the KrbServer service and all its associated services. | None | + +-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Configure cross-cluster mutual trust relationships. | This operation will restart the KrbServer service and all associated services, affecting the management and maintenance and services of the cluster. | ▲▲▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal. | + +-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + +LdapServer High-Risk Operations +------------------------------- + +.. 
table:: **Table 13** LdapServer high-risk operations + + +----------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +==========================================================+===============================================================================================================================================================================================================================+==========+==================================================================================================================================================================+===============================================================================================================+ + | Modify the **LDAP_SERVER_PORT** parameter of LdapServer. | After this parameter is modified, if the LdapServer service and its associated services are not restarted in a timely manner, the configuration of LdapClient in the cluster is abnormal and the service running is affected. | ▲▲▲▲▲ | After this parameter is modified, restart the LdapServer service and all its associated services. | None | + +----------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Restore LdapServer data. | This operation will restart Manager and the entire cluster, affecting management, maintenance, and services of the cluster. | ▲▲▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal. | + +----------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Replace the Node where LdapServer is located. | This operation will interrupt services deployed on the node. If the node is a management node, the operation will restart all OMS processes, affecting the cluster management and maintenance. 
| ▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | Check for uncleared alarms, and check whether the cluster management and maintenance and services are normal. | + +----------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Change the password of LdapServer. | The LdapServer and Kerberos services need to be restarted during the password change, affecting the management, maintenance, and services of the cluster. | ▲▲▲▲ | Before performing the operation, ensure that the operation is necessary, and that no other management and maintenance operations are performed at the same time. | None | + +----------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + | Restart the node where LdapServer is located. | Restarting the node without stopping the LdapServer service may cause LdapServer data damage. | ▲▲▲▲▲ | Restore LdapServer using LdapServer backup data | None | + +----------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ + +Loader High-Risk Operations +--------------------------- + +.. 
table:: **Table 14** Loader high-risk operations + + +----------------------------------------------------------------------------+--------------------------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +============================================================================+==============================================================+==========+======================================================================================================================+=====================================================================================+ + | Change the floating IP address of a Loader instance (**loader.float.ip**). | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether the Loader UI can be connected properly. | + +----------------------------------------------------------------------------+--------------------------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | Modify the Loader instance start parameter **LOADER_GC_OPTS**. | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether services can be started properly. | + +----------------------------------------------------------------------------+--------------------------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | Clear table contents when adding data to HBase. | This operation will clear original data in the target table. | ▲▲ | Ensure that the contents in the target table can be cleared before the operation. | Check whether the contents in the target table can be cleared before the operation. | + +----------------------------------------------------------------------------+--------------------------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + +Spark2x High-risk Operations +---------------------------- + +.. note:: + + Spark high-risk operations apply to MRS 3.x earlier versions. + +.. 
table:: **Table 15** Spark2x high-risk operations + + +----------------------------------------------------------------------------------------+----------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +========================================================================================+======================================================================+==========+==========================================================================================================================================+===============================================================================+ + | Modify the configuration item **spark.yarn.queue**. | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether services can be started properly. | + +----------------------------------------------------------------------------------------+----------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | Modify the configuration item **spark.driver.extraJavaOptions**. | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether services can be started properly. | + +----------------------------------------------------------------------------------------+----------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | Modify the configuration item **spark.yarn.cluster.driver.extraJavaOptions**. | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether services can be started properly. | + +----------------------------------------------------------------------------------------+----------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | Modify the configuration item **spark.eventLog.dir**. | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether services can be started properly. 
| + +----------------------------------------------------------------------------------------+----------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | Modify the configuration item **SPARK_DAEMON_JAVA_OPTS**. | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether services can be started properly. | + +----------------------------------------------------------------------------------------+----------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | Delete all JobHistory2x instances. | The event logs of historical applications are lost. | ▲▲ | Reserve at least one JobHistory2x instance. | Check whether historical application information is included in JobHistory2x. | + +----------------------------------------------------------------------------------------+----------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | Delete or modify the **/user/spark2x/jars/8.1.0.1/spark-archive-2x.zip** file in HDFS. | JDBCServer2x fails to be started and service functions are abnormal. | ▲▲▲ | Delete **/user/spark2x/jars/8.1.0.1/spark-archive-2x.zip**, and wait for 10-15 minutes until the .zip package is automatically restored. | Check whether services can be started properly. | + +----------------------------------------------------------------------------------------+----------------------------------------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + +Storm High-Risk Operations +-------------------------- + +.. 
table:: **Table 16** Storm high-risk operations + + +------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +================================================================================================+============================================================================================================================================================================+=============+============================================================================================================================================================+================================================================================================+ + | Modify the following plug-in related configuration items: | Services cannot start properly. | ▲▲▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that the class names exist and are valid. | Check whether services can be started properly. | + | | | | | | + | - storm.scheduler | | | | | + | - nimbus.authorizer | | | | | + | - storm.thrift.transport | | | | | + | - nimbus.blobstore.class | | | | | + | - nimbus.topology.validator | | | | | + | - storm.principal.tolocal | | | | | + +------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Modify the Storm instance **GC_OPTS** startup parameters, including: | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether services can be started properly. | + | | | | | | + | NIMBUS_GC_OPTS | | | | | + | | | | | | + | SUPERVISOR_GC_OPTS | | | | | + | | | | | | + | UI_GC_OPTS | | | | | + | | | | | | + | LOGVIEWER_GC_OPTS | | | | | + +------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Modify the user resource pool configuration parameter **resource.aware.scheduler.user.pools**. | Services cannot run properly. | ▲▲▲ | Strictly follow the prompt information when modifying related configuration items. 
Ensure that resources allocated to each user are appropriate and valid. | Check whether services can be started and run properly | + +------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Change data directories. | If this operation is not properly performed, services may be abnormal and unavailable. | ▲▲▲▲ | Do not manually change data directories. | Check whether data directories are normal. | + +------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Restart services or instances. | The service will be interrupted for a short period of time, and ongoing operations will be interrupted. | ▲▲▲ | Restart services or instances when necessary. | Check whether the service is running properly and whether interrupted operations are restored. | + +------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Synchronize configurations (by restarting the required service). | The service will be restarted, resulting in temporary service interruption. If Supervisor is restarted, ongoing operations will be interrupted for a short period of time. | ▲▲▲ | Modify configuration when necessary. | Check whether the service is running properly and whether interrupted operations are restored. | + +------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Stop services or instances. | The service will be stopped, and related operations will be interrupted. | ▲▲▲ | Stop services when necessary. | Check whether the services are properly stopped. 
| + +------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Delete or modify metadata. | If Nimbus metadata is deleted, services are abnormal and ongoing operations are lost. | ▲▲▲▲▲ | Do not manually delete Nimbus metadata files. | Check whether Nimbus metadata files are normal. | + +------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Modify file permissions. | If permissions on the metadata and log directories are incorrectly modified, service exceptions may occur. | ▲▲▲▲ | Do not manually modify file permissions. | Check whether the permissions on the data and log directories are correct. | + +------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + | Delete topologies. | Topologies in use will be deleted. | ▲▲▲▲ | Delete topologies when necessary. | Check whether the topologies are successfully deleted. | + +------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+ + +Yarn High-Risk Operations +------------------------- + +.. table:: **Table 17** Yarn high-risk operations + + +-------------------------------------------------------------------+----------------------------------------------------+-------------+------------------------------------------+--------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +===================================================================+====================================================+=============+==========================================+============================================+ + | Delete or change data directories | This operation may cause service information loss. 
| ▲▲▲ | Do not delete data directories manually. | Check whether data directories are normal. | + | | | | | | + | **yarn.nodemanager.local-dirs** and **yarn.nodemanager.log-dirs** | | | | | + +-------------------------------------------------------------------+----------------------------------------------------+-------------+------------------------------------------+--------------------------------------------+ + +ZooKeeper High-Risk Operations +------------------------------ + +.. table:: **Table 18** ZooKeeper high-risk operations + + +------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+ + | Operation | Risk | Severity | Workaround | Check Item | + +============================================================+==============================================================================================================================================+==========+============================================================================================================================================================================+===============================================================================================+ + | Delete or change ZooKeeper data directories. | This operation may cause service information loss. | ▲▲▲ | Follow the capacity expansion guide to change the ZooKeeper data directories. | Check whether services and associated components are started properly. | + +------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+ + | Modify the ZooKeeper instance start parameter **GC_OPTS**. | Services cannot start properly. | ▲▲ | Strictly follow the prompt information when modifying related configuration items. Ensure that new values are valid. | Check whether services can be started properly. | + +------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+ + | Modify the znode ACL information in ZooKeeper. | If znode permission is modified in ZooKeeper, other users may have no permission to access the znode and some system functions are abnormal. | ▲▲▲▲ | During the modification, strictly follow the ZooKeeper Configuration Guide and ensure that other components can use ZooKeeper properly after ACL information modification. 
| Check that other components that depend on ZooKeeper can properly start and provide services. | + +------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+ diff --git a/umn/source/index.rst b/umn/source/index.rst index 66a4ddf..21f3ceb 100644 --- a/umn/source/index.rst +++ b/umn/source/index.rst @@ -2,3 +2,25 @@ Map Reduce Service - User Guide =============================== +.. toctree:: + :maxdepth: 1 + + overview/index + preparing_a_user/index + mrs_quick_start/index + configuring_a_cluster/index + managing_clusters/index + using_an_mrs_client/index + configuring_a_cluster_with_storage_and_compute_decoupled/index + accessing_web_pages_of_open_source_components_managed_in_mrs_clusters/index + accessing_manager/index + fusioninsight_manager_operation_guide_applicable_to_3.x/index + mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/index + security_description/index + high-risk_operations + backup_and_restoration/index + data_backup_and_restoration/index + appendix/index + faq/index + change_history + glossary diff --git a/umn/source/managing_clusters/alarm_management/index.rst b/umn/source/managing_clusters/alarm_management/index.rst new file mode 100644 index 0000000..10186ba --- /dev/null +++ b/umn/source/managing_clusters/alarm_management/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0112.html + +.. _mrs_01_0112: + +Alarm Management +================ + +- :ref:`Viewing the Alarm List ` +- :ref:`Viewing the Event List ` +- :ref:`Viewing and Manually Clearing an Alarm ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + viewing_the_alarm_list + viewing_the_event_list + viewing_and_manually_clearing_an_alarm diff --git a/umn/source/managing_clusters/alarm_management/viewing_and_manually_clearing_an_alarm.rst b/umn/source/managing_clusters/alarm_management/viewing_and_manually_clearing_an_alarm.rst new file mode 100644 index 0000000..a86e8e8 --- /dev/null +++ b/umn/source/managing_clusters/alarm_management/viewing_and_manually_clearing_an_alarm.rst @@ -0,0 +1,79 @@ +:original_name: mrs_01_0113.html + +.. _mrs_01_0113: + +Viewing and Manually Clearing an Alarm +====================================== + +Scenario +-------- + +You can view and clear alarms on MRS. + +Generally, the system automatically clears an alarm when the fault is rectified. If the fault has been rectified and the alarm cannot be automatically cleared, you can manually clear the alarm. + +You can view the latest 100,000 alarms (including uncleared, manually cleared, and automatically cleared alarms) on MRS. If the number of cleared alarms exceeds 100,000 and is about to reach 110,000, the system automatically dumps the earliest 10,000 cleared alarms to the dump path. + +3. In versions earlier than x, the value is the same as that of ${BIGDATA_HOME}/OMSV100R001C00x8664/workspace/data for the active management node. + +(For 3.x and later versions) The path is **${BIGDATA_HOME}/om-server/OMS/workspace/data** of the active management node. + +A directory is automatically generated when alarms are dumped for the first time. + +.. 
note:: + + Set an automatic refresh interval or click |image1| for an immediate refresh. + + The following refresh interval options are supported: + + - Refresh every 30 seconds + - Refresh every 60 seconds + - Stop refreshing + +Procedure +--------- + +#. Choose **Clusters > Active Clusters** and click a cluster name to go to the cluster details page. +#. Click **Alarms** and view the alarm information in the alarm list. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Viewing and Manually Clearing an Alarm `. + + - By default, the alarm list page displays the latest 10 alarms. + - By default, data is sorted in descending order based on the generation time. For MRS 3.x or earlier, you can click the alarm ID, severity, and generation time to modify the sorting mode. For clusters of MRS 3.x or later, you can click the severity and generation time to modify the sorting mode. + - You can filter all alarms of the same severity. The results include cleared and uncleared alarms. + - For clusters of MRS 3.x and earlier versions, you can click |image2|, |image3|, |image4| or |image5| in the upper right corner of the page to quickly filter **Critical**, **Major**, **Minor**, or **Suggestion** alarms that are uncleared. + - For clusters of MRS 3.x or later: You can click |image6|, |image7|, |image8| or |image9| in the upper right corner of the page to quickly filter uncleared **Critical**, **Major**, **Minor** or **Warning** alarms. + +3. Click **Advanced Search**. In the displayed alarm search area, set search criteria and click **Search** to view the information about specified alarms. You can click **Reset** to clear the search criteria. + + .. note:: + + The start time and end time are specified in **Time Range**. You can search for alarms generated within the time range. + + Handle the alarm by referring to **Alarm Reference**. If the alarms in some scenarios are generated due to other cloud services that MRS depends on, you need to contact maintenance personnel of the corresponding cloud services. + +4. If the alarm needs to be manually cleared after errors are rectified, click **Clear Alarm**. + + .. note:: + + If multiple alarms have been handled, you can select one or more alarms to be cleared and click **Clear Alarm** to clear the alarms in batches. A maximum of 300 alarms can be cleared in each batch. + +Exporting Alarms +---------------- + +#. Choose **Clusters > Active Clusters** and click a cluster name to go to the cluster details page. +#. Click **Alarm Management** > **Alarms**. +#. Click **Export All**. +#. In the displayed dialog box, select the type and click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0000001348737925.png +.. |image2| image:: /_static/images/en-us_image_0000001296058164.jpg +.. |image3| image:: /_static/images/en-us_image_0000001295898324.jpg +.. |image4| image:: /_static/images/en-us_image_0000001296058168.jpg +.. |image5| image:: /_static/images/en-us_image_0000001295738372.jpg +.. |image6| image:: /_static/images/en-us_image_0000001348738185.jpg +.. |image7| image:: /_static/images/en-us_image_0000001296217800.jpg +.. |image8| image:: /_static/images/en-us_image_0000001349257461.jpg +.. 
|image9| image:: /_static/images/en-us_image_0000001349057985.jpg diff --git a/umn/source/managing_clusters/alarm_management/viewing_the_alarm_list.rst b/umn/source/managing_clusters/alarm_management/viewing_the_alarm_list.rst new file mode 100644 index 0000000..2516acb --- /dev/null +++ b/umn/source/managing_clusters/alarm_management/viewing_the_alarm_list.rst @@ -0,0 +1,97 @@ +:original_name: en-us_topic_0040980162.html + +.. _en-us_topic_0040980162: + +Viewing the Alarm List +====================== + +The alarm list displays all alarms in the MRS cluster. The MRS page displays the alarms that need to be handled in a timely manner and the events. + +On the MRS management console, you can only query basic information about uncleared MRS alarms on the **Alarms** tab page. For details about how to view alarm details or manage alarms, see :ref:`Viewing and Manually Clearing an Alarm `. + +Alarms are listed in chronological order by default in the alarm list, with the most recent alarms displayed at the top. + +:ref:`Table 1 ` describes various fields in an alarm. + +.. _en-us_topic_0040980162__table5924273517010: + +.. table:: **Table 1** Alarm description + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+==============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Alarm ID | ID of an alarm. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Alarm Name | Name of an alarm. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Severity | Alarm severity. 
| + | | | + | | In versions earlier than MRS 3.x, the cluster alarm severity is as follows: | + | | | + | | - Critical | + | | | + | | Indicates alarms reporting errors that affect cluster running, such as unavailable cluster services, node faults, data inconsistency between the active and standby GaussDB databases, and abnormal LdapServer data synchronization. You need to check the cluster status based on the alarms and rectify the faults in a timely manner. | + | | | + | | - Major | + | | | + | | Indicates alarms reporting errors that affect some cluster functions, including process faults, periodic backup task failures, and abnormal key file permissions. Check the objects for which the alarms are generated based on the alarms and clear the alarms in a timely manner. | + | | | + | | - Minor | + | | | + | | Indicates alarms reporting errors that do not affect major functions of the current cluster, including alarms indicating that the certificate file is about to expire, audit logs fail to be dumped, and the license file is about to expire. | + | | | + | | - Warning | + | | | + | | Indicates an alarm of the lowest severity. It is used for information display or prompt and indicates that an event occurs in the scenarios when you stop a service, delete a service, stop an instance, delete an instance, delete a node, restart a service, restart an instance, perform an active/standby switchover for MRS Manager, scale in a host, or restore an instance. Additionally, this type of alarms also occurs when an instance is faulty, a job executed successfully, or a job failed to be executed. | + | | | + | | In MRS 3.x or later, the alarm severity of a cluster is as follows: | + | | | + | | - Critical | + | | | + | | Indicates alarms reporting errors that affect cluster running, such as unavailable cluster services, node faults, data inconsistency between the active and standby GaussDB databases, and abnormal LdapServer data synchronization. You need to check the cluster status based on the alarms and rectify the faults in a timely manner. | + | | | + | | - Major | + | | | + | | Indicates alarms reporting errors that affect some cluster functions, including process faults, periodic backup task failures, and abnormal key file permissions. Check the objects for which the alarms are generated based on the alarms and clear the alarms in a timely manner. | + | | | + | | - Minor | + | | | + | | Indicates alarms reporting errors that do not affect major functions of the current cluster, including alarms indicating that the certificate file is about to expire, audit logs fail to be dumped, and the license file is about to expire. | + | | | + | | - Suggestion | + | | | + | | Indicates an alarm of the lowest severity. It is used for information display or prompt and indicates that an event occurs in the scenarios when you stop a service, delete a service, stop an instance, delete an instance, delete a node, restart a service, restart an instance, perform an active/standby switchover for MRS Manager, scale in a host, or restore an instance. Additionally, this type of alarms also occurs when an instance is faulty, a job executed successfully, or a job failed to be executed. 
| + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Generated | Time when the alarm is generated. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Location | Details about the alarm. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | If the alarm can be manually cleared, click **Clear Alarm**. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. table:: **Table 2** Button description + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Button | Description | + +===================================+===============================================================================================================================================================================================================+ + | |image1| | Select an interval for refreshing the alarm list from the drop-down list. | + | | | + | | - Refresh every 30s | + | | - Refresh every 60s | + | | - Stop refreshing | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | |image2| | Select an alarm severity from the drop-down list box to filter alarms. 
| + | | | + | | For versions earlier than MRS 3.x, the following alarms can be filtered: All, Critical, Major, Minor, and Warning. | + | | | + | | (For MRS 3.x or later) You can filter the following alarms: All, Critical, Major, Minor, and Warning. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | |image3| | Click |image4| and manually refresh the alarm list. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Advanced Search | Click **Advanced Search**. In the displayed alarm search area, set search criteria and click **Search** to view the information about specified alarms. You can click **Reset** to clear the search criteria. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001349137705.png +.. |image2| image:: /_static/images/en-us_image_0000001296217628.png +.. |image3| image:: /_static/images/en-us_image_0000001348737925.png +.. |image4| image:: /_static/images/en-us_image_0000001348737925.png diff --git a/umn/source/managing_clusters/alarm_management/viewing_the_event_list.rst b/umn/source/managing_clusters/alarm_management/viewing_the_event_list.rst new file mode 100644 index 0000000..f29428a --- /dev/null +++ b/umn/source/managing_clusters/alarm_management/viewing_the_event_list.rst @@ -0,0 +1,115 @@ +:original_name: mrs_01_0602.html + +.. _mrs_01_0602: + +Viewing the Event List +====================== + +The event list displays information about all events in a cluster, such as service restart and service termination. + +Events are listed in chronological order by default in the event list, with the most recent events displayed at the top. + +:ref:`Table 1 ` describes various fields for an event. + +.. _mrs_01_0602__en-us_topic_0173397435_table5924273517010: + +.. table:: **Table 1** Event description + + +-----------------------------------+--------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+==========================================================================+ + | Event ID | Specifies the ID of an event. | + +-----------------------------------+--------------------------------------------------------------------------+ + | Event Severity | Specifies the event severity. | + | | | + | | In versions earlier than MRS 3.x, the cluster event level is as follows: | + | | | + | | - Critical | + | | - Major | + | | - Minor | + | | - Suggestion | + | | | + | | In MRS 3.x or later, the event level of a cluster is as follows: | + | | | + | | - Critical | + | | - Major | + | | - Minor | + | | - Suggestion | + +-----------------------------------+--------------------------------------------------------------------------+ + | Event Name | Name of the generated event. | + +-----------------------------------+--------------------------------------------------------------------------+ + | Generated | Time when the event is generated. 
| + +-----------------------------------+--------------------------------------------------------------------------+ + | Location | Specifies the detailed information for locating the event, | + +-----------------------------------+--------------------------------------------------------------------------+ + +.. table:: **Table 2** Icon description + + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Icon | Description | + +===================================+=======================================================================================================================================================================================================+ + | |image1| | Select an interval for refreshing the event list from the drop-down list. | + | | | + | | - Refresh every 30s | + | | - Refresh every 60s | + | | - Stop refreshing | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | |image2| | Click |image3| to manually refresh the event list. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Advanced Search | Click **Advanced Search**. In the displayed event search area, set search criteria and click **Search** to view the information about specified events. Click **Reset** to clear the search criteria. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Exporting events +---------------- + +#. Choose **Clusters > Active Clusters** and click a cluster name to go to the cluster details page. +#. Click **Alarm Management** > **Events**. +#. Click **Export All**. +#. In the displayed dialog box, select the type and click **OK**. + +Common Events +------------- + +.. 
table:: **Table 3** Common events + + ======== ========================================================== + Event ID Event Name + ======== ========================================================== + 12019 Stop Service + 12020 Delete Service + 12021 Stop RoleInstance + 12022 Delete RoleInstance + 12023 Delete Node + 12024 Restart Service + 12025 Restart RoleInstance + 12026 Manager Switchover + 12065 Process Restart + 12070 Job Running Succeeded + 12071 Job Running Failed + 12072 Job killed + 12086 Agent Restart + 14005 NameNode Switchover + 14028 HDFS DiskBalancer Task + 14029 Active NameNode enters safe mode and generates new Fsimage + 17001 Oozie Workflow Execution Failure + 17002 Oozie Scheduled Job Execution Failure + 18001 ResourceManager Switchover + 18004 JobHistoryServer Switchover + 19001 HMaster Failover + 20003 Hue Failover + 24002 Flume Channel Overflow + 25001 LdapServer Failover + 27000 DBServer Switchover + 38003 Adjusts the topic data storage period + 43014 Spark Data Skew + 43015 Spark SQL Large Query Results + 43016 Spark SQL Execution Timeout + 43024 Start JDBCServer + 43025 Stop JDBCServer + 43026 ZooKeeper Connection Succeeded + 43027 Zookeeper Connection Failed + ======== ========================================================== + +.. |image1| image:: /_static/images/en-us_image_0000001295898004.png +.. |image2| image:: /_static/images/en-us_image_0000001296217480.png +.. |image3| image:: /_static/images/en-us_image_0000001348737865.png diff --git a/umn/source/managing_clusters/bootstrap_actions/adding_a_bootstrap_action.rst b/umn/source/managing_clusters/bootstrap_actions/adding_a_bootstrap_action.rst new file mode 100644 index 0000000..13aa65e --- /dev/null +++ b/umn/source/managing_clusters/bootstrap_actions/adding_a_bootstrap_action.rst @@ -0,0 +1,61 @@ +:original_name: mrs_01_0416.html + +.. _mrs_01_0416: + +Adding a Bootstrap Action +========================= + +Add a bootstrap action. + +This operation applies to MRS 3.\ *x* or earlier clusters. + +Procedure +--------- + +#. Log in to the MRS management console. +#. Choose **Clusters** > **Active Clusters** and click the name of your desired cluster. +#. On the page that is displayed, click the **Bootstrap Actions** tab. +#. Click **Add** and set parameters as prompted. + + .. table:: **Table 1** Parameters + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+====================================================================================================================================================================================================+ + | Name | Name of a bootstrap action script | + | | | + | | The value can contain only digits, letters, spaces, hyphens (-), and underscores (_) and must not start with a space. | + | | | + | | The value can contain 1 to 64 characters. | + | | | + | | .. note:: | + | | | + | | A name must be unique in the same cluster. You can set the same name for different clusters. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Script Path | Script path. The value can be an OBS file system path or a local VM path. 
| + | | | + | | - An OBS file system path must start with **s3a://** and end with **.sh**, for example, **s3a://mrs-samples/**\ *xxx*\ **.sh**. | + | | - A local VM path must start with a slash (/) and end with **.sh**. | + | | | + | | .. note:: | + | | | + | | A path must be unique in the same cluster, but can be the same for different clusters. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Bootstrap action script parameters | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Execution Node | Select a type of the node where the bootstrap action script is executed. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Executed | Select the time when the bootstrap action script is executed. | + | | | + | | - Before initial component start | + | | - After initial component start | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Action upon Failure | Whether to continue to execute subsequent scripts and create a cluster after the script fails to be executed. | + | | | + | | .. note:: | + | | | + | | You are advised to set this parameter to **Continue** in the debugging phase so that the cluster can continue to be installed and started no matter whether the bootstrap action is successful. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK** to save the configuration. +#. Click **Yes**. diff --git a/umn/source/managing_clusters/bootstrap_actions/deleting_a_bootstrap_action.rst b/umn/source/managing_clusters/bootstrap_actions/deleting_a_bootstrap_action.rst new file mode 100644 index 0000000..b7539a7 --- /dev/null +++ b/umn/source/managing_clusters/bootstrap_actions/deleting_a_bootstrap_action.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_24567.html + +.. _mrs_01_24567: + +Deleting a Bootstrap Action +=========================== + +Scenario +-------- + +Delete a bootstrap action on an MRS cluster. + +Procedure +--------- + +#. Log in to the MRS management console. +#. Choose **Clusters** > **Active Clusters** and click the name of your desired cluster. +#. On the page that is displayed, click the **Bootstrap Actions** tab. +#. In the list, select the item to be deleted and click **Delete**. +#. Click **OK**. diff --git a/umn/source/managing_clusters/bootstrap_actions/index.rst b/umn/source/managing_clusters/bootstrap_actions/index.rst new file mode 100644 index 0000000..d50f6b4 --- /dev/null +++ b/umn/source/managing_clusters/bootstrap_actions/index.rst @@ -0,0 +1,26 @@ +:original_name: mrs_01_24565.html + +.. 
_mrs_01_24565: + +Bootstrap Actions +================= + +- :ref:`Introduction to Bootstrap Actions ` +- :ref:`Preparing the Bootstrap Action Script ` +- :ref:`View Execution Records ` +- :ref:`Adding a Bootstrap Action ` +- :ref:`Modifying a Bootstrap Action ` +- :ref:`Deleting a Bootstrap Action ` +- :ref:`Sample Scripts ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + introduction_to_bootstrap_actions + preparing_the_bootstrap_action_script + view_execution_records + adding_a_bootstrap_action + modifying_a_bootstrap_action + deleting_a_bootstrap_action + sample_scripts diff --git a/umn/source/managing_clusters/bootstrap_actions/introduction_to_bootstrap_actions.rst b/umn/source/managing_clusters/bootstrap_actions/introduction_to_bootstrap_actions.rst new file mode 100644 index 0000000..c3625df --- /dev/null +++ b/umn/source/managing_clusters/bootstrap_actions/introduction_to_bootstrap_actions.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_0414.html + +.. _mrs_01_0414: + +Introduction to Bootstrap Actions +================================= + +Bootstrap actions indicate that you can run your scripts on a specified cluster node before or after starting big data components. You can run bootstrap actions to install additional third-party software, modify the cluster running environment, and perform other customizations. + +If you choose to run bootstrap actions when scaling out a cluster, the bootstrap actions will be run on the newly added nodes in the same way. If auto scaling is enabled in a cluster, you can add an automation script in addition to configuring a resource plan. Then the automation script executes the corresponding script on the nodes that are scaled out or in to implement custom operations. + +MRS runs the script you specify as user **root**. You can run the **su - XXX** command in the script to switch the user. + +.. note:: + + The bootstrap action scripts must be executed as user **root**. Improper use of the script may affect the cluster availability. Therefore, exercise caution when performing this operation. + +MRS determines the result based on the return code after the execution of the bootstrap action script. If the return code is **0**, the script is executed successfully. If the return code is not **0**, the execution fails. If a bootstrap action script fails to be executed on a node, the corresponding boot script will fail to be executed. In this case, you can set **Action upon Failure** to choose whether to continue to execute the subsequent scripts. Example 1: If you set **Action upon Failure** to **Continue** for all scripts during cluster creation, all the scripts will be executed regardless of whether they are successfully executed, and the startup process will be complete. Example 2: If a script fails to be executed and **Action upon Failure** is set to **Stop**, subsequent scripts will not be executed and cluster creation or scale-out will fail. + +You can add a maximum of 18 bootstrap actions, which will be executed before or after the cluster component is started in the order you specified. The bootstrap actions performed before or after the component startup must be completed within 60 minutes. Otherwise, the cluster creation or scale-out will fail. 
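+
+The following is a minimal sketch of a bootstrap action script that follows these conventions. It is for illustration only: the target directory and the **omm** user mentioned in the comments are assumed placeholders, not values required by MRS.
+
+.. code-block:: bash
+
+   #!/bin/bash
+   # Bootstrap action scripts are started as user root.
+   # To run a command as another user, switch inside the script, for example:
+   # su - omm -c "some-command"
+
+   # Example customization: prepare a directory for additional third-party software.
+   if ! mkdir -p /opt/third-party; then
+       # Any non-zero return code tells MRS that this bootstrap action failed.
+       exit 1
+   fi
+
+   # A return code of 0 tells MRS that this bootstrap action succeeded.
+   exit 0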
diff --git a/umn/source/managing_clusters/bootstrap_actions/modifying_a_bootstrap_action.rst b/umn/source/managing_clusters/bootstrap_actions/modifying_a_bootstrap_action.rst
new file mode 100644
index 0000000..a7ec228
--- /dev/null
+++ b/umn/source/managing_clusters/bootstrap_actions/modifying_a_bootstrap_action.rst
@@ -0,0 +1,22 @@
+:original_name: mrs_01_24566.html
+
+.. _mrs_01_24566:
+
+Modifying a Bootstrap Action
+============================
+
+Scenario
+--------
+
+Modify an existing bootstrap action on an MRS cluster.
+
+Procedure
+---------
+
+#. Log in to the MRS management console.
+#. Choose **Clusters** > **Active Clusters** and click the name of your desired cluster.
+#. On the page that is displayed, click the **Bootstrap Actions** tab.
+#. In the list, select the item to be modified and click **Edit**.
+#. Modify the parameters as needed.
+#. Click **OK** to save the modification.
+#. Click **Yes**.
diff --git a/umn/source/managing_clusters/bootstrap_actions/preparing_the_bootstrap_action_script.rst b/umn/source/managing_clusters/bootstrap_actions/preparing_the_bootstrap_action_script.rst
new file mode 100644
index 0000000..8a84b61
--- /dev/null
+++ b/umn/source/managing_clusters/bootstrap_actions/preparing_the_bootstrap_action_script.rst
@@ -0,0 +1,37 @@
+:original_name: mrs_01_0417.html
+
+.. _mrs_01_0417:
+
+Preparing the Bootstrap Action Script
+=====================================
+
+Currently, bootstrap actions support Linux shell scripts only. Script files must end with **.sh**.
+
+Uploading the Installation Packages and Files to an OBS File System
+----------------------------------------------------------------------
+
+Before compiling a script, you need to upload all required installation packages, configuration packages, and relevant files to the OBS file system in the same region. Because networks of different regions are isolated from each other, MRS VMs cannot download OBS files from other regions.
+
+Compiling a Script for Downloading Files from the OBS File System
+--------------------------------------------------------------------
+
+You can specify the file to be downloaded from OBS in the script. If you upload files to a private file system, you need to run the **hadoop fs** command to download the files. The following example downloads the **obs://yourbucket/myfile.tar.gz** file to the local host and decompresses it to the **/<your-dir>** directory. Replace the **<obs-endpoint>**, **<your-ak>**, and **<your-sk>** placeholders with the OBS endpoint of your region and your own access key ID and secret access key.
+
+.. code-block:: text
+
+   #!/bin/bash
+   source /opt/Bigdata/client/bigdata_env;hadoop fs -D fs.obs.endpoint=<obs-endpoint> -D fs.obs.access.key=<your-ak> -D fs.obs.secret.key=<your-sk> -copyToLocal obs://yourbucket/myfile.tar.gz ./
+   mkdir -p /<your-dir>
+   tar -zxvf myfile.tar.gz -C /<your-dir>
+
+.. note::
+
+   - In MRS 3.x and later versions, the default installation path of the client is /opt/Bigdata/client. In versions earlier than MRS 3.x, the default installation path is /opt/client. Use the path that matches your cluster version.
+   - The Hadoop client has been preinstalled on the MRS node. You can run the **hadoop fs** command to download or upload data from or to OBS.
+   - Obtain the obs-endpoint of each region. For details, see `Regions and Endpoints `__.
+   - :ref:`Sample Scripts ` shows that the installation packages have been uploaded to a publicly readable OBS file system. Therefore, you can run the **curl** command in the sample script to download the installation packages, as shown in the brief example after this note.
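+
+If the installation packages are stored in a publicly readable OBS file system, a plain **curl** download is sufficient and no credentials are required. The following sketch assumes a hypothetical public object URL; replace it with the actual address of your package.
+
+.. code-block:: text
+
+   #!/bin/bash
+   # Download a package over HTTPS from a publicly readable bucket (placeholder URL)
+   # and unpack it to a directory of your choice.
+   curl -f -o /tmp/myfile.tar.gz https://<your-public-bucket>.<obs-endpoint>/myfile.tar.gz || exit 1
+   mkdir -p /<your-dir>
+   tar -zxvf /tmp/myfile.tar.gz -C /<your-dir>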
+ +Uploading the Script to the OBS File System +------------------------------------------- + +After script compilation, upload the script to the OBS file system in the same region. At the time you specify, each node in the cluster downloads the script from OBS and executes the script as user **root**. diff --git a/umn/source/managing_clusters/bootstrap_actions/sample_scripts.rst b/umn/source/managing_clusters/bootstrap_actions/sample_scripts.rst new file mode 100644 index 0000000..5673b19 --- /dev/null +++ b/umn/source/managing_clusters/bootstrap_actions/sample_scripts.rst @@ -0,0 +1,82 @@ +:original_name: mrs_01_0418.html + +.. _mrs_01_0418: + +Sample Scripts +============== + +Zeppelin +-------- + +Zeppelin is a web-based notebook that supports interactive data analysis. For more information, visit the Zeppelin official website at http://zeppelin.apache.org/. + +This sample script is used to automatically install Zeppelin. Select the corresponding script path based on the region where the cluster is to be created. Enter the script path in **Script Path** on the **Bootstrap Action** page when adding a bootstrap action during cluster creation. You do not need to enter parameters for this script. Based on the Zeppelin usage habit, you only need to run the script on the active Master node. + +- Script path that you need to enter when adding the bootstrap action: s3a://mrs-samples-bootstrap-eu-de/zeppelin/zeppelin_install.sh +- Path for downloading the sample script: https://mrs-samples-bootstrap-eu-de.obs.eu-de.otc.t-systems.com/zeppelin/zeppelin_install.sh + +After the bootstrap action is complete, use either of the following methods to verify that Zeppelin is correctly installed. + +Method 1: Log in to the active Master node as user **root** and run **/home/apache/zeppelin-0.7.3-bin-all/bin/zeppelin-daemon.sh status**. If the message stating "Zeppelin is running [ OK ]" is displayed, the installation is successful. + +Method 2: Start a Windows ECS in the same VPC. Access port 7510 of the active Master node in the cluster. If the Zeppelin page is displayed, the installation is successful. + +Presto +------ + +Presto is an open-source distributed SQL query engine, which is applicable to interactive analysis and query. For more information, visit the official website at http://prestodb.io/. + +The sample script can be used to automatically install Presto. The script path is as follows: + +- Script path that you need to enter when adding the bootstrap action: s3a://mrs-samples-bootstrap-eu-de/presto/presto_install.sh +- Path for downloading the sample script: https://mrs-samples-bootstrap-eu-de.obs.eu-de.otc.t-systems.com/presto/presto_install.sh + +Based on the Presto usage habit, you are advised to install **dualroles** on the active Master nodes and **worker** on the Core nodes. You are advised to add the boot operation script and configure the parameters as follows: + +.. table:: **Table 1** Bootstrap action script parameters + + +-----------------------------------+---------------------------------------------------------------------------------------+ + | Script 1 | Name: install dualroles | + | | | + | | Script Path: Select the path of the **presto-install.sh** script based on the region. 
| + | | | + | | Execution Node: Active Master | + | | | + | | Parameters: dualroles | + | | | + | | Execution Time: After component start | + | | | + | | Failed Action: Continue | + +-----------------------------------+---------------------------------------------------------------------------------------+ + | Script 2 | Name: install worker | + | | | + | | Script Path: Select the path of the **presto-install.sh** script based on the region. | + | | | + | | Execution Node: Core | + | | | + | | Parameters: worker | + | | | + | | Execution Time: After component start | + | | | + | | Failed Action: Continue | + +-----------------------------------+---------------------------------------------------------------------------------------+ + +After the bootstrap action is complete, you can start a Windows ECS in the same VPC of the cluster and access port 7520 of the active Master node to view the Presto web page. + +You can also log in to the active Master node to try Presto and run the following commands as user **root**: + +Command for loading the environment variable (In MRS 3.x and later versions, the default installation path of the client is /opt/Bigdata/client. In MRS 3.x and earlier versions, the default installation path is /opt/client. For details, see the actual situation.): + +**#source /opt/Bigdata/client/bigdata_env** + +Command for viewing the process status: + +**#/home/apache/presto/presto-server-0.201/bin/launcher status** + +Command for connecting to Presto and performing the operation + +**#/home/apache/presto/presto-server-0.201/bin/presto --server localhost:7520 --catalog tpch --schema sf100** + +**presto:sf100> select \* from nation;** + +**presto:sf100> select count(*) from customer** diff --git a/umn/source/managing_clusters/bootstrap_actions/view_execution_records.rst b/umn/source/managing_clusters/bootstrap_actions/view_execution_records.rst new file mode 100644 index 0000000..f0aee1f --- /dev/null +++ b/umn/source/managing_clusters/bootstrap_actions/view_execution_records.rst @@ -0,0 +1,31 @@ +:original_name: mrs_01_0415.html + +.. _mrs_01_0415: + +View Execution Records +====================== + +You can view the execution result of the bootstrap operation on the **Bootstrap Action** page. + +Viewing the Execution Result +---------------------------- + +#. Log in to the MRS console. + +#. In the left navigation pane, choose **Clusters** > **Active Clusters**. Click a cluster you want to query. + + The cluster details page is displayed. + +#. On the cluster details page, click the **Bootstrap Action** tab. Information about the bootstrap actions added during cluster creation is displayed. + + .. note:: + + - You select **Before initial component start** or **After initial component start** in the upper right corner to query information about the related bootstrap actions. + - The last execution result is listed here. For a newly created cluster, the records of bootstrap actions executed during cluster creation are listed. If a cluster is expanded, the records of bootstrap actions executed on the newly added nodes are listed. + +Viewing Execution Logs +---------------------- + +If you want to view the run logs of a bootstrap action, set **Action upon Failure** to **Continue** when adding the bootstrap action. And then, log in to each node to view the run logs in the **/var/log/Bootstrap** directory. If you add bootstrap actions before and after component start, you can distinguish bootstrap action logs of the two phases based on the timestamps. 
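+
+For example, after logging in to a node, you can locate and view the bootstrap action logs as follows. This is a sketch; the exact file names under **/var/log/Bootstrap** depend on your scripts and execution times.
+
+.. code-block:: text
+
+   # List the bootstrap action logs on this node.
+   ls -l /var/log/Bootstrap
+
+   # View a specific log file; pick the file whose timestamp matches the phase
+   # (before or after component start) that you are interested in.
+   cat /var/log/Bootstrap/<log-file-name>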
+ +You are advised to print logs in detail in the script so that you can view the detailed run result. MRS redirects the standard output and error output of the script to the log directory of the bootstrap action. diff --git a/umn/source/managing_clusters/cluster_o&m/changing_the_subnet_of_a_cluster.rst b/umn/source/managing_clusters/cluster_o&m/changing_the_subnet_of_a_cluster.rst new file mode 100644 index 0000000..885890f --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/changing_the_subnet_of_a_cluster.rst @@ -0,0 +1,105 @@ +:original_name: mrs_01_24259.html + +.. _mrs_01_24259: + +Changing the Subnet of a Cluster +================================ + +If the current subnet does not have sufficient IP addresses, you can change to another subnet in the same VPC of the current cluster to obtain more available subnet IP addresses. Changing a subnet does not affect the IP addresses or subnets of existing nodes. + +For details about how to configure network ACL outbound rules, see :ref:`How Do I Configure a Network ACL Outbound Rule? ` + +Changing a Subnet When No Network ACL Is Associated +--------------------------------------------------- + +#. Log in to the MRS console. + +#. Click the target cluster name to go to its details page. + +#. Click **Change Subnet** on the right of **Default Subnet**. + +#. Select the target subnet and click **OK**. + + If no subnet is available, click **Create Subnet** to create a subnet first. + +Changing a Subnet When a Network ACL Is Associated +-------------------------------------------------- + +#. Log in to the MRS console and click the target cluster to go to its details page. + +#. .. _mrs_01_24259__li169975160296: + + In the **Basic Information** area, view **VPC**. + +#. .. _mrs_01_24259__li16830135519358: + + Log in to the VPC console. In the navigation pane on the left, choose **Virtual Private Cloud** and obtain the IPv4 CIDR block corresponding to the VPC obtained in :ref:`2 `. + +#. .. _mrs_01_24259__li69549305519: + + Choose **Access Control** > **Network ACLs** and click the name of the network ACL that is associated with the default and new subnets. + + .. note:: + + If both the default and new subnets are associated with a network ACL, add inbound rules to the network ACL by referring to :ref:`5 ` to :ref:`7 `. + + + .. figure:: /_static/images/en-us_image_0000001348738033.png + :alt: **Figure 1** Network ACLs + + **Figure 1** Network ACLs + +#. .. _mrs_01_24259__li1734493314818: + + On the **Inbound Rules** page, choose **More** > **Insert Rule Above** in the **Operation** column. + +#. Add a network ACL rule. Set **Action** to **Allow**, **Source** to the VPC IPv4 CIDR block obtained in :ref:`3 `, and retain the default values for other parameters. + +#. .. _mrs_01_24259__li13751204692115: + + Click **OK**. + + .. note:: + + If you do not want to allow access from all IPv4 CIDR blocks of the VPC, add the IPv4 CIDR blocks of the default and new subnets by performing :ref:`8 ` to :ref:`12 `. If the rules for VPC IPv4 CIDR blocks have been added, skip :ref:`8 ` to :ref:`12 `. + +#. .. _mrs_01_24259__li211072116246: + + Log in to the MRS console. + +#. Click the target cluster to go to its details page. + +#. Click **Change Subnet** on the right of **Default Subnet**. + +#. Obtain the IPv4 CIDR blocks of the default and new subnets. + + .. important:: + + In this case, you do not need to click **OK** displayed in the **Change Subnet** dialog box. 
Otherwise, the default subnet will be updated to the new subnet, thereby making it difficult to query the IPv4 CIDR block of the default subnet. Exercise caution when performing this operation. + +#. .. _mrs_01_24259__li7329647202914: + + Add the IPv4 CIDR blocks of the default and target subnets to the inbound rules of the network ACL bound to the two subnets by referring to :ref:`4 ` to :ref:`7 `. + +#. Log in to the MRS console. + +#. Click the target cluster to go to its details page. + +#. Click **Change Subnet** on the right of **Default Subnet**. + +#. Select the target subnet and click **OK**. + +.. _mrs_01_24259__section1070017367443: + +How Do I Configure a Network ACL Outbound Rule? +----------------------------------------------- + +- Method 1 + + Allow all outbound traffic. This method ensures that clusters can be created and used properly. + +- Method 2 + + Allow the mandatory outbound rules that can ensure the successful creation of clusters. You are not advised to use this method because created clusters may not run properly due to absent outbound rules. If the preceding problem occurs, contact O&M personnel. + + Similar to the example provided in method 1, set **Action** to **Allow** and add the outbound rules whose destinations are the address with **Secure Communications** enabled, NTP server address, OBS server address, OpenStack address, and DNS server address, respectively. diff --git a/umn/source/managing_clusters/cluster_o&m/checking_health_status/before_you_start.rst b/umn/source/managing_clusters/cluster_o&m/checking_health_status/before_you_start.rst new file mode 100644 index 0000000..c61224e --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/checking_health_status/before_you_start.rst @@ -0,0 +1,12 @@ +:original_name: mrs_01_0603.html + +.. _mrs_01_0603: + +Before You Start +================ + +This section describes how to manage health checks on the MRS console. + +Health check management operations on the MRS console apply only to clusters of **MRS 1.9.2** **to MRS 2.1.x**. + +Health check management on Manager applies to all versions. For MRS 3.x and later versions, see :ref:`Viewing a Health Check Task `. For versions earlier than MRS 3.x, see :ref:`Performing a Health Check `. diff --git a/umn/source/managing_clusters/cluster_o&m/checking_health_status/index.rst b/umn/source/managing_clusters/cluster_o&m/checking_health_status/index.rst new file mode 100644 index 0000000..81f8f83 --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/checking_health_status/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0223.html + +.. _mrs_01_0223: + +Checking Health Status +====================== + +- :ref:`Before You Start ` +- :ref:`Performing a Health Check ` +- :ref:`Viewing and Exporting a Health Check Report ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + before_you_start + performing_a_health_check + viewing_and_exporting_a_health_check_report diff --git a/umn/source/managing_clusters/cluster_o&m/checking_health_status/performing_a_health_check.rst b/umn/source/managing_clusters/cluster_o&m/checking_health_status/performing_a_health_check.rst new file mode 100644 index 0000000..ddf3470 --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/checking_health_status/performing_a_health_check.rst @@ -0,0 +1,58 @@ +:original_name: mrs_01_0224.html + +.. 
_mrs_01_0224: + +Performing a Health Check +========================= + +Scenario +-------- + +To ensure that cluster parameters, configurations, and monitoring are correct and that the cluster can run stably for a long time, you can perform a health check during routine maintenance. + +.. note:: + + A system health check includes MRS Manager, service-level, and host-level health checks: + + - MRS Manager health checks focus on whether the unified management platform can provide management functions. + - Service-level health checks focus on whether components can provide services properly. + - Host-level health checks focus on whether host indicators are normal. + + The system health check includes three types of check items: health status, related alarms, and customized monitoring indicators for each check object. The health check results are not always the same as the **Health Status** on the portal. + +Procedure +--------- + +- Manually perform the health check for all services. + + On the MRS details page, choose **Management Operations** > **Start Cluster Health Check**. + + .. note:: + + For the operations on MRS Manager of MRS 1.7.2 or earlier, see :ref:`Performing a Health Check `; for the operations on FusionInsight Manager of MRS 3.\ *x* or later, see :ref:`Overview `. + + - The cluster health check includes Manager, service, and host status checks. + - To perform cluster health checks, you can also choose **System** > **Check Health Status** > **Start Cluster Health Check** on MRS Manager. + - To export the health check result, click **Export Report** in the upper left corner. + +- Manually perform the health check for a service. + + #. On the MRS cluster details page, click **Components**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Performing a Health Check `. + + #. Select the target service from the service list. + #. Choose **More** > **Start Service Health Check** to start the health check for the service. + +- Manually perform the health check for a host. + + #. On the MRS details page, click **Nodes**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Performing a Health Check `. For MRS 3.\ *x* or later, see :ref:`Performing a Host Health Check `. + + #. Expand the node group information and select the check box of the host to be checked. + #. Choose **Node** > **Start Host Health Check** to start the health check for the host. diff --git a/umn/source/managing_clusters/cluster_o&m/checking_health_status/viewing_and_exporting_a_health_check_report.rst b/umn/source/managing_clusters/cluster_o&m/checking_health_status/viewing_and_exporting_a_health_check_report.rst new file mode 100644 index 0000000..a2ce551 --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/checking_health_status/viewing_and_exporting_a_health_check_report.rst @@ -0,0 +1,37 @@ +:original_name: mrs_01_0225.html + +.. _mrs_01_0225: + +Viewing and Exporting a Health Check Report +=========================================== + +Scenario +-------- + +You can view the health check result on MRS and export it for further analysis. + +.. note:: + + A system health check includes MRS Manager, service-level, and host-level health checks: + + - MRS Manager health checks focus on whether the unified management platform can provide management functions. + - Service-level health checks focus on whether components can provide services properly. + - Host-level health checks focus on whether host indicators are normal. 
+ + The system health check includes three types of check items: health status, related alarms, and customized monitoring indicators for each check object. The health check results are not always the same as the **Health Status** on the portal. + +Prerequisites +------------- + +You have performed a health check. + +Procedure +--------- + +#. On the MRS details page, choose **Management Operations** > **View Cluster Health Check Report**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Viewing and Exporting a Health Check Report `. For MRS 3.x, see :ref:`Managing Health Check Reports `. + +#. Click **Export Report** on the health check report pane to export the report and view detailed information about check items. diff --git a/umn/source/managing_clusters/cluster_o&m/configuring_message_notification.rst b/umn/source/managing_clusters/cluster_o&m/configuring_message_notification.rst new file mode 100644 index 0000000..5f4c341 --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/configuring_message_notification.rst @@ -0,0 +1,115 @@ +:original_name: mrs_01_0062.html + +.. _mrs_01_0062: + +Configuring Message Notification +================================ + +MRS uses SMN to offer a publish/subscribe model to achieve one-to-multiple message subscriptions and notifications in a variety of message types (SMSs and emails). + +Scenario +-------- + +On the MRS management console, you can enable or disable the notification service on the **Alarms** page. The functions in the following scenarios can be implemented only after the required cluster function is enabled: + +- After a user subscribes to the notification service, the MRS management plane notifies the user of success or failure of manual cluster scale-out and scale-in, cluster termination, and auto scaling by emails or SMS messages. +- The management plane checks the alarms about the MRS cluster and sends a notification to the tenant if the alarms are critical. +- If either of the operations such as deletion, shutdown, specifications modification, restart, and OS update is performed on an ECS in a cluster, the MRS cluster works abnormally. The management plane notifies a user when detecting that the VM of the user is in either of the preceding operations. + +Creating a Topic +---------------- + +A topic is a specified event for message publication and notification subscription. It serves as a message sending channel, where publishers and subscribers can interact with each other. + +#. Log in to the management console. + +#. Click **Service List**. Under **Management & Governance**, click **Simple Message Notification**. + + The **SMN** page is displayed. + +#. In the navigation pane, choose **Topic Management** > **Topics**. + + The **Topics** page is displayed. + +#. Click **Create Topic**. + + The **Create Topic** dialog box is displayed. + +#. In **Topic Name**, enter a topic name. In **Display Name**, enter a display name. + +#. Select an existing project from the **Enterprise Project** drop-down list, or click **Create Enterprise Project** to create an enterprise project on the **Enterprise Project Management** page and then select it. + +#. Set tag keys and tag values. Tags consist of keys and values. They identify cloud resources so that you can easily categorize and search for your resources. + +.. _mrs_01_0062__section186691424145018: + +Adding Subscriptions to a Topic +------------------------------- + +To deliver messages published to a topic to subscribers, you must add subscription endpoints to the topic. 
SMN automatically sends a confirmation message to the subscription endpoint. The confirmation message is valid only within 48 hours. The subscribers must confirm the subscription within 48 hours so that they can receive notification messages. Otherwise, the confirmation message becomes invalid, and you need to send it again. + +#. Log in to the management console. + +#. Under **Management & Governance**, click **Simple Message Notification**. + + The **SMN** page is displayed. + +#. In the navigation pane, choose **Topic Management** > **Topics**. + + The **Topics** page is displayed. + +#. Locate the topic to which you want to add a subscription, click **More** in the **Operation** column, and select **Add Subscription**. + + The **Add Subscription** box is displayed. + + Protocol can be set to **SMS**, FunctionGraph (function), **HTTP**, **HTTPS**, and **Email**. + + **Endpoint** indicates the address of the subscription endpoint. SMS and email, endpoints can be entered in batches. When adding endpoints in batches, each endpoint address occupies a line. You can enter a maximum of 10 endpoints. + +5. Click **OK**. + +The subscription you added is displayed in the subscription list. + +Sending Notifications to Subscribers +------------------------------------ + +#. Log in to the MRS console. +#. Click |image1| in the upper-left corner on the management console and select a region and project. +#. Choose **Clusters > Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. +#. Click **Alarms**. +#. Choose **Notification Rules** > **Add Notification Rule**. The **Add Notification Rule** page is displayed. +#. Set the notification rule parameters. + + .. table:: **Table 1** Parameters of a notification rule + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=====================================================================================================================+ + | Rule Name | User-defined notification rule name. Only digits, letters, hyphens (-), and underscores (_) are allowed. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------+ + | Message Notification | - If you enable this function, the system sends notifications to subscribers based on the notification rule. | + | | - If you disable this function, the rule does not take effect, that is, notifications are not sent to subscribers. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------+ + | Topic Name | Select an existing topic or click **Create Topic** to create a topic. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------+ + | Notification Type | Select the type of the notification to be subscribed to. | + | | | + | | - Alarm | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------+ + | Subscription Items | Select the items to be subscribed to. You can select all or some items as required. 
| + | | | + | | Subscription rules in MRS 3.\ *x* or later: | + | | | + | | Alarm severity: critical, major, and minor | + | | | + | | Subscription rules in versions earlier than MRS 3.x: | + | | | + | | - Critical | + | | - Major | + | | - Minor | + | | - Suggestion | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0000001295898140.png diff --git a/umn/source/managing_clusters/cluster_o&m/importing_and_exporting_data.rst b/umn/source/managing_clusters/cluster_o&m/importing_and_exporting_data.rst new file mode 100644 index 0000000..c51adf6 --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/importing_and_exporting_data.rst @@ -0,0 +1,165 @@ +:original_name: en-us_topic_0019489057.html + +.. _en-us_topic_0019489057: + +Importing and Exporting Data +============================ + +Through the **Files** tab page, you can create, delete, import, export, delete files in the analysis cluster. Currently, file creation is not supported. Streaming clusters do not support the file management function on the MRS GUI. In a cluster with Kerberos authentication enabled, to read or write the folders in the root directory, add a role that has the required permissions on the folders by referring to :ref:`Creating a Role `. Then, add the new role to the user group to which the user who submits the job belongs by referring to :ref:`Related Tasks `. + +Background +---------- + +Data sources processed by MRS are from OBS or HDFS. OBS is an object-based storage service that provides you with massive, secure, reliable, and cost-effective data storage capabilities. MRS can process data in OBS directly. You can view, manage, and use data by using the web page of the management control platform or OBS client. In addition, you can use REST APIs independently or integrate APIs to service applications to manage and access data. + +Before creating jobs, upload the local data to OBS for MRS to compute and analyze. MRS allows exporting data from OBS to HDFS for computing and analyzing. After the data analysis and computing are completed, you can store the data in HDFS or export them to OBS. HDFS and OBS can also store the compressed data in the format of **bz2** or **gz**. + +.. _en-us_topic_0019489057__section6302178417377: + +Importing Data +-------------- + +Currently, MRS can only import data from OBS to HDFS. The file upload rate decreases with the increase of the file size. This mode applies to scenarios where the data volume is small. + +You can perform the following steps to import files and directories: + +#. Log in to the MRS console. + +#. Click |image1| in the upper-left corner on the management console and select a region and project. + +#. Choose **Clusters > Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's information. + +#. Click the **Files** tab, and go to the file management page. + +#. Select **HDFS File List**. + +#. Go to the data storage directory, for example, **bd_app1**. + + The **bd_app1** directory is only an example. You can use any directory on the page or create a new one. + + The requirements for creating a folder are as follows: + + - The folder name contains a maximum of 255 characters, and the full path cannot exceed 1,023 characters. + - The folder name cannot be empty. 
+ - The folder name cannot contain the following special characters: :literal:`/:*?"<>|\\;&,'`!{}[]$%+` + - The value cannot start or end with a period (.). + - The spaces at the beginning and end are ignored. + +#. Click **Import Data** and configure the HDFS and OBS paths correctly. When configuring the OBS or HDFS path, click **Browse**, select a file directory, and click **Yes**. + + - OBS path + + - The path must start with **obs://**. MRS 1.7.2 or earlier: The value must start with **s3a://**. + - Files or programs encrypted by KMS cannot be imported. + - An empty folder cannot be imported. + - The directory and file name can contain letters, digits, hyphens (-), and underscores (_), but cannot contain the following special characters: ``;|&>,<'$*?\`` + - The directory and file name cannot start or end with a space, but can contain spaces between them. + - The OBS full path contains a maximum of 1,023 characters. + + - HDFS path + + - The path starts with **/user** by default. + - The directory and file name can contain letters, digits, hyphens (-), and underscores (_), but cannot contain the following special characters: ``;|&>,<'$*?\`` + - The directory and file name cannot start or end with a space, but can contain spaces between them. + - The HDFS full path contains a maximum of 1,023 characters. + - The HDFS parent directory in **HDFS File List** is displayed in the **HDFS Path** text box by default. + +#. Click **OK**. + + You can view the file upload progress on the **File Operation Records** tab page. MRS processes the data import operation as a DistCp job. You can also check whether the DistCp job is successfully executed on the **Jobs** tab page. + +Exporting Data +-------------- + +After the data analysis and computing are completed, you can store the data in HDFS or export them to OBS. + +You can perform the following steps to export files and directories: + +#. Log in to the MRS console. + +#. Click |image2| in the upper-left corner on the management console and select a region and project. + +#. Choose **Clusters > Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's basic information. + +#. Click the **Files** tab, and the file management page is displayed. + +#. Select **HDFS File List**. + +#. Go to the data storage directory, for example, **bd_app1**. + +#. Click **Export Data** and configure the OBS and HDFS paths. When configuring the OBS or HDFS path, click **Browse**, select a file directory, and click **Yes**. + + - OBS path + + - The path must start with **obs://**. MRS 1.7.2 or earlier: The value must start with **s3a://**. + - The directory and file name can contain letters, digits, hyphens (-), and underscores (_), but cannot contain the following special characters: ``;|&>,<'$*?\`` + - The directory and file name cannot start or end with a space, but can contain spaces between them. + - The OBS full path contains a maximum of 1,023 characters. + + - HDFS path + + - The path starts with **/user** by default. + - The directory and file name can contain letters, digits, hyphens (-), and underscores (_), but cannot contain the following special characters: ``;|&>,<'$*?\`` + - The directory and file name cannot start or end with a space, but can contain spaces between them. + - The HDFS full path contains a maximum of 1,023 characters. + - The HDFS parent directory in **HDFS File List** is displayed in the **HDFS Path** text box by default. + + .. 
note:: + + When a folder is exported to OBS, a label file named **folder name_$folder$** is added to the OBS path. Ensure that the exported folder is not empty. If the exported folder is empty, OBS cannot display the folder and only generates a file named **folder name_$folder$**. + +#. Click **OK**. + + You can view the file upload progress on the **File Operation Records** tab page. MRS processes the data export operation as a DistCp job. You can also check whether the DistCp job is successfully executed on the **Jobs** tab page. + +Viewing Operation Logs +---------------------- + +When importing and exporting data on the MRS management console, you can choose **Files > File Operation Records** to view the data import and export progress. + +:ref:`Table 1 ` describes the parameters of the file operation record. + +.. _en-us_topic_0019489057__table59621065102929: + +.. table:: **Table 1** File operation record parameters + + +-----------------------------------+---------------------------------------------------+ + | Parameter | Description | + +===================================+===================================================+ + | Submitted | Start time of data import or export. | + +-----------------------------------+---------------------------------------------------+ + | Source Path | Source path of data. | + | | | + | | - OBS path during data import. | + | | - HDFS path during data export. | + +-----------------------------------+---------------------------------------------------+ + | Target Path | Target path of data. | + | | | + | | - HDFS path during data import. | + | | - OBS path during data import. | + +-----------------------------------+---------------------------------------------------+ + | Status | Status during data import or export. | + | | | + | | - Submitted | + | | - Accepted | + | | - Running | + | | - Completed | + | | - Terminated | + | | - Abnormal | + +-----------------------------------+---------------------------------------------------+ + | Duration (min) | Time of data import or export. | + | | | + | | The unit is minute. | + +-----------------------------------+---------------------------------------------------+ + | Result | Result of data import or export. | + | | | + | | - Successful | + | | - Failed | + | | - Killed | + | | - Undefined | + +-----------------------------------+---------------------------------------------------+ + | Operation | View Log: allows you to view file operation logs. | + +-----------------------------------+---------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001296217736.png +.. |image2| image:: /_static/images/en-us_image_0000001295738308.png diff --git a/umn/source/managing_clusters/cluster_o&m/index.rst b/umn/source/managing_clusters/cluster_o&m/index.rst new file mode 100644 index 0000000..78282f3 --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/index.rst @@ -0,0 +1,26 @@ +:original_name: mrs_01_24295.html + +.. _mrs_01_24295: + +Cluster O&M +=========== + +- :ref:`Importing and Exporting Data ` +- :ref:`Changing the Subnet of a Cluster ` +- :ref:`Configuring Message Notification ` +- :ref:`Checking Health Status ` +- :ref:`Remote O&M ` +- :ref:`Viewing MRS Operation Logs ` +- :ref:`Terminating a Cluster ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + importing_and_exporting_data + changing_the_subnet_of_a_cluster + configuring_message_notification + checking_health_status/index + remote_o&m/index + viewing_mrs_operation_logs + terminating_a_cluster diff --git a/umn/source/managing_clusters/cluster_o&m/remote_o&m/authorizing_o&m.rst b/umn/source/managing_clusters/cluster_o&m/remote_o&m/authorizing_o&m.rst new file mode 100644 index 0000000..2874736 --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/remote_o&m/authorizing_o&m.rst @@ -0,0 +1,19 @@ +:original_name: mrs_01_0641.html + +.. _mrs_01_0641: + +Authorizing O&M +=============== + +If you need technical support personnel to help you with troubleshooting, you can use the O&M authorization function to authorize technical support personnel to access your local host for fault location. + +Procedure +--------- + +#. Log in to the MRS management console. +#. Click |image1| in the upper-left corner on the management console and select a region and project. +#. In the navigation tree of the MRS management console, choose **Clusters** > **Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. +#. In the upper right corner of the page, click **O&M**, choose **Authorize O&M**, and select the deadline for the support personnel to access the local host. Before the deadline, the support personnel have the temporary permission to access the local host. +#. After the fault is rectified, click **O&M** in the upper right corner of the page and select **Cancel Authorization** to cancel the access permission for the support personnel. + +.. |image1| image:: /_static/images/en-us_image_0000001295738168.png diff --git a/umn/source/managing_clusters/cluster_o&m/remote_o&m/index.rst b/umn/source/managing_clusters/cluster_o&m/remote_o&m/index.rst new file mode 100644 index 0000000..c10f970 --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/remote_o&m/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_0520.html + +.. _mrs_01_0520: + +Remote O&M +========== + +- :ref:`Authorizing O&M ` +- :ref:`Sharing Logs ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + authorizing_o&m + sharing_logs diff --git a/umn/source/managing_clusters/cluster_o&m/remote_o&m/sharing_logs.rst b/umn/source/managing_clusters/cluster_o&m/remote_o&m/sharing_logs.rst new file mode 100644 index 0000000..481613a --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/remote_o&m/sharing_logs.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0642.html + +.. _mrs_01_0642: + +Sharing Logs +============ + +If you need technical support personnel to help you with troubleshooting, you can use the log sharing function to provide logs in a specific time to technical support personnel for fault location. + +Procedure +--------- + +#. Log in to the MRS management console. +#. Click |image1| in the upper-left corner on the management console and select a region and project. +#. In the navigation tree of the MRS management console, choose **Clusters** > **Active Clusters**, select a cluster, and click its name to switch to the cluster details page. +#. In the upper right corner of the displayed page, choose **O&M** > **Share Log** to open the **Share Log** dialog box. +#. Select the start time and end time in **Time Range**. + + .. note:: + + - Select **Time Range** based on the suggestions of support personnel. + - **End Date** must be later than **Start Date**. Otherwise, logs cannot be filtered by time. + +.. 
|image1| image:: /_static/images/en-us_image_0000001348738349.png diff --git a/umn/source/managing_clusters/cluster_o&m/terminating_a_cluster.rst b/umn/source/managing_clusters/cluster_o&m/terminating_a_cluster.rst new file mode 100644 index 0000000..c477682 --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/terminating_a_cluster.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0042.html + +.. _mrs_01_0042: + +Terminating a Cluster +===================== + +You can terminate an MRS cluster after job execution is complete. + +Background +---------- + +You can manually terminate a cluster after data analysis is complete or when the cluster encounters an exception. A cluster failed to be deployed will be automatically terminated. + +Procedure +--------- + +#. Log in to the MRS console. + +#. In the navigation tree of the MRS console, choose **Clusters** > **Active Clusters**. + +#. Locate the cluster to be terminated, and click **Terminate** in the **Operation** column. + + The cluster status changes from **Running** to **Terminating**, and finally to **Terminated**. You can view the clusters in **Terminated** state in **Cluster History**. diff --git a/umn/source/managing_clusters/cluster_o&m/viewing_mrs_operation_logs.rst b/umn/source/managing_clusters/cluster_o&m/viewing_mrs_operation_logs.rst new file mode 100644 index 0000000..85d4cff --- /dev/null +++ b/umn/source/managing_clusters/cluster_o&m/viewing_mrs_operation_logs.rst @@ -0,0 +1,84 @@ +:original_name: en-us_topic_0012808265.html + +.. _en-us_topic_0012808265: + +Viewing MRS Operation Logs +========================== + +You can view operation logs of clusters and jobs on the **Operation Logs** page. Log information is typically used for quickly locating faults in case of cluster exceptions, helping users resolve problems. + +Operation Type +-------------- + +Currently, the following operation logs are provided by MRS. You can filter the logs in the search box. + +- Cluster operations + + - Creating, deleting, scaling out, and scaling in a cluster + - Creating and deleting a directory, deleting a file + +- Job operations: Creating, stopping, and deleting a job +- Data operations: IAM user tasks, adding user, and adding user group + +Log Fields +---------- + +Logs are listed in chronological order by default in the log list, with the most recent logs displayed at the top. + +:ref:`Table 1 ` describes various fields in a log. + +.. _en-us_topic_0012808265__table5924273517010: + +.. table:: **Table 1** Log description + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===============================================================================================================================================================================================+ + | Operation Type | Various types of operations, including: | + | | | + | | - Cluster operations | + | | - Job operations | + | | - Data operations | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation IP | IP address where an operation is performed. | + | | | + | | .. 
note:: | + | | | + | | If an MRS cluster fails to be deployed, the cluster is automatically deleted, and the operation logs of the automatically deleted cluster do not contain the **Operation IP** of the user. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | Operation details. The value can contain a maximum of 2048 characters. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Time | Operation time. For a deleted cluster, only logs generated within the last six months are displayed. To view logs generated six months ago, contact technical support. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. table:: **Table 2** Icon description + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Icon | Description | + +===================================+===================================================================================================================================================================================================+ + | |image1| | Select an operation type from the drop-down list box to filter logs. | + | | | + | | - **All Operation Types**: Filter all logs. | + | | - **Cluster**: Filter logs for **Cluster**. | + | | - **Job**: Filter logs for **Job**. | + | | - **Data**: Filter logs for **Data**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | |image2| | Filter logs by time. | + | | | + | | #. Click the input box. | + | | #. Specify the date and time. | + | | #. Click **OK**. | + | | | + | | The left-side input box indicates the start time and the right-side one indicates the end time. The start time must be earlier than or equal to the end time. Otherwise, logs cannot be filtered. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | |image3| | Enter a keyword of the **Operation Details** in the search box and click |image4| to search for logs. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | |image5| | Click |image6| to manually refresh the log list. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. 
|image1| image:: /_static/images/en-us_image_0000001349257469.png +.. |image2| image:: /_static/images/en-us_image_0000001348738189.jpg +.. |image3| image:: /_static/images/en-us_image_0000001349057965.png +.. |image4| image:: /_static/images/en-us_image_0000001349057965.png +.. |image5| image:: /_static/images/en-us_image_0000001349057929.png +.. |image6| image:: /_static/images/en-us_image_0000001349057929.png diff --git a/umn/source/managing_clusters/cluster_overview/checking_the_cluster_status.rst b/umn/source/managing_clusters/cluster_overview/checking_the_cluster_status.rst new file mode 100644 index 0000000..8504c2e --- /dev/null +++ b/umn/source/managing_clusters/cluster_overview/checking_the_cluster_status.rst @@ -0,0 +1,126 @@ +:original_name: en-us_topic_0012808230.html + +.. _en-us_topic_0012808230: + +Checking the Cluster Status +=========================== + +The cluster list contains all clusters in MRS. You can view clusters in various states. If a large number of clusters are involved, navigate through multiple pages to view all of the clusters. + +MRS, as a platform managing and analyzing massive data, provides a PB-level data processing capability. MRS allows you to create multiple clusters. The cluster quantity is subject to that of ECSs. + +Clusters are listed in chronological order by default in the cluster list, with the most recent cluster displayed at the top. :ref:`Table 1 ` describes the cluster list parameters. + +- **Active Clusters**: contain all clusters except the clusters in the **Failed** and **Terminated** states. +- **Cluster History**: contains the tasks in the **Terminated** states. Only clusters terminated within the last six months are displayed. If you want to view clusters terminated six months ago, contact technical support engineers. +- **Failed Tasks**: only contain the tasks in the **Failed** state. Task failures include: + + - Cluster creation failure + - Cluster termination failure + - Cluster scale-out failure + - Cluster scale-in failure + - Cluster patch installation failure (supported only by versions earlier than MRS 3.x) + - Cluster patch uninstallation failure (supported only by versions earlier than MRS 3.x) + - Cluster specifications upgrade failure + +.. _en-us_topic_0012808230__table3950169215120: + +.. table:: **Table 1** Parameters in the active cluster list + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===================================================================================================================================================================================================================================================================================================================================================================================+ + | Name/ID | Cluster name, which is set when a cluster is created. Unique identifier of a cluster, which is automatically assigned when a cluster is created. | + | | | + | | - |image1|: Change the cluster name. | + | | - |image2|: Copy the cluster ID. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Version | Cluster version. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Nodes | Number of nodes that can be deployed in a cluster. This parameter is set when a cluster is created. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Status and operation progress description of a cluster. | + | | | + | | The cluster creation progress includes: | + | | | + | | - Verifying cluster parameters | + | | - Applying for cluster resources | + | | - Creating VMs | + | | - Initializing VMs | + | | - Installing MRS Manager | + | | - Deploying the cluster | + | | - Cluster installation failed | + | | | + | | The cluster scale-out progress includes: | + | | | + | | - Preparing for scale-out | + | | - Creating VMs | + | | - Initializing VMs | + | | - Adding nodes to the cluster | + | | - Scale-out failed | + | | | + | | The cluster scale-in progress includes: | + | | | + | | - Preparing for scale-in | + | | - Decommissioning instance | + | | - Deleting VMs | + | | - Deleting nodes from the cluster | + | | - Scale-in failed | + | | | + | | The system will display causes of cluster installation, scale-out, and scale-in failures. For details, see :ref:`Table 5 `. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Created | The cluster node is successfully created. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Terminated | Time when a cluster node stops and the cluster node begins to be terminated. This parameter is valid only for historical clusters displayed on the **Cluster History** page. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | AZ | Availability zone (AZ) in the region of a cluster, which is set when a cluster is created. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | **Terminate**: If you want to terminate a cluster after jobs are complete, click **Terminate**. The cluster status changes from **Running** to **Terminating**. After the cluster is terminated, the cluster status will change to **Terminated** and will be displayed in **Cluster History**. If the MRS cluster fails to be deployed, the cluster is automatically terminated. | + | | | + | | This parameter is displayed in **Active Clusters** only. | + | | | + | | .. note:: | + | | | + | | Typically after data is analyzed and stored, or when the cluster encounters an exception and cannot work, you can terminate a cluster. If a cluster is terminated before data processing and analysis are completed, data loss may occur. Therefore, exercise caution when terminating a cluster. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. table:: **Table 2** Button description + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Button | Description | + +===================================+=================================================================================================================================================================================================================================================================+ + | |image3| | In the drop-down list, select a status to filter clusters: | + | | | + | | - Active Clusters | + | | | + | | - All operation types: displays all existing clusters. | + | | - Starting: displays existing clusters in the **Starting** state. | + | | - Running: displays existing clusters in the **Running** state. | + | | - Scaling out: displays existing clusters in the **Scaling out** state. | + | | - Scaling in: displays existing clusters in the **Scaling in** state. | + | | - Abnormal: displays existing clusters in the **Abnormal** state. | + | | - Terminating: displays existing clusters in the **Terminating** state. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | |image4| | Choose **Clusters > Active Clusters** and click |image5| to go to the page for managing failed tasks. | + | | | + | | |image6| *Num*: displays the failed tasks in the **failed** state. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | |image7| | Enter a cluster name in the search bar and click |image8| to search for a cluster. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Search by Tag | Click **Search by Tag**, enter the tag of the cluster to be queried, and click **Search** to search for the clusters. | + | | | + | | You can select a tag key or tag value from their drop-down lists. When the tag key or tag value is exactly matched, the system can automatically locate the target cluster. If you enter multiple tags, their intersections are used to search for the cluster. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | |image9| | Click |image10| to manually refresh the cluster list. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001296057872.png +.. |image2| image:: /_static/images/en-us_image_0000001349257477.png +.. |image3| image:: /_static/images/en-us_image_0000001349137889.png +.. |image4| image:: /_static/images/en-us_image_0000001296058044.jpg +.. |image5| image:: /_static/images/en-us_image_0000001296058044.jpg +.. |image6| image:: /_static/images/en-us_image_0000001296058044.jpg +.. |image7| image:: /_static/images/en-us_image_0000001349057965.png +.. |image8| image:: /_static/images/en-us_image_0000001349057965.png +.. |image9| image:: /_static/images/en-us_image_0000001349057929.png +.. |image10| image:: /_static/images/en-us_image_0000001349057929.png diff --git a/umn/source/managing_clusters/cluster_overview/cluster_list.rst b/umn/source/managing_clusters/cluster_overview/cluster_list.rst new file mode 100644 index 0000000..c7798ed --- /dev/null +++ b/umn/source/managing_clusters/cluster_overview/cluster_list.rst @@ -0,0 +1,62 @@ +:original_name: en-us_topic_0012799688.html + +.. 
_en-us_topic_0012799688: + +Cluster List +============ + +You can quickly view the status of all clusters and jobs by viewing the dashboard information, and obtain relevant MRS documents from **Overview** in the left navigation pane on the MRS console. + +MRS is easy to use and helps you manage and analyze massive amounts of data. You can create a cluster and add MapReduce, Spark, and Hive jobs to the cluster to analyze and process user data. After the data is processed, you can transmit it to OBS in SSL encryption mode to ensure data integrity and confidentiality. + +Cluster Status +-------------- + +:ref:`Table 1 ` lists the statuses of all MRS clusters after you log in to the MRS management console. + +.. _en-us_topic_0012799688__table164091551415: + +.. table:: **Table 1** Cluster status + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Description | + +===================================+================================================================================================================================================================================================================+ + | Starting | If a cluster is being created, the cluster is in the **Starting** state. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Running | If a cluster is created successfully and all components in the cluster are normal, the cluster is in the **Running** state. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scaling out | If a Core or Task node is being added to a cluster, the cluster is in the **Scaling out** state. | + | | | + | | .. note:: | + | | | + | | If the cluster scale-out fails, you can add nodes to the cluster again. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scaling in | If you stop or delete cluster nodes, change or reinstall the OS of cluster nodes, or modify node specifications, the affected nodes are terminated and the cluster is in the **Scaling in** state. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Abnormal | If some components in a cluster are abnormal, the cluster is **Abnormal**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Terminating | If a cluster node is being terminated, the cluster is in the **Terminating** state.
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Terminated | The cluster has been terminated. This parameter is displayed only in **Cluster History**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scaling up Master node | If the specifications of a master node are being upgraded, the cluster status is **Scaling up Master node**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Job Status +---------- + +:ref:`Table 2 ` describes the status of jobs that you execute after logging in to the MRS management console. + +.. _en-us_topic_0012799688__table792216529274: + +.. table:: **Table 2** Job status + + ========== ============================================================ + Status Description + ========== ============================================================ + Accepted Initial status of a job after it is successfully submitted. + Running A job is being executed. + Completed A job has been executed and completed successfully. + Terminated A job is stopped during execution. + Abnormal An error occurs during job execution or job execution fails. + ========== ============================================================ diff --git a/umn/source/managing_clusters/cluster_overview/index.rst b/umn/source/managing_clusters/cluster_overview/index.rst new file mode 100644 index 0000000..3286020 --- /dev/null +++ b/umn/source/managing_clusters/cluster_overview/index.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0514.html + +.. _mrs_01_0514: + +Cluster Overview +================ + +- :ref:`Cluster List ` +- :ref:`Checking the Cluster Status ` +- :ref:`Viewing Basic Cluster Information ` +- :ref:`Viewing Cluster Patch Information ` +- :ref:`Viewing and Customizing Cluster Monitoring Metrics ` +- :ref:`Managing Components and Monitoring Hosts ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + cluster_list + checking_the_cluster_status + viewing_basic_cluster_information + viewing_cluster_patch_information + viewing_and_customizing_cluster_monitoring_metrics + managing_components_and_monitoring_hosts diff --git a/umn/source/managing_clusters/cluster_overview/managing_components_and_monitoring_hosts.rst b/umn/source/managing_clusters/cluster_overview/managing_components_and_monitoring_hosts.rst new file mode 100644 index 0000000..3d42c24 --- /dev/null +++ b/umn/source/managing_clusters/cluster_overview/managing_components_and_monitoring_hosts.rst @@ -0,0 +1,237 @@ +:original_name: mrs_01_0517.html + +.. _mrs_01_0517: + +Managing Components and Monitoring Hosts +======================================== + +You can manage the following status and metrics of all components (including role instances) and hosts on the MRS console: + +- Status information: includes operation, health, configuration, and role instance status. +- Indicator information: includes key monitoring indicators for each component. +- Export monitoring metrics. (This function is not supported in MRS 3.x or later.) + +.. 
note:: + + - For MRS 1.7.2 or later and MRS 2.x, see :ref:`Managing Services and Monitoring Hosts `. + + - For MRS 3.x or later, see :ref:`Procedure `. + - You can set the interval for automatically refreshing the page or click |image1| to refresh the page immediately. + - Component management supports the following parameter values: + + - Refresh every 30 seconds + - Refresh every 60 seconds + - Stop refreshing + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +.. _mrs_01_0517__section18139102419196: + +Procedure +--------- + +**Managing Components** + +.. note:: + + For details about how to perform operations on MRS Manager, see :ref:`Managing Service Monitoring `. + +#. On the MRS cluster details page, click **Components**. + + On the **Components** tab page, **Service**, **Operating Status**, **Health Status**, **Configuration Status**, **Role**, and **Operation** are displayed in the component list. + + - :ref:`Table 1 ` describes the service operating status. + + .. _mrs_01_0517__table4726131425215: + + .. table:: **Table 1** Service operating status + + +-----------------+------------------------------------------------------------------------+ + | Status | Description | + +=================+========================================================================+ + | Started | The service is started. | + +-----------------+------------------------------------------------------------------------+ + | Stopped | The service is stopped. | + +-----------------+------------------------------------------------------------------------+ + | Failed to start | Failed to start the role instance. | + +-----------------+------------------------------------------------------------------------+ + | Failed to stop | Failed to stop the service. | + +-----------------+------------------------------------------------------------------------+ + | Unknown | Indicates initial service status after the background system restarts. | + +-----------------+------------------------------------------------------------------------+ + + - :ref:`Table 2 ` describes the service health status. + + .. _mrs_01_0517__table1972816146524: + + .. table:: **Table 2** Service health status + + +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Description | + +===================+====================================================================================================================================================================+ + | Good | Indicates that all role instances in the service are running properly. | + +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Faulty | Indicates that the running status of at least one role instance is **Faulty** or the status of the service on which the current service depends is abnormal. | + +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Unknown | Indicates that all role instances in the service are in the **Unknown** state. 
| + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Restoring | Indicates that the background system is restarting the service. | + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Partially Healthy | Indicates that the status of the service on which the service depends is abnormal, and APIs related to the abnormal service cannot be invoked by external systems. | + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + - :ref:`Table 3 ` describes the service configuration status. + + .. _mrs_01_0517__table1172913145524: + + .. table:: **Table 3** Service configuration status + + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Description | + +=======================+==============================================================================================================================================================+ + | Synchronized | The latest configuration takes effect. | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration expired | The latest configuration does not take effect after the parameter modification. Related services need to be restarted. | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration failed | The communication is incorrect or data cannot be read or written during the parameter configuration. Use **Synchronize Configuration** to rectify the fault. | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuring | Parameters are being configured. | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Unknown | Indicates that the configuration status cannot be obtained. | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + By default, the **Service** column is sorted in ascending order. You can click the icon next to **Service**, **Operating Status**, **Health Status**, or **Configuration Status** to change the sorting mode. + +2. Click a specified service in the list to view its status and metric information. +3. Customize and view monitoring graphs. + + a. In the **Charts** area, click **Customize** to customize service monitoring metrics. + b. In the **Period** area, select a time period and click **View** to view the monitoring data within that period. + +**Managing Role Instances** + +.. note:: + + For versions earlier than MRS 3.x, see :ref:`Managing Role Instances `. + +#. 
On the MRS cluster details page, click **Components**. In the component list, click the specified service name. + +#. Click **Instances** to view the role status. + + The role instance list contains the Role, Host Name, Management IP Address, Service IP Address, Rack, Running Status, and Configuration Status of each instance. + + - :ref:`Table 4 ` shows the running status of a role instance. + + .. _mrs_01_0517__table1573318141522: + + .. table:: **Table 4** Role instance running status + + +---------------------+----------------------------------------------------------------------------------------------------------+ + | Status | Description | + +=====================+==========================================================================================================+ + | **Good** | Indicates that the instance is running properly. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Bad** | Indicates that the instance cannot run properly. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Decommissioned** | Indicates that the instance is out of service. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Not started** | Indicates that the instance is stopped. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Unknown** | Indicates that the initial status of the instance cannot be detected. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Starting** | Indicates that the instance is being started. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Stopping** | Indicates that the instance is being stopped. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Restoring** | Indicates that an exception may occur in the instance and the instance is being automatically rectified. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Decommissioning** | Indicates that the instance is being decommissioned. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Recommissioning** | Indicates that the instance is being recommissioned. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Failed to start** | Indicates that the service fails to be started. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + | **Failed to stop** | Indicates that the service fails to be stopped. | + +---------------------+----------------------------------------------------------------------------------------------------------+ + + - :ref:`Table 5 ` shows the configuration status of a role instance. + + .. _mrs_01_0517__table07347145524: + + .. 
table:: **Table 5** Role instance configuration status + + +---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Description | + +===========================+==============================================================================================================================================================+ + | **Synchronized** | The latest configuration takes effect. | + +---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Configuration expired** | The latest configuration does not take effect after the parameter modification. Related services need to be restarted. | + +---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Configuration failed** | The communication is incorrect or data cannot be read or written during the parameter configuration. Use **Synchronize Configuration** to rectify the fault. | + +---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Configuring** | Parameters are being configured. | + +---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Unknown** | Current configuration status cannot be obtained. | + +---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + By default, the **Role** column is sorted in ascending order. You can click the sorting icon next to **Role**, **Host Name**, **OM IP Address**, **Business IP Address**, **Rack**, **Running Status**, or **Configuration Status** to change the sorting mode. + + You can filter out all instances of the same role in the **Role** column. + + You can set search criteria in the role search area by clicking **Advanced Search**, and click **Search** to view specified role information. You can click **Reset** to reset the search criteria. Fuzzy search is supported. + +#. Click the target role instance to view its status and metric information. + +#. Customize and view monitoring graphs. + + a. In the **Charts** area, click **Customize** to customize service monitoring metrics. + b. In **Period** area, select a time of period and click **View** to view the monitoring data within the time period. + +**Managing Hosts** + +.. note:: + + For versions earlier than MRS 3.x, see :ref:`Managing Hosts `. + +#. On the MRS cluster details page, click the **Nodes** tab and expand a node group to view the host status. + + The host list contains the **Node Name**, **IP Address**, **Rack**, **Operating** **Status**, **Health Status**, **CPU Usage**, **Memory Usage**, **Disk Usage**, **Network Speed**, Specification Name, **Specifications** and **AZ**. + + - :ref:`Table 6 ` shows the host operating status. + + .. _mrs_01_0517__table107411314105212: + + .. 
table:: **Table 6** Host operating status + + +----------+-----------------------------------------------------------------------+ + | Status | Description | + +==========+=======================================================================+ + | Normal | The host and service roles on the host are running properly. | + +----------+-----------------------------------------------------------------------+ + | Isolated | The host is isolated, and the service roles on the host stop running. | + +----------+-----------------------------------------------------------------------+ + + - :ref:`Table 7 ` describes the host health status. + + .. _mrs_01_0517__table1774281415526: + + .. table:: **Table 7** Host health status + + +---------+---------------------------------------------------------------------------------------+ + | Status | Description | + +=========+=======================================================================================+ + | Good | The host can properly send heartbeats. | + +---------+---------------------------------------------------------------------------------------+ + | Bad | The host fails to send heartbeats due to timeout. | + +---------+---------------------------------------------------------------------------------------+ + | Unknown | The initial status of the host is unknown while the host is being added or deleted. | + +---------+---------------------------------------------------------------------------------------+ + + The nodes are sorted in ascending order by default. You can click **Node Name**, **IP Address**, **Rack**, **Operating Status**, **Health Status**, **CPU Usage**, **Memory Usage**, **Disk Usage**, **Network Speed**, **Specification Name**, or **Specifications** to change the sorting mode. + +#. Click the target node in the list to view its status and metric information. + +.. |image1| image:: /_static/images/en-us_image_0000001348737925.png diff --git a/umn/source/managing_clusters/cluster_overview/viewing_and_customizing_cluster_monitoring_metrics.rst b/umn/source/managing_clusters/cluster_overview/viewing_and_customizing_cluster_monitoring_metrics.rst new file mode 100644 index 0000000..f6403d6 --- /dev/null +++ b/umn/source/managing_clusters/cluster_overview/viewing_and_customizing_cluster_monitoring_metrics.rst @@ -0,0 +1,106 @@ +:original_name: mrs_01_0515.html + +.. _mrs_01_0515: + +Viewing and Customizing Cluster Monitoring Metrics +================================================== + +MRS cluster nodes are classified into management nodes, control nodes, and data nodes. The change trends of key host monitoring metrics on each type of node can be calculated and displayed as curve charts in reports based on customized periods. If a host belongs to multiple node types, the metric statistics will be collected repeatedly. + +This section provides an overview of MRS clusters and describes how to view, customize, and export node monitoring metrics on MRS Manager. + +.. note:: + + Cluster metrics are monitored periodically. The average historical monitoring interval is about 5 minutes. + +**Method 1** **(applicable to clusters of versions earlier than MRS 3.x):** + +#. Choose **Clusters** > **Active Clusters** and click a cluster name to go to the cluster details page. +#. Click the **Dashboard** tab. You can view the cluster host health status statistics in the lower part of the displayed tab page. +#. 
To view or export reports of other metrics, click **Access Manager** next to **MRS Manager** in the **Basic Information** area to access the Manager page. For details, see :ref:`Accessing Manager `. +#. On the Manager page, view, customize, and export the node monitoring metric report. For details, see :ref:`Dashboard `. + +**Method 2 (applicable to clusters of MRS 1.9.2 to 2.1.0)** + +#. Log in to the MRS console. +#. Choose **Clusters > Active Clusters** and click a cluster name to go to the cluster details page. +#. In the **Basic Information** area on the **Dashboard** tab page, click **Click to synchronize** on the right side of **IAM User Sync** to synchronize IAM users. +#. After the synchronization is complete, you can view the cluster monitoring metric report on the right of the page. +#. In the time range area, specify a period to view monitoring data. The options are as follows: + + - Last 1 hour + - Last 3 hours + - Last 12 hours + - Last 24 hours + - Recent 7 days + - Recent 30 days + - Customize: You can customize the period for viewing monitoring data. + +#. Customize a monitoring metric report. + + a. Click **Customize** and select the monitoring metrics to be displayed. + + MRS supports a maximum of 14 monitoring metrics, but at most 12 customized monitoring metrics can be displayed on the page. + + - Cluster Host Health Status + - Cluster Network Read Speed Statistics + - Host Network Read Speed Distribution + - Host Network Write Speed Distribution + - Cluster Disk Write Speed Statistics + - Cluster Disk Usage Statistics + - Cluster Disk Information + - Host Disk Usage Statistics + - Cluster Disk Read Speed Statistics + - Cluster Memory Usage Statistics + - Host Memory Usage Distribution + - Cluster Network Write Speed Statistics + - Host CPU Usage Distribution + - Cluster CPU Usage Statistics + + b. Click **OK** to save the selected monitoring metrics for display. + + .. note:: + + Click **Clear** to deselect all the selected monitoring metrics in a batch. + +#. Export a monitoring report. + + a. Select a period. The options are as follows: + + - Last 1 hour + - Last 3 hours + - Last 12 hours + - Last 24 hours + - Recent 7 days + - Recent 30 days + - Customize: You can customize the period for viewing monitoring data. + + b. Click **Export**. MRS generates a report about the selected monitoring metrics in the specified time period. Save the report. (A sketch for post-processing an exported report is provided at the end of this section.) + +**Method 3 (applicable to MRS 3.x clusters)** + +#. Log in to the MRS console. +#. Choose **Clusters > Active Clusters** and click a cluster name to go to the cluster details page. +#. In the **Basic Information** area on the **Dashboard** tab page, click **Click to synchronize** on the right side of **IAM User Sync** to synchronize IAM users. +#. After the synchronization is complete, you can view the cluster monitoring metric report on the right of the page. +#. In the time range area, specify a period to view monitoring data. The options are as follows: + + - Last 1 hour + - Last 3 hours + - Last 12 hours + - Last 24 hours + - Recent 7 days + - Recent 30 days + - Customize: You can customize the period for viewing monitoring data. + +#. Customize a monitoring metric report. + + a. Click **Customize** and select the monitoring metrics to be displayed. + + At most 12 customized monitoring metrics can be displayed on the page. + + b. Click **OK** to save the selected monitoring metrics for display. + + .. note:: + + Click **Clear** to deselect all the selected monitoring metrics in a batch.
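+The exported report can also be processed offline. The following is a minimal sketch, not part of the MRS console or its APIs: it assumes the exported report was saved as a CSV file with a header row, and the file name and metric column name used here are placeholders to be replaced with the actual headers of your export.
+
+.. code-block:: python
+
+   # Hypothetical post-processing of an exported monitoring report.
+   # "cluster_metrics_last_24h.csv" and "cluster_cpu_usage" are assumed,
+   # illustrative names; check the header row of your own exported file.
+   import csv
+   from statistics import mean
+
+   def average_metric(report_path, metric_column):
+       """Return the average of one numeric metric column in an exported report."""
+       values = []
+       with open(report_path, newline="") as f:
+           for row in csv.DictReader(f):
+               try:
+                   values.append(float(row[metric_column]))
+               except (KeyError, ValueError):
+                   continue  # skip rows without a numeric value for this metric
+       if not values:
+           raise ValueError("no numeric data found for column " + metric_column)
+       return mean(values)
+
+   print(average_metric("cluster_metrics_last_24h.csv", "cluster_cpu_usage"))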
diff --git a/umn/source/managing_clusters/cluster_overview/viewing_basic_cluster_information.rst b/umn/source/managing_clusters/cluster_overview/viewing_basic_cluster_information.rst new file mode 100644 index 0000000..9feeb43 --- /dev/null +++ b/umn/source/managing_clusters/cluster_overview/viewing_basic_cluster_information.rst @@ -0,0 +1,177 @@ +:original_name: en-us_topic_0012808231.html + +.. _en-us_topic_0012808231: + +Viewing Basic Cluster Information +================================= + +You can monitor and manage the clusters you have created. Choose **Clusters > Active Clusters**. Select a cluster and click its name to go to the cluster details page. On the displayed page, view the basic configuration and node information of the cluster. + +.. note:: + + On the MRS console, operations performed on an ECS cluster are basically the same as those performed on a BMS cluster. This document describes operations on an ECS cluster. If operations on the two clusters differ, the operations will be described separately. + +On the cluster details page, click **Dashboard**. :ref:`Table 1 ` describes the parameters on the **Dashboard** tab page. + +.. _en-us_topic_0012808231__table664643516164: + +.. table:: **Table 1** Basic cluster information + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=========================================================================================================================================================================================================================================================================================================+ + | Cluster Name | Name of a cluster. Set this parameter when creating a cluster. Click |image1| to change the cluster name. | + | | | + | | For versions earlier than MRS 3.x, only the cluster name displayed on the MRS management console is changed, while the cluster name on MRS Manager is not changed synchronously. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Status | Cluster status. For details, see :ref:`Table 1 `. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | MRS Manager | Portal for the Manager page. | + | | | + | | - For MRS 3.\ *x* or later, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. | + | | - For versions from MRS 1.9.2 to MRS 2.1.0, you need to bind an EIP and add a security group rule as prompted before accessing the MRS Manager page. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. | + | | - For versions earlier than MRS 1.9.2, you can access MRS Manager from normal clusters. 
For details about how to access MRS Manager from security clusters, see :ref:`Access Using a Windows ECS `. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Version | MRS version information. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster ID | Unique identifier of a cluster, which is automatically assigned when a cluster is created. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Created | Time when a cluster is created. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | AZ | Availability zone (AZ) in the region of a cluster, which is set when a cluster is created. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Subnet | Subnet selected during cluster creation. | + | | | + | | If the subnet IP addresses are insufficient, click **Change Subnet** to switch to another subnet in the same VPC of the current cluster to obtain more available subnet IP addresses. Changing a subnet does not affect the IP addresses and subnets of existing nodes. | + | | | + | | A subnet provides dedicated network resources that are isolated from other networks, improving network security. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | VPC | VPC selected during cluster creation. | + | | | + | | A VPC is a secure, isolated, and logical network environment. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Elastic IP (EIP) | After binding an EIP to an MRS cluster, you can use the EIP to access the Manager web UI of the cluster. | + | | | + | | .. 
note:: | + | | | + | | This parameter is valid only in MRS 1.9.2 or later. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | OBS Permission Control | Click **Manage** and modify the mapping between MRS users and OBS permissions. For details, see :ref:`Configuring Fine-Grained Permissions for MRS Multi-User Access to OBS `. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Connection | Click **Manage** to view the data connection type associated with the cluster. For details, see :ref:`Configuring Data Connections `. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Agency | Click **Manage Agency** to bind or modify an agency for the cluster. | + | | | + | | An agency allows ECS or BMS to manage MRS resources. You can configure an agency of the ECS type to automatically obtain the AK/SK to access OBS. For details, see :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) `. | + | | | + | | The **MRS_ECS_DEFAULT_AGENCY** agency has the OBS OperateAccess permission of OBS and the CES FullAccess (for users who have enabled fine-grained policies), CES Administrator, and KMS Administrator permissions in the region where the cluster is located. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cluster Manager IP Address | Floating IP address for accessing Manager. | + | | | + | | .. note:: | + | | | + | | - The cluster manager IP address is displayed on the **Basic Information** page of the cluster with Kerberos authentication enabled instead of the cluster with Kerberos authentication disabled. | + | | - This parameter is valid only in versions earlier than MRS 1.9.2. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Key Pair | Name of a key pair. Set this parameter when creating a cluster. | + | | | + | | If the login mode is set to password during cluster creation, this parameter is not displayed. 
| + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Kerberos Authentication | Whether to enable Kerberos authentication when logging in to Manager. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Logging | Used to collect logs about cluster creation and scaling failures. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Security Group | Security group name of the cluster. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Disk Key Name | Name of the key used to encrypt data disks. To manage the used keys, log in to the key management console. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Disk Key ID | ID of the key used to encrypt data disks. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | IAM User Synchronization | IAM user information can be synchronized to an MRS cluster for cluster management. For details, see :ref:`Synchronizing IAM Users to MRS `. | + | | | + | | .. note:: | + | | | + | | The **Components**, **Tenants**, and **Backups & Restorations** tab pages on the cluster details page can be used only after users are synchronized. After clusters of MRS 3.x are synchronized, you can use the **Component Management** function. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Secure Communications | Used to display the security authorization status. You can click |image2| to enable or disable security authorization. Disabling security authorization brings high risks. Exercise caution when performing this operation. 
For details, see :ref:`Communication Security Authorization `. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. table:: **Table 2** Component versions + + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +======================+========================================================================================================================+ + | Hadoop Version | Displays the Hadoop version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Spark Version | Version of the Spark component. Only clusters of versions earlier than MRS 3.x support this parameter. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | HBase Version | Displays the HBase version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Hive Version | Displays the Hive version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Hue Version | Displays the Hue version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Loader Version | Displays the Loader version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Kafka Version | Displays the Kafka version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Storm Version | Displays the Storm version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Flume Version | Displays the Flume version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Tez Version | Displays the Tez version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Presto Version | Displays the Presto version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | KafkaManager Version | Displays the KafkaManager version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | OpenTSDB Version | Displays the OpenTSDB version information. 
| + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Flink Version | Displays the Flink version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Ranger Version | Displays the Ranger version information. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Spark2x Version | Displays the version information about the Spark2x component. Only clusters of MRS 3.x or later support this function. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | Oozie Version | Displays the Oozie version information. Only clusters of MRS 3.x or later support this function. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + | ClickHouse Version | Displays ClickHouse version information. Only clusters of MRS 3.x or later support this function. | + +----------------------+------------------------------------------------------------------------------------------------------------------------+ + +On the cluster details page, click **Nodes**. For details about the node parameters, see :ref:`Table 3 `. + +.. _en-us_topic_0012808231__table41983890161732: + +.. table:: **Table 3** Node information + + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===============================================================================================================================================================================================================================+ + | Configure Task Node | Used to add a Task node. For details, see :ref:`Adding a Task Node `. | + | | | + | | For 3.x and later versions, this operation applies only to the analysis cluster, streaming cluster, and hybrid cluster. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Add Node Group | This parameter applies only to 3.x and later versions. It applies to customized clusters only and is used to add node groups. For details, see :ref:`Adding a Node Group `. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Node Group | Node group name. 
| +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Node Type | Node type: | + | | | + | | - **Master**: A Master node in an MRS cluster manages the cluster, assigns MapReduce executable files to Core nodes, traces the execution status of each job, and monitors the DataNode running status. | + | | | + | | - A task node group is a group of nodes on which only roles that do not store data are deployed. The roles include NodeManager, ThriftServer, Thrift1Server, RESTServer, Supervisor, LogViewer, HBaseIndexer, and TagSync. | + | | - If other roles are deployed in the node group in addition to the preceding roles, the node group is the Core node group. | + | | | + | | On the **Nodes** tab page, click |image3| next to a node group name to unfold the nodes contained in the node group. For details about the parameters, see :ref:`Managing Components and Monitoring Hosts `. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Node Count | Number of nodes in a node group. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | - **Scale Out**: For details, see :ref:`Manually Scaling Out a Cluster `. | + | | - **Scale In**: For details, see :ref:`Manually Scaling In a Cluster `. | + | | - **Auto Scaling**: For details, see :ref:`Configuring an Auto Scaling Rule `. | + | | - **View Roles**: You can view information about roles deployed on the node group. This function applies only to custom clusters of 3.x and later. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001296057872.png +.. |image2| image:: /_static/images/en-us_image_0000001296217500.png +.. |image3| image:: /_static/images/en-us_image_0000001349257165.png diff --git a/umn/source/managing_clusters/cluster_overview/viewing_cluster_patch_information.rst b/umn/source/managing_clusters/cluster_overview/viewing_cluster_patch_information.rst new file mode 100644 index 0000000..c29830e --- /dev/null +++ b/umn/source/managing_clusters/cluster_overview/viewing_cluster_patch_information.rst @@ -0,0 +1,26 @@ +:original_name: mrs_01_0036.html + +.. _mrs_01_0036: + +Viewing Cluster Patch Information +================================= + +You can view patch information about cluster components and download the required patch if a cluster component, such as Hadoop or Spark, is faulty. On the MRS console, choose **Clusters > Active Clusters**, select a cluster, and click the cluster name. On the cluster details page that is displayed, upgrade the component to rectify the fault. + +.. note:: + + MRS 3.x clusters do not provide patch version information, so this section does not apply to them.
+ +For clusters of versions earlier than MRS 1.7.0, the patch version information is as follows: + +- Patch Name: name of the patch package +- Patch Path: location where the patch is stored in OBS +- Patch Content: patch description + +The patch version information for MRS 1.7.0 to 2.x.x is as follows: + +- Patch Name: name of the patch package +- Published: time when the patch package is released +- Status: patch status +- Patch Description: patch version description +- Operation: patch installation or uninstallation diff --git a/umn/source/managing_clusters/component_management/configuring_customized_service_parameters.rst b/umn/source/managing_clusters/component_management/configuring_customized_service_parameters.rst new file mode 100644 index 0000000..f6284b6 --- /dev/null +++ b/umn/source/managing_clusters/component_management/configuring_customized_service_parameters.rst @@ -0,0 +1,76 @@ +:original_name: mrs_01_0205.html + +.. _mrs_01_0205: + +Configuring Customized Service Parameters +========================================= + +Each component of MRS supports all open-source parameters. MRS supports the modification of some parameters for key application scenarios. Some component clients may not include all parameters with open-source features. To modify the component parameters that are not directly supported by MRS, you can add new parameters for components by using the configuration customization function on MRS. Newly added parameters are saved in component configuration files and take effect after restart. + +Impact on the System +-------------------- + +- After the service attributes are configured, the service needs to be restarted. The service cannot be accessed during restart. +- You need to download and update the client configuration files after configuring HBase, HDFS, Hive, Spark, Yarn, and MapReduce service properties. + +Prerequisites +------------- + +- You have understood the meanings of parameters to be added, configuration files that have taken effect, and the impact on components. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS cluster details page, click **Components**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Configuring Customized Service Parameters `. + +#. Select the target service from the service list. + +#. Click **Service Configuration**. + +#. In the configuration type drop-down box on the right side, switch **Basic** to **All**. + +#. In the navigation tree, select **Customization**. The customized parameters of the current component are displayed on MRS. + + The configuration files that save the newly added customized parameters are displayed in the **Parameter File** column. Different configuration files may have same open-source parameters. After the parameters in different files are set to different values, whether the configuration takes effect depends on the loading sequence of the configuration files by components. You can customize parameters for services and roles as required. Adding customized parameters for a single role instance is not supported. + +#. Based on the configuration files and parameter functions, locate the row where a specified parameter resides, enter the parameter name supported by the component in the **Parameter** column and enter the parameter value in the **Value** column. + + - You can click |image1| or |image2| to add or delete a customized parameter. 
You can delete a customized parameter only after you have clicked |image3| to add a parameter for the first time. + - If you want to cancel the modification of a parameter value, click |image4| to restore it. + +#. Click **Save Configuration**, select **Restart the affected services or instances**, and click **OK**. + +Task Example +------------ + +**Configuring Customized Hive Parameters** + +Hive depends on HDFS. By default, Hive accesses HDFS through the HDFS client, and the effective configuration parameters are controlled by HDFS in a unified manner. For example, the HDFS parameter **ipc.client.rpc.timeout** affects the RPC timeout period for all clients that connect to the HDFS server. If you need to modify the timeout period for Hive to connect to HDFS, you can use the configuration customization function. After this parameter is added to the **core-site.xml** file of Hive, it can be identified by the Hive service, and its value overrides the parameter configuration in HDFS. + +#. On the MRS cluster details page, click **Components**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Task Example `. + +#. Choose **Hive** > **Service Configuration**. + +#. In the configuration type drop-down box on the right side, switch **Basic** to **All**. + +#. In the navigation tree on the left, select **Customization** for the Hive service. The system displays the customized service parameters supported by Hive. + +#. In **core-site.xml**, locate the row that contains the **core.site.customized.configs** parameter, enter **ipc.client.rpc.timeout** in the **Parameter** column, and enter a new value in the **Value** column, for example, **150000**. The unit is milliseconds. + +#. Click **Save Configuration**, select **Restart the affected services or instances**, and click **OK**. + + **Operation successful** is displayed. Click **Finish**. The service is started successfully.
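+
+After the restart, you can check on a client node whether the customized value has been delivered to the Hive client configuration. The following minimal Java sketch uses the standard Hadoop ``Configuration`` API for this check; the file path is only an example and must be replaced with the actual location of **core-site.xml** in the client configuration downloaded from MRS.
+
+.. code-block:: java
+
+   import org.apache.hadoop.conf.Configuration;
+   import org.apache.hadoop.fs.Path;
+
+   public class CheckCustomizedTimeout {
+       public static void main(String[] args) {
+           // Start from an empty configuration so that only the downloaded file is read.
+           Configuration conf = new Configuration(false);
+           // Example path; point it to the core-site.xml of the downloaded Hive client configuration.
+           conf.addResource(new Path("/opt/client/Hive/config/core-site.xml"));
+           // Prints 150000 if the customized ipc.client.rpc.timeout value has taken effect.
+           System.out.println(conf.get("ipc.client.rpc.timeout", "<not set>"));
+       }
+   }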
.. |image1| image:: /_static/images/en-us_image_0000001349137749.png +.. |image2| image:: /_static/images/en-us_image_0000001297278204.png +.. |image3| image:: /_static/images/en-us_image_0000001349057853.png +.. |image4| image:: /_static/images/en-us_image_0000001295738244.png diff --git a/umn/source/managing_clusters/component_management/configuring_role_instance_parameters.rst b/umn/source/managing_clusters/component_management/configuring_role_instance_parameters.rst new file mode 100644 index 0000000..9e327c3 --- /dev/null +++ b/umn/source/managing_clusters/component_management/configuring_role_instance_parameters.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_0208.html + +.. _mrs_01_0208: + +Configuring Role Instance Parameters +==================================== + +Scenario +-------- + +You can view and modify default role instance configurations on MRS based on site requirements. The configurations can be imported and exported. + +Impact on the System +-------------------- + +You need to download and update the client configuration files after configuring HBase, HDFS, Hive, Spark, Yarn, and MapReduce service properties. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Modifying Role Instance Parameters +---------------------------------- + +#. On the MRS cluster details page, click **Components**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Configuring Role Instance Parameters `. + +#. Select the target service from the service list. + +#. Click the **Instances** tab. + +#. Click the target role instance from the role instance list. + +#. Click the **Instance Configuration** tab. + +#. Switch **Basic** to **All** from the drop-down list on the right of the page. All configuration parameters of the role instance are displayed in the navigation tree. + +#. In the navigation tree, select a specified parameter and change its value. You can also enter the parameter name in the **Search** box to search for the parameter and view the result. + + If you want to cancel the modification of a parameter value, click |image1| to restore it. + +#. Click **Save Configuration**, select **Restart the affected services or instances**, and click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0000001348737945.png diff --git a/umn/source/managing_clusters/component_management/configuring_service_parameters.rst b/umn/source/managing_clusters/component_management/configuring_service_parameters.rst new file mode 100644 index 0000000..85657d3 --- /dev/null +++ b/umn/source/managing_clusters/component_management/configuring_service_parameters.rst @@ -0,0 +1,46 @@ +:original_name: mrs_01_0204.html + +.. _mrs_01_0204: + +Configuring Service Parameters +============================== + +On the MRS console, you can view and modify the default service configurations based on site requirements and export or import the configurations. + +Impact on the System +-------------------- + +- You need to download and update the client configuration files after configuring HBase, HDFS, Hive, Spark, Yarn, and MapReduce service properties. +- The parameters of DBService cannot be modified when only one DBService role instance exists in the cluster. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Modifying Service Parameters +---------------------------- + +#. On the MRS cluster details page, click **Components**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Configuring Service Parameters `. + +#. Select the target service from the service list. + +#. Click **Service Configuration**. + +#. Switch **Basic** to **All**. All configuration parameters of the service are displayed in the navigation tree. The service name and role names are displayed from upper to lower in the navigation tree. + +#. In the navigation tree, select a specified parameter and change its value. You can also enter the parameter name in the **Search** box to search for the parameter and view the result. + + If you want to cancel the modification of a parameter value, click |image1| to restore it. + +#. Click **Save Configuration**, select **Restart the affected services or instances**, and click **OK**. + + .. note:: + + To update the Yarn queue configuration without restarting the service, choose **More** > **Refresh Queue** on the **Service Status** tab page so that the new queue configuration takes effect. + +.. |image1| image:: /_static/images/en-us_image_0000001348737945.png diff --git a/umn/source/managing_clusters/component_management/decommissioning_and_recommissioning_a_role_instance.rst b/umn/source/managing_clusters/component_management/decommissioning_and_recommissioning_a_role_instance.rst new file mode 100644 index 0000000..bb2162e --- /dev/null +++ b/umn/source/managing_clusters/component_management/decommissioning_and_recommissioning_a_role_instance.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_0210.html + +.. 
_mrs_01_0210: + +Decommissioning and Recommissioning a Role Instance +=================================================== + +Scenario +-------- + +If a Core or Task node is faulty, the cluster status may be displayed as **Abnormal**. In an MRS cluster, data can be stored on different Core nodes. You can decommission the specified role instance on MRS to stop the role instance from providing services. After fault rectification, you can recommission the role instance. + +The following role instances can be decommissioned or recommissioned: + +- DataNode role instance on HDFS +- NodeManager role instance on Yarn +- RegionServer role instance on HBase +- ClickHouseServer role instance on ClickHouse +- Broker role instance on Kafka + +Restrictions: + +- If the number of DataNodes is less than or equal to the number of HDFS copies, decommissioning cannot be performed. For example, if the number of HDFS copies is three and there are fewer than four DataNodes in the system, decommissioning cannot be performed. In this case, an error is reported and MRS is forced to exit the decommissioning 30 minutes after the decommissioning attempt starts. +- If the number of Kafka Broker instances is less than or equal to the number of Kafka copies, decommissioning cannot be performed. For example, if the number of Kafka copies is two and there are fewer than three Broker instances in the system, decommissioning cannot be performed, and the instance decommissioning fails and exits. +- If a role instance has been decommissioned, you must recommission it to start the instance before it can be used again. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS cluster details page, click **Components**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Decommissioning and Recommissioning a Role Instance `. + +#. Click a service in the service list. +#. Click the **Instances** tab. +#. Select an instance. +#. Choose **More** > **Decommission** or **Recommission** to perform the corresponding operation. + + .. note:: + + During the instance decommissioning, if the service corresponding to the instance is restarted in the cluster using another browser, MRS displays a message indicating that the instance decommissioning is stopped, but the **Operating Status** of the instance is displayed as **Started**. In this case, the instance has been decommissioned in the background. You need to decommission the instance again to synchronize the operating status. diff --git a/umn/source/managing_clusters/component_management/exporting_cluster_configuration.rst b/umn/source/managing_clusters/component_management/exporting_cluster_configuration.rst new file mode 100644 index 0000000..58b4890 --- /dev/null +++ b/umn/source/managing_clusters/component_management/exporting_cluster_configuration.rst @@ -0,0 +1,31 @@ +:original_name: mrs_01_0216.html + +.. _mrs_01_0216: + +Exporting Cluster Configuration +=============================== + +Scenario +-------- + +You can export all configuration data of a cluster on MRS to meet site requirements. The exported configuration data is used to rapidly update service configurations. + +.. note:: + + In **MRS 3.x**, you cannot perform operations in this section on the management console. + +Prerequisites +------------- + +You have synchronized IAM users. 
(On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +On the cluster details page, choose **Configuration** > **Export Cluster Configuration** in the upper right corner. + +.. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Exporting Configuration Data of a Cluster `. + +The exported file is used to update service configurations. For details, see **Importing Service Configuration Parameters** in :ref:`Configuring Service Parameters `. diff --git a/umn/source/managing_clusters/component_management/index.rst b/umn/source/managing_clusters/component_management/index.rst new file mode 100644 index 0000000..61ca12f --- /dev/null +++ b/umn/source/managing_clusters/component_management/index.rst @@ -0,0 +1,40 @@ +:original_name: mrs_01_0200.html + +.. _mrs_01_0200: + +Component Management +==================== + +- :ref:`Object Management ` +- :ref:`Viewing Configuration ` +- :ref:`Managing Services ` +- :ref:`Configuring Service Parameters ` +- :ref:`Configuring Customized Service Parameters ` +- :ref:`Synchronizing Service Configuration ` +- :ref:`Managing Role Instances ` +- :ref:`Configuring Role Instance Parameters ` +- :ref:`Synchronizing Role Instance Configuration ` +- :ref:`Decommissioning and Recommissioning a Role Instance ` +- :ref:`Starting and Stopping a Cluster ` +- :ref:`Synchronizing Cluster Configuration ` +- :ref:`Exporting Cluster Configuration ` +- :ref:`Performing Rolling Restart ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + object_management + viewing_configuration + managing_services + configuring_service_parameters + configuring_customized_service_parameters + synchronizing_service_configuration + managing_role_instances + configuring_role_instance_parameters + synchronizing_role_instance_configuration + decommissioning_and_recommissioning_a_role_instance + starting_and_stopping_a_cluster + synchronizing_cluster_configuration + exporting_cluster_configuration + performing_rolling_restart diff --git a/umn/source/managing_clusters/component_management/managing_role_instances.rst b/umn/source/managing_clusters/component_management/managing_role_instances.rst new file mode 100644 index 0000000..91654d7 --- /dev/null +++ b/umn/source/managing_clusters/component_management/managing_role_instances.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_0207.html + +.. _mrs_01_0207: + +Managing Role Instances +======================= + +Scenario +-------- + +You can start a role instance that is in the **Stopped**, **Failed to stop** or **Failed to start** status, stop an unused or abnormal role instance or restart an abnormal role instance to recover its functions. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS cluster details page, click **Components**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Managing Role Instances `. + +#. Select the target service from the service list. +#. Click the **Instances** tab. +#. Select the check box on the left of the target role instance. +#. Click **More**, select operations such as **Start Instance**, **Stop Instance**, **Restart Instance**, **Rolling-restart Instance**, or **Delete Instance** based on site requirements. 
diff --git a/umn/source/managing_clusters/component_management/managing_services.rst b/umn/source/managing_clusters/component_management/managing_services.rst new file mode 100644 index 0000000..beb58fc --- /dev/null +++ b/umn/source/managing_clusters/component_management/managing_services.rst @@ -0,0 +1,37 @@ +:original_name: mrs_01_0203.html + +.. _mrs_01_0203: + +Managing Services +================= + +You can perform the following operations on MRS: + +- Start the service in the **Stopped**, **Stop Failed**, or **Failed to Start** state to use the service. +- Stop the services or stop abnormal services. +- Restart abnormal services or configure expired services to restore or enable the services. + +Prerequisites +------------- + +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Impact on the System +-------------------- + +- The stateful component cannot be added to the task node group. + +Starting, Stopping, and Restarting a Service +-------------------------------------------- + +#. On the MRS cluster details page, click **Components**. + +#. Locate the row that contains the target service, **Start**, **Stop**, and **Restart** to start, stop, or restart the service. + + Services are interrelated. If a service is started, stopped, and restarted, services dependent on it will be affected. + + The services will be affected in the following ways: + + - If a service is to be started, the lower-layer services dependent on it must be started first. + - If a service is stopped, the upper-layer services dependent on it are unavailable. + - If a service is restarted, the running upper-layer services dependent on it must be restarted. diff --git a/umn/source/managing_clusters/component_management/object_management.rst b/umn/source/managing_clusters/component_management/object_management.rst new file mode 100644 index 0000000..f303a20 --- /dev/null +++ b/umn/source/managing_clusters/component_management/object_management.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_0201.html + +.. _mrs_01_0201: + +Object Management +================= + +MRS contains different types of basic objects. :ref:`Table 1 ` describes these objects. + +.. _mrs_01_0201__table23400575171145: + +.. table:: **Table 1** MRS basic object overview + + +------------------+-------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ + | Object | Description | Example | + +==================+===============================================================================+==================================================================================================================+ + | Service | Function set that can complete specific business. | KrbServer service and LdapServer service | + +------------------+-------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ + | Service instance | Specific instance of a service, usually called service. 
| KrbServer service | + +------------------+-------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ + | Service role | Function entity that forms a complete service, usually called role. | KrbServer is composed of the KerberosAdmin role and KerberosServer role. | + +------------------+-------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ + | Role instance | Specific instance of a service role running on a host. | KerberosAdmin that is running on Host2 and KerberosServer that is running on Host3 | + +------------------+-------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ + | Host | An ECS running Linux OS. | Host1 to Host5 | + +------------------+-------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ + | Rack | Physical entity that contains multiple hosts connecting to the same switch. | Rack1 contains Host1 to Host5. | + +------------------+-------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ + | Cluster | Logical entity that consists of multiple hosts and provides various services. | Cluster1 cluster consists of five hosts (Host1 to Host5) and provides services such as KrbServer and LdapServer. | + +------------------+-------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/managing_clusters/component_management/performing_rolling_restart.rst b/umn/source/managing_clusters/component_management/performing_rolling_restart.rst new file mode 100644 index 0000000..e610500 --- /dev/null +++ b/umn/source/managing_clusters/component_management/performing_rolling_restart.rst @@ -0,0 +1,147 @@ +:original_name: mrs_01_0628.html + +.. _mrs_01_0628: + +Performing Rolling Restart +========================== + +After modifying the configuration items of a big data component, you need to restart the corresponding service to make new configurations take effect. If you use a normal restart mode, all services or instances are restarted concurrently, which may cause service interruption. To ensure that services are not affected during service restart, you can restart services or instances in batches by rolling restart. For instances in active/standby mode, a standby instance is restarted first and then an active instance is restarted. Rolling restart takes longer than normal restart. + +:ref:`Table 1 ` provides services and instances that support or do not support rolling restart in the MRS cluster. + +.. _mrs_01_0628__en-us_topic_0173397702_table054720341161: + +.. 
table:: **Table 1** Services and instances that support or do not support rolling restart + + ========= ================ ================================== + Service Instance Whether to Support Rolling Restart + ========= ================ ================================== + HDFS NameNode Yes + \ Zkfc + \ JournalNode + \ HttpFS + \ DataNode + Yarn ResourceManager Yes + \ NodeManager + Hive MetaStore Yes + \ WebHCat + \ HiveServer + Mapreduce JobHistoryServer Yes + HBase HMaster Yes + \ RegionServer + \ ThriftServer + \ RESTServer + Spark JobHistory Yes + \ JDBCServer + \ SparkResource No + Hue Hue No + Tez TezUI No + Loader Sqoop No + Zookeeper Quorumpeer Yes + Kafka Broker Yes + \ MirrorMaker No + Flume Flume Yes + \ MonitorServer + Storm Nimbus Yes + \ UI + \ Supervisor + \ Logviewer + ========= ================ ================================== + +Restrictions +------------ + +- Perform a rolling restart during off-peak hours. + + - Otherwise, a rolling restart failure may occur. For example, if the throughput of Kafka is high (over 100 MB/s) during the Kafka rolling restart, the Kafka rolling restart may fail. + - For example, if the requests per second of each RegionServer on the native interface exceed 10,000 during the HBase rolling restart, you need to increase the number of handles to prevent a RegionServer restart failure caused by heavy loads during the restart. + +- Before the restart, check the number of current requests of HBase. If the number of requests of each RegionServer on the native interface exceeds 10,000, increase the number of handles to prevent a failure. +- If the number of Core nodes in a cluster is less than six, services may be affected for a short period of time. +- Preferentially perform a rolling instance or service restart and select **Only restart instances whose configurations have expired**. + +Performing a Rolling Service Restart +------------------------------------ + +#. Choose **Clusters** > **Active Clusters** and click a cluster name to go to the cluster details page. +#. Click **Components** and select a service for which you want to perform a rolling restart. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Performing a Rolling Service Restart `. + +#. On the **Service Status** tab page, click **More** and select **Rolling-restart Service**. +#. The **Rolling-restart Service** page is displayed. Select **Only restart instances whose configurations have expired** and click **OK** to perform rolling restart for the service. +#. After the rolling restart task is complete, click **Finish**. + +Performing a Rolling Instance Restart +------------------------------------- + +#. Choose **Clusters** > **Active Clusters** and click a cluster name to go to the cluster details page. +#. Click **Components** and select a service for which you want to perform a rolling restart. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Performing a Rolling Instance Restart `. + +#. On the **Instance** tab page, select the instance to be restarted. Click **More** and select **Rolling-restart Instance**. +#. After you enter the administrator password, the **Rolling-restart Instance** page is displayed. Select **Only restart instances whose configurations have expired** and click **OK** to perform rolling restart for the instance. +#. After the rolling restart task is complete, click **Finish**. + +Perform a Rolling Cluster Restart +--------------------------------- + +#. 
Choose **Clusters** > **Active Clusters** and click a cluster name to go to the cluster details page. +#. In the upper right corner of the page, choose **Management Operations** > **Perform Rolling Cluster Restart**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Perform a Rolling Cluster Restart `. + +#. The **Rolling-restart Cluster** page is displayed. Select **Only restart instances whose configurations have expired** and click **OK** to perform rolling restart for the cluster. +#. After the rolling restart task is complete, click **Finish**. + +Rolling Restart Parameter Description +------------------------------------- + +:ref:`Table 2 ` describes rolling restart parameters. + +.. _mrs_01_0628__table817615121520: + +.. table:: **Table 2** Rolling restart parameter description + + +----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +==========================================================+================================================================================================================================================================================================================================================================================+ + | Only restart instances whose configurations have expired | Specifies whether to restart only the modified instances in a cluster. | + +----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Node Instances to Be Batch Restarted | Specifies the number of instances that are restarted in each batch when the batch rolling restart strategy is used. The default value is **1**. The value ranges from 1 to 20. This parameter is valid only for data nodes. | + +----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Batch Interval | Specifies the interval between two batches of instances for rolling restart. The default value is **0**. The value ranges from 0 to 2147483647. The unit is second. | + | | | + | | Note: Setting the batch interval parameter can increase the stability of the big data component process during the rolling restart. You are advised to set this parameter to a non-default value, for example, 10. | + +----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Batch Fault Tolerance Threshold | Specifies the tolerance times when the rolling restart of instances fails to be executed in batches. 
The default value is **0**, which indicates that the rolling restart task ends after any batch of instances fails to be restarted. The value ranges from 0 to 2147483647. | + +----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Procedure in a Typical Scenario +------------------------------- + +#. Choose **Clusters** > **Active Clusters** and click a cluster name to go to the cluster details page. +#. Click **Components** and select **HBase**. The **HBase** service page is displayed. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Procedure in a Typical Scenario `. + +#. Click the **Service Configuration** tab, and modify an HBase parameter. After the following dialog box is displayed, click **OK** to save the configurations. + + .. note:: + + Do not select **Restart the affected services or instances**. This option indicates a normal restart. If you select this option, all services or instances will be restarted, which may cause service interruption. + +#. After saving the configurations, click **Finish**. +#. Click the **Service Status** tab. +#. On the **Service Status** tab page, click **More** and select **Rolling-restart Service**. +#. After you enter the administrator password, the **Rolling-restart Service** page is displayed. Select **Only restart instances whose configurations have expired** and click **OK** to perform rolling restart. +#. After the rolling restart task is complete, click **Finish**. diff --git a/umn/source/managing_clusters/component_management/starting_and_stopping_a_cluster.rst b/umn/source/managing_clusters/component_management/starting_and_stopping_a_cluster.rst new file mode 100644 index 0000000..779a6ce --- /dev/null +++ b/umn/source/managing_clusters/component_management/starting_and_stopping_a_cluster.rst @@ -0,0 +1,22 @@ +:original_name: mrs_01_0214.html + +.. _mrs_01_0214: + +Starting and Stopping a Cluster +=============================== + +A cluster is a collection of service components. You can start or stop all services in a cluster. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +On the cluster details page, choose **Management Operations** > **Start All Components** or **Stop All Components** in the upper right corner to perform the required operation. + +.. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Starting or Stopping a Cluster `. diff --git a/umn/source/managing_clusters/component_management/synchronizing_cluster_configuration.rst b/umn/source/managing_clusters/component_management/synchronizing_cluster_configuration.rst new file mode 100644 index 0000000..80a9d4b --- /dev/null +++ b/umn/source/managing_clusters/component_management/synchronizing_cluster_configuration.rst @@ -0,0 +1,41 @@ +:original_name: mrs_01_0215.html + +.. _mrs_01_0215: + +Synchronizing Cluster Configuration +=================================== + +Scenario +-------- + +If **Configuration Status** of all services or some services is **Configuration expired** or **Configuration failed**, synchronize configuration for the cluster or service to restore its configuration status. 
+ +- If all services in the cluster are in the **Configuration failed** status, synchronize the cluster configuration with the background configuration. +- If some services in the cluster are in the **Configuration failed** status, synchronize the service configuration with the background configuration. + + .. note:: + + In **MRS 3.x**, you cannot perform operations in this section on the management console. + +Impact on the System +-------------------- + +After synchronizing cluster configurations, you need to restart the services whose configurations have expired. These services are unavailable during restart. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the cluster details page, choose **Configuration** > **Synchronize Configuration** in the upper right corner. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Synchronizing Cluster Configurations `. + +#. In the displayed dialog box, select **Restart services and instances whose configuration have expired**, and click **OK** to restart the services whose configurations have expired. + + When **Operation successful** is displayed, click **Finish**. The cluster configuration is synchronized successfully. diff --git a/umn/source/managing_clusters/component_management/synchronizing_role_instance_configuration.rst b/umn/source/managing_clusters/component_management/synchronizing_role_instance_configuration.rst new file mode 100644 index 0000000..e86cc1e --- /dev/null +++ b/umn/source/managing_clusters/component_management/synchronizing_role_instance_configuration.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_0209.html + +.. _mrs_01_0209: + +Synchronizing Role Instance Configuration +========================================= + +Scenario +-------- + +When **Configuration Status** of a role instance is **Configuration expired** or **Configuration failed**, you can synchronize the configuration data of the role instance with the background configuration. + +Impact on the System +-------------------- + +After synchronizing a role instance configuration, you need to restart the role instance whose configuration has expired. The role instance is unavailable during restart. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS cluster details page, click **Components**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Synchronizing Role Instance Configuration `. + +#. Select a service name. +#. Click the **Instances** tab. +#. Click the target role instance from the role instance list. +#. Choose **More** > **Synchronize Configuration** above the role instance status and indicator information. +#. In the dialog box that is displayed, select **Restart the service or instances whose configurations have expired** and click **Yes** to restart the role instance. diff --git a/umn/source/managing_clusters/component_management/synchronizing_service_configuration.rst b/umn/source/managing_clusters/component_management/synchronizing_service_configuration.rst new file mode 100644 index 0000000..f735cbf --- /dev/null +++ b/umn/source/managing_clusters/component_management/synchronizing_service_configuration.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_0206.html + +.. 
_mrs_01_0206: + +Synchronizing Service Configuration +=================================== + +Scenario +-------- + +If **Configuration Status** of some services is **Configuration expired** or **Configuration failed**, synchronize configuration for the cluster or service to restore its configuration status. If all services in the cluster are in the **Configuration failed** state, synchronize the cluster configuration with the background configuration. + +Impact on the System +-------------------- + +After synchronizing service configurations, you need to restart the services whose configurations have expired. These services are unavailable during restart. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS cluster details page, click **Components**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Synchronizing Service Configurations `. + +#. Select the target service from the service list. +#. On the Service Status tab page, choose **More** > **Synchronize Configuration**. +#. In the dialog box that is displayed, select **Restart the service or instances whose configurations have expired** and click **Yes** to restart the service. diff --git a/umn/source/managing_clusters/component_management/viewing_configuration.rst b/umn/source/managing_clusters/component_management/viewing_configuration.rst new file mode 100644 index 0000000..0fe8b9c --- /dev/null +++ b/umn/source/managing_clusters/component_management/viewing_configuration.rst @@ -0,0 +1,52 @@ +:original_name: mrs_01_0202.html + +.. _mrs_01_0202: + +Viewing Configuration +===================== + +On MRS, you can view the configuration of services (including roles) and role instances. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +- Query service configuration. + + #. On the MRS cluster details page, click **Components**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Viewing Configurations `. + + #. Select the target service from the service list. + + #. Click **Service Configuration**. + + #. Switch **Basic** to **All**. All configuration parameters of the service are displayed in the navigation tree. The service name and role names are displayed from upper to lower in the navigation tree. + + #. In the navigation tree, select a specified parameter and change its value. You can also enter the parameter name in the **Search** box to search for the parameter and view the result. + + The parameters under the service nodes and role nodes are service configuration parameters and role configuration parameters respectively. + + #. Select **Non-default** from the **--Select--** drop-down list. The parameters whose values are not default values are displayed. + +- Query role instance configurations. + + #. On the MRS cluster details page, click **Components**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Viewing Configurations `. + + #. Select the target service from the service list. + #. Click the **Instances** tab. + #. Click the target role instance from the role instance list. + #. Click **Instance Configuration**. + #. Switch **Basic** to **All** on the right of the page. 
All configuration parameters of the role instance are displayed in the navigation tree. + #. In the navigation tree, select a specified parameter and change its value. You can also enter the parameter name in the **Search** box to search for the parameter and view the result. + #. Select **Non-default** from the **--Select--** drop-down list. The parameters whose values are not default values are displayed. diff --git a/umn/source/managing_clusters/index.rst b/umn/source/managing_clusters/index.rst new file mode 100644 index 0000000..103cefa --- /dev/null +++ b/umn/source/managing_clusters/index.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_0034.html + +.. _mrs_01_0034: + +Managing Clusters +================= + +- :ref:`Logging In to a Cluster ` +- :ref:`Cluster Overview ` +- :ref:`Cluster O&M ` +- :ref:`Managing Nodes ` +- :ref:`Job Management ` +- :ref:`Component Management ` +- :ref:`Alarm Management ` +- :ref:`Patch Management ` +- :ref:`Tenant Management ` +- :ref:`Bootstrap Actions ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + logging_in_to_a_cluster/index + cluster_overview/index + cluster_o&m/index + managing_nodes/index + job_management/index + component_management/index + alarm_management/index + patch_management/index + tenant_management/index + bootstrap_actions/index diff --git a/umn/source/managing_clusters/job_management/configuring_job_notification_rules.rst b/umn/source/managing_clusters/job_management/configuring_job_notification_rules.rst new file mode 100644 index 0000000..a258728 --- /dev/null +++ b/umn/source/managing_clusters/job_management/configuring_job_notification_rules.rst @@ -0,0 +1,39 @@ +:original_name: mrs_01_0762.html + +.. _mrs_01_0762: + +Configuring Job Notification Rules +================================== + +MRS uses SMN to offer a publish/subscribe model to achieve one-to-multiple message subscriptions and notifications in a variety of message types (SMSs and emails). You can configure job notification rules to receive notifications immediately upon a job execution success or failure. + +**Procedure** +------------- + +#. Log in to the management console. +#. Click **Service List**. Under **Management & Governance**, click **Simple Message Notification**. +#. Create a topic and add subscriptions to the topic. For details, see :ref:`Configuring Message Notification `. +#. Go to the MRS management console, and click the cluster name to go to the cluster details page. +#. Click the **Alarms** tab, and choose **Notification Rules** > **Add Notification Rule**. +#. Configure a notification rule for sending job execution results to subscribers. + + .. table:: **Table 1** Parameters of adding a notification rule + + +-----------------------------------+----------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+==========================================================================================================+ + | Rule Name | User-defined notification rule name. Only digits, letters, hyphens (-), and underscores (_) are allowed. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------+ + | Message Notification | If you enable this function, subscription messages will be sent to subscribers. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------+ + | Topic Name | Select an existing topic or click **Create Topic** to create a topic. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------+ + | Notification Type | Select **Event**. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------+ + | Subscription Items | a. Click |image1| next to **Suggestion**. | + | | b. Click |image2| next to **Manager**. | + | | c. Select **Job Running Succeeded** and **Job Running Failed**. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001349137917.png +.. |image2| image:: /_static/images/en-us_image_0000001349137917.png diff --git a/umn/source/managing_clusters/job_management/copying_jobs.rst b/umn/source/managing_clusters/job_management/copying_jobs.rst new file mode 100644 index 0000000..eb76b25 --- /dev/null +++ b/umn/source/managing_clusters/job_management/copying_jobs.rst @@ -0,0 +1,104 @@ +:original_name: mrs_01_0057.html + +.. _mrs_01_0057: + +Copying Jobs +============ + +This section describes how to copy new MRS jobs. Only clusters whose version is MRS 1.7.2 or earlier support job replication. + +Background +---------- + +Currently, all types of jobs except for Spark SQL and Distcp jobs can be copied. + +Procedure +--------- + +#. Log in to the MRS console. + +#. Click |image1| in the upper-left corner on the management console and select a region and project. + +#. Choose **Clusters > Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. + +#. Click **Jobs**. + +#. In the **Operation** column corresponding to the to-be-copied job, choose **More > Copy**. + + The **Copy Job** dialog box is displayed. + +#. Set job parameters, and click **OK**. + + :ref:`Table 1 ` describes job configuration information. + + After being successfully submitted, a job changes to the **Running** state by default. You do not need to manually execute the job. + + .. _mrs_01_0057__table9810376132641: + + .. table:: **Table 1** Job parameters + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=============================================================================================================================================================================================================================================================================================================================================================================+ + | Name | Job name. It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Path | Path of the program package to be executed. The following requirements must be met: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. Example: **s3a://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | - For SparkScript, the path must end with **.sql**. For MapReduce and Spark, the path must end with **.jar**. The **.sql** and **.jar** are case-insensitive. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameters | Key parameter for program execution. The parameter is specified by the function of the user's program. MRS is only responsible for loading the parameter. Multiple parameters are separated by space. | + | | | + | | Configuration method: *Package name*.\ *Class name* | + | | | + | | The parameter contains a maximum of 150,000 characters. It cannot contain special characters ``;|&><'$,`` but can be left blank. | + | | | + | | .. note:: | + | | | + | | When entering a parameter containing sensitive information (for example, login password), you can add an at sign (@) before the parameter name to encrypt the parameter value. This prevents the sensitive information from being persisted in plaintext. When you view job information on the MRS management console, the sensitive information is displayed as **\***. | + | | | + | | Example: **username=admin @password=admin_123** | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Import From | Path for inputting data | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Export To | Path for outputting data | + | | | + | | .. note:: | + | | | + | | - When setting this parameter, select **OBS** or **HDFS**. Select a file directory or manually enter a file directory, and click **OK**. | + | | - If you add the **hadoop-mapreduce-examples-x.x.x.jar** sample program or a program similar to **hadoop-mapreduce-examples-x.x.x.jar**, enter a directory that does not exist. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Path | Path for storing job logs that record job running status. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001295898204.png diff --git a/umn/source/managing_clusters/job_management/deleting_a_job.rst b/umn/source/managing_clusters/job_management/deleting_a_job.rst new file mode 100644 index 0000000..6f8ca26 --- /dev/null +++ b/umn/source/managing_clusters/job_management/deleting_a_job.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_0058.html + +.. _mrs_01_0058: + +Deleting a Job +============== + +This section describes how to delete an MRS job. After a job is executed, you can delete it if you do not need to view its information. + +Background +---------- + +Jobs can be deleted one after another or in a batch. A deleted job cannot be restored. Therefore, exercise caution when deleting a job. + +Procedure +--------- + +#. Log in to the MRS management console. + +#. Click |image1| in the upper-left corner on the management console and select a region and project. + +#. Choose **Clusters** > **Active Clusters**, select a running cluster, and click its name. + + The cluster details page is displayed. + +#. Click **Jobs**. + +#. 
In the **Operation** column corresponding to the target job, choose **More > Delete**. + + In this step, you can delete only one job. + +#. To delete jobs in batches, select multiple jobs and click **Delete** on the upper left of the job list. + + You can delete one, multiple, or all jobs. + +.. |image1| image:: /_static/images/en-us_image_0000001296217924.png diff --git a/umn/source/managing_clusters/job_management/index.rst b/umn/source/managing_clusters/job_management/index.rst new file mode 100644 index 0000000..8403274 --- /dev/null +++ b/umn/source/managing_clusters/job_management/index.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_0522.html + +.. _mrs_01_0522: + +Job Management +============== + +- :ref:`Introduction to MRS Jobs ` +- :ref:`Running a MapReduce Job ` +- :ref:`Running a SparkSubmit Job ` +- :ref:`Running a HiveSQL Job ` +- :ref:`Running a SparkSql Job ` +- :ref:`Running a Flink Job ` +- :ref:`Running a Kafka Job ` +- :ref:`Viewing Job Configuration and Logs ` +- :ref:`Stopping a Job ` +- :ref:`Copying Jobs ` +- :ref:`Deleting a Job ` +- :ref:`Configuring Job Notification Rules ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + introduction_to_mrs_jobs + running_a_mapreduce_job + running_a_sparksubmit_job + running_a_hivesql_job + running_a_sparksql_job + running_a_flink_job + running_a_kafka_job + viewing_job_configuration_and_logs + stopping_a_job + copying_jobs + deleting_a_job + configuring_job_notification_rules diff --git a/umn/source/managing_clusters/job_management/introduction_to_mrs_jobs.rst b/umn/source/managing_clusters/job_management/introduction_to_mrs_jobs.rst new file mode 100644 index 0000000..07974fa --- /dev/null +++ b/umn/source/managing_clusters/job_management/introduction_to_mrs_jobs.rst @@ -0,0 +1,162 @@ +:original_name: mrs_01_0051.html + +.. _mrs_01_0051: + +Introduction to MRS Jobs +======================== + +An MRS job is the basic unit of program execution on MRS and is used to process and analyze user data. After a job is created, all job information is displayed on the **Jobs** tab page. You can view a list of all jobs and create and manage jobs. If the **Jobs** tab is not displayed on the cluster details page, submit a job in the background. + +Data sources processed by MRS are from OBS or HDFS. OBS is an object-based storage service that provides you with massive, secure, reliable, and cost-effective data storage capabilities. MRS can process data in OBS directly. You can view, manage, and use data by using the web page of the management control platform or the OBS client. In addition, you can use REST APIs independently or integrate APIs into service applications to manage and access data. + +Before creating jobs, upload the local data to OBS so that MRS can compute and analyze it. MRS also allows you to export data from OBS to HDFS for computing and analysis. After the analysis and computing are complete, you can store the data in HDFS or export it to OBS. HDFS and OBS can also store compressed data in the **bz2** or **gz** format. + +Category +-------- + +An MRS cluster allows you to create and manage the following types of jobs. If a cluster in the **Running** state fails to create a job, check the health status of related components on the cluster management page. For details, see :ref:`Viewing and Customizing Cluster Monitoring Metrics `. + +- MapReduce: provides the capability of processing massive data quickly and in parallel. It is a distributed data processing mode and execution environment. MRS supports the submission of MapReduce JAR programs. 
+- Spark: a distributed in-memory computing framework. MRS supports SparkSubmit, Spark Script, and Spark SQL jobs. + + - SparkSubmit: You can submit the Spark JAR and Spark Python programs, execute the Spark Application, and compute and process user data. + - SparkScript: You can submit the SparkScript scripts and batch execute Spark SQL statements. + - Spark SQL: You can use Spark SQL statements (similar to SQL statements) to query and analyze user data in real time. + +- Hive: an open-source data warehouse based on Hadoop. MRS allows you to submit HiveScript scripts and execute Hive SQL statements. +- Flink: provides a distributed big data processing engine that can perform stateful computations over both finite and infinite data streams. + +Job List +-------- + +Tasks are listed in chronological order by default in the task list, with the most recent jobs displayed at the top. :ref:`Table 1 ` describes the parameters in the job list. + +.. _mrs_01_0051__table38822211162654: + +.. table:: **Table 1** Job list parameters + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+==============================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Name/ID | Job name, which is set when a job is created. | + | | | + | | ID is the unique identifier of a job. After a job is added, the system automatically assigns a value to ID. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Username | Name of the user who submits a job. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Type | The following data types are supported: | + | | | + | | - DistCp: importing and exporting data | + | | - MapReduce | + | | - Spark | + | | - SparkSubmit | + | | - SparkScript | + | | - Spark SQL | + | | - Hive SQL | + | | - HiveScript | + | | - Flink | + | | | + | | .. note:: | + | | | + | | - After importing and exporting files on the **Files** tab page, you can view the DistCp job on the **Jobs** tab page. 
| + | | - Spark, Hive, and Flink jobs can be added only when the Spark, Hive, and Flink components are selected during cluster creation and the cluster is running. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Job status. | + | | | + | | - Submitted | + | | - Accepted | + | | - Running | + | | - Completed | + | | - Terminated | + | | - Abnormal | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Result | Execution result of a job. | + | | | + | | - Undefined: indicates that the job is being executed. | + | | - **Successful**: indicates that the job has been successfully executed. | + | | - **Killed**: indicates that the job is manually terminated during execution. | + | | - **Failed**: indicates that the job fails to be executed. | + | | | + | | .. note:: | + | | | + | | Once a job has succeeded or failed, you cannot execute it again. However, you can add a job, and set job parameters to submit a job again. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Submitted | Time when a job is submitted. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Ended | Time when a job is completed or manually stopped. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | - Viewing Log: Click **View Log** to view the real-time logs of running jobs. For details, see :ref:`Viewing Job Configuration and Logs `. | + | | - View Details: Click **View Details** to view the detailed configuration information about jobs. For details, see :ref:`Viewing Job Configuration and Logs `. 
| + | | - More | + | | | + | | - Stop: You can click **Stop** to stop a running job. For details, see :ref:`Stopping a Job `. | + | | - Copy: Click **Copy** to add a job. For details, see :ref:`Copying Jobs `. The function of copying jobs is available only in clusters earlier than MRS 1.9.2. | + | | - Delete: Click **Delete** to delete a job. For details, see :ref:`Deleting a Job `. | + | | - View Result: Click **View Result** to view the execution results of SparkSQL and SparkScript jobs whose status is **Completed** and result is **Successful**. | + | | | + | | .. note:: | + | | | + | | - You cannot stop Spark SQL jobs. | + | | - A deleted job cannot be restored. Therefore, exercise caution when deleting a job. | + | | - If you choose to save job logs to OBS or HDFS, the system compresses and saves the logs to the corresponding path after the job execution is completed. Therefore, after a job execution of this type is completed, the job status is still **Running**. After the log is successfully stored, the job status changes to **Completed**. The log storage duration depends on the log size and takes several minutes. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. table:: **Table 2** Icon description + + +-----------------------------------+-------------------------------------------------------------------------------------------------------+ + | Icon | Description | + +===================================+=======================================================================================================+ + | |image1| | Select a time range for job submission to filter jobs submitted in the time range. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------+ + | |image2| | Select a certain job execution result from the drop-down list to display jobs of the status. | + | | | + | | - All statuses: Filter all jobs. | + | | - Successful: Filter jobs that are successfully executed. | + | | - Undefined: Filter jobs that are being executed. | + | | - Killed: Filter jobs that are manually stopped. | + | | - Failed: Filter jobs that fail to be executed. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------+ + | |image3| | Select a certain job type from the drop-down list to display jobs of the type. | + | | | + | | - All types | + | | - MapReduce | + | | - HiveScript | + | | - Distcp | + | | - SparkScript | + | | - Spark SQL | + | | - Hive SQL | + | | - SparkSubmit | + | | - Flink | + +-----------------------------------+-------------------------------------------------------------------------------------------------------+ + | |image4| | In the search box, search for a job by setting the corresponding search condition and click |image5|. | + | | | + | | - Job name. | + | | - Job ID. | + | | - Username. | + | | - Queue name. 
| + +-----------------------------------+-------------------------------------------------------------------------------------------------------+ + | |image6| | Click |image7| to manually refresh the job list. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------+ + +Job Execution Permission Description +------------------------------------ + +For a security cluster with Kerberos authentication enabled, a user needs to synchronize an IAM user before submitting a job on the MRS web UI. After the synchronization is completed, the MRS system generates a user with the same IAM username. Whether a user has the permission to submit jobs depends on the IAM policy bound to the user during IAM synchronization. For details about the job submission policy, see :ref:`Table 1 ` in :ref:`Synchronizing IAM Users to MRS `. + +When a user submits a job that involves the resource usage of a specific component, such as accessing HDFS directories and Hive tables, user **admin** (Manager administrator) must grant the relevant permission to the user. Detailed operations are as follows: + +#. Log in to Manager as user **admin**. +#. Add the role of the component whose permission is required by the user. For details, see :ref:`Creating a Role `. +#. Change the user group to which the user who submits the job belongs and add the new component role to the user group. For details, see :ref:`Related Tasks `. + + .. note:: + + After the component role bound to the user group to which the user belongs is modified, it takes some time for the role permissions to take effect. + +.. |image1| image:: /_static/images/en-us_image_0000001295898328.png +.. |image2| image:: /_static/images/en-us_image_0000001295898368.png +.. |image3| image:: /_static/images/en-us_image_0000001349257505.png +.. |image4| image:: /_static/images/en-us_image_0000001349058029.png +.. |image5| image:: /_static/images/en-us_image_0000001349057965.png +.. |image6| image:: /_static/images/en-us_image_0000001349057929.png +.. |image7| image:: /_static/images/en-us_image_0000001349057929.png diff --git a/umn/source/managing_clusters/job_management/running_a_flink_job.rst b/umn/source/managing_clusters/job_management/running_a_flink_job.rst new file mode 100644 index 0000000..a57f77b --- /dev/null +++ b/umn/source/managing_clusters/job_management/running_a_flink_job.rst @@ -0,0 +1,272 @@ +:original_name: mrs_01_0527.html + +.. _mrs_01_0527: + +Running a Flink Job +=================== + +You can submit programs developed by yourself to MRS to execute them, and obtain the results. This section describes how to submit a Flink job on the MRS management console. Flink jobs are used to submit JAR programs to process streaming data. + +Prerequisites +------------- + +You have uploaded the program packages and data files required for running jobs to OBS or HDFS. + +Submitting a Job on the GUI +--------------------------- + +#. Log in to the MRS console. + +#. Choose **Clusters > Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. + +#. If Kerberos authentication is enabled for the cluster, perform the following steps. If Kerberos authentication is not enabled for the cluster, skip this step. + + In the **Basic Information** area on the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users. For details, see :ref:`Synchronizing IAM Users to MRS `. + + .. 
note:: + + - In MRS 1.7.2 or earlier, the job management function is unavailable in a cluster with Kerberos authentication enabled. You need to submit a job in the background. + - When the policy of the user group to which the IAM user belongs changes from MRS ReadOnlyAccess to MRS CommonOperations, MRS FullAccess, or MRS Administrator, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** (System Security Services Daemon) cache of cluster nodes needs time to be updated. Then, submit a job. Otherwise, the job may fail to be submitted. + - When the policy of the user group to which the IAM user belongs changes from MRS CommonOperations, MRS FullAccess, or MRS Administrator to MRS ReadOnlyAccess, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** cache of cluster nodes needs time to be updated. + +#. Click the **Jobs** tab. + +#. Click **Create**. The **Create Job** page is displayed. + +#. Set **Type** to **Flink**. Configure Flink job information by referring to :ref:`Table 1 `. + + .. _mrs_01_0527__tf38a01bf69f34c29a25317555fc32b92: + + .. table:: **Table 1** Job configuration information + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===========================================================================================================================================================================================================================================================================+ + | Name | Job name. It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Path | Path of the program package to be executed. The following requirements must be met: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **obs://**. Example: **obs://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Parameter | (Optional) Used to configure optimization parameters such as threads, memory, and vCPUs for the job to optimize resource usage and improve job execution performance. 
| + | | | + | | :ref:`Table 2 ` describes the common parameters of a running program. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameters | (Optional) Key parameter for program execution. The parameter is specified by the function of the user's program. MRS is only responsible for loading the parameter. Multiple parameters are separated by space. | + | | | + | | The parameter contains a maximum of 150,000 characters. It cannot contain special characters ``;|&><'$,`` but can be left blank. | + | | | + | | .. caution:: | + | | | + | | CAUTION: | + | | If you enter a parameter with sensitive information (such as the login password), the parameter may be exposed in the job details display and log printing. Exercise caution when performing this operation. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service Parameter | (Optional) It is used to modify service parameters for the job. The parameter modification applies only to the current job. To make the modification take effect permanently for the cluster, follow instructions in :ref:`Configuring Service Parameters `. | + | | | + | | To add multiple parameters, click |image1| on the right. To delete a parameter, click **Delete** on the right. | + | | | + | | :ref:`Table 3 ` describes the common parameters of a service. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Command Reference | Command submitted to the background for execution when a job is submitted. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. _mrs_01_0527__table15713101071912: + + .. table:: **Table 2** Program parameters + + +-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ + | Parameter | Description | Example Value | + +===========+=====================================================================================================================================================================================+======================+ + | -ytm | Memory size of each TaskManager container. (Optional unit. The unit is MB by default.) | 1024 | + +-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ + | -yjm | Memory size of JobManager container. (Optional unit. The unit is MB by default.) 
| 1024 | + +-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ + | -yn | Number of Yarn containers allocated to applications. The value is the same as the number of TaskManagers. | 2 | + +-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ + | -ys | Number of TaskManager cores. | 2 | + +-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ + | -ynm | Custom name of an application on Yarn. | test | + +-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ + | -c | Class of the program entry point (for example, the **main** or **getPlan()** method). This parameter is required only when the JAR file does not specify the class of its manifest. | com.bigdata.mrs.test | + +-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ + + .. note:: + + For MRS 3.x or later, the **-yn** parameter is not supported. + + .. _mrs_01_0527__table1583911183234: + + .. table:: **Table 3** Service parameters + + +-------------------+----------------------------------------------------+---------------+ + | Parameter | Description | Example Value | + +===================+====================================================+===============+ + | fs.obs.access.key | Key ID for accessing OBS. | ``-`` | + +-------------------+----------------------------------------------------+---------------+ + | fs.obs.secret.key | Key corresponding to the key ID for accessing OBS. | ``-`` | + +-------------------+----------------------------------------------------+---------------+ + +#. Confirm job configuration information and click **OK**. + + After the job is created, you can manage it. + +Submitting a Job in the Background +---------------------------------- + +In MRS 3.x and later versions, the default installation path of the client is /opt/Bigdata/client. In MRS 3.x and earlier versions, the default installation path is /opt/client. For details, see the actual situation. + +#. Log in to the MRS client. + +#. Run the following command to initialize environment variables: + + **source /opt/Bigdata/client/bigdata_env** + +#. If Kerberos authentication is enabled for the cluster, perform the following steps. If Kerberos authentication is not enabled for the cluster, skip this step. + + a. Prepare a user for submitting Flink jobs. + + b. Log in to Manager as the newly created user. + + - For MRS 3.x earlier: Log in to Manager of the cluster. Choose **System** > **Manage User**. In the **Operation** column of the row that contains the added user, choose **More** > **Download authentication credential** to locate the row that contains the user. + - For MRS 3.\ *x* or later: Log in to Manager of the cluster. Choose **System** > **Permission** > **Manage User**. 
On the displayed page, locate the row that contains the added user, click **More** in the **Operation** column, and select **Download authentication credential**. + + c. Decompress the downloaded authentication credential package and copy the **user.keytab** file to the client node, for example, to the **/opt/Bigdata/client/Flink/flink/conf** directory on the client node. If the client is installed on a node outside the cluster, copy the **krb5.conf** file to the **/etc/** directory on this node. + + d. For MRS 3.x or later: In security mode, add the service IP address of the node where the client is installed and floating IP address of Manager to the **jobmanager.web.allow-access-address** configuration item in the **/opt/Bigdata/client/Flink/flink/conf/flink-conf.yaml** file. + + e. Run the following commands to configure security authentication by adding the **keytab** path and username to the **/opt/Bigdata/client/Flink/flink/conf/flink-conf.yaml** configuration file. + + **security.kerberos.login.keytab:** ** + + **security.kerberos.login.principal:** ** + + Example: + + security.kerberos.login.keytab: /opt/Bigdata/client/Flink/flink/conf/user.keytab + + security.kerberos.login.principal: test + + f. Run the following command to perform security hardening in the **bin** directory of the Flink client. Set password to a new password for submitting jobs. + + sh generate_keystore.sh <*password*> + + This script automatically replaces the SSL value in the **/opt/Bigdata/client/Flink/flink/conf/flink-conf.yaml** file. For MRS 3.x or earlier, external SSL is disabled by default in security clusters. To enable external SSL, run this script again after configuration. The configuration parameters do not exist in the default Flink configuration of MRS, if you enable SSL for external connections, you need to add the parameters listed in :ref:`Table 4 `. + + .. _mrs_01_0527__table780265116214: + + .. table:: **Table 4** Parameter description + + +---------------------------------------+--------------------------+-------------------------------------------------------------------------------------------+ + | Parameter | Example Value | Description | + +=======================================+==========================+===========================================================================================+ + | security.ssl.rest.enabled | true | Switch to enable external SSL. | + +---------------------------------------+--------------------------+-------------------------------------------------------------------------------------------+ + | security.ssl.rest.keystore | ${path}/flink.keystore | Path for storing **keystore**. | + +---------------------------------------+--------------------------+-------------------------------------------------------------------------------------------+ + | security.ssl.rest.keystore-password | 123456 | Password of the **keystore**. **123456** indicates a user-defined password is required. | + +---------------------------------------+--------------------------+-------------------------------------------------------------------------------------------+ + | security.ssl.rest.key-password | 123456 | Password of the SSL key. **123456** indicates a user-defined password is required. | + +---------------------------------------+--------------------------+-------------------------------------------------------------------------------------------+ + | security.ssl.rest.truststore | ${path}/flink.truststore | Path for storing the **truststore**. 
| + +---------------------------------------+--------------------------+-------------------------------------------------------------------------------------------+ + | security.ssl.rest.truststore-password | 123456 | Password of the **truststore**. **123456** indicates a user-defined password is required. | + +---------------------------------------+--------------------------+-------------------------------------------------------------------------------------------+ + + .. note:: + + - For MRS 3.x or earlier: The **generate_keystore.sh** script is automatically generated. + + - Perform `authentication and encryption `__. The generated **flink.keystore**, **flink.truststore**, and **security.cookie** files are automatically filled in the corresponding configuration items in **flink-conf.yaml**. + + - For MRS 3.\ *x* or later: You can obtain the values of **security.ssl.key-password**, **security.ssl.keystore-password**, and **security.ssl.truststore-password** using the Manager plaintext encryption API by running the following command: + + **curl -k -i -u : -X POST -HContent-type:application/json -d '{"plainText":""}' 'https://x.x.x.x:28443/web/api/v2/tools/encrypt'**; In the preceding command, <*password*> must be the same as the password used for issuing the certificate, and *x.x.x.x* indicates the floating IP address of Manager in the cluster. + + g. Configure paths for the client to access the **flink.keystore** and **flink.truststore** files. + + - Absolute path: After the script is executed, the file path of **flink.keystore** and **flink.truststore** is automatically set to the absolute path **opt/Bigdata/client/Flink/flink/conf/** in the **flink-conf.yaml** file. In this case, you need to move the **flink.keystore** and **flink.truststore** files from the **conf** directory to this absolute path on the Flink client and Yarn nodes. + - Relative path: Perform the following steps to set the file path of **flink.keystore** and **flink.truststore** to the relative path and ensure that the directory where the Flink client command is executed can directly access the relative paths. + + #. In the **/opt/Bigdata/client/Flink/flink/conf/**\ directory, create a new directory, for example, **ssl**. + + #. Move the **flink.keystore** and **flink.truststore** file to the /**opt/Bigdata/client/Flink/flink/conf/ssl/** directory. + + #. For MRS 3.\ *x* or later: Change the values of the following parameters in the **flink-conf.yaml** file to relative paths: + + .. code-block:: + + security.ssl.keystore: ssl/flink.keystore + security.ssl.truststore: ssl/flink.truststore + + #. For MRS 3.x or earlier: Change the values of the following parameters in the **flink-conf.yaml** file to relative paths: + + .. code-block:: + + security.ssl.internal.keystore: ssl/flink.keystore + security.ssl.internal.truststore: ssl/flink.truststore + + h. If the client is installed on a node outside the cluster, add the following configuration to the configuration file (for example, **/opt/Bigdata/client/Flink/fink/conf/flink-conf.yaml**). Replace **xx.xx.xxx.xxx** with the IP address of the node where the client resides. + + .. code-block:: + + web.access-control-allow-origin: xx.xx.xxx.xxx + jobmanager.web.allow-access-address: xx.xx.xxx.xxx + +#. Run a wordcount job. + + - Normal cluster (Kerberos authentication disabled) + + - Run the following commands to start a session and submit a job in the session: + + .. 
code-block:: + + yarn-session.sh -nm "session-name" + flink run /opt/Bigdata/client/Flink/flink/examples/streaming/WordCount.jar + + - Run the following command to submit a single job on Yarn: + + .. code-block:: + + flink run -m yarn-cluster /opt/Bigdata/client/Flink/flink/examples/streaming/WordCount.jar + + - Security cluster (Kerberos authentication enabled) + + - If the **flink.keystore** and **flink.truststore** file are stored in the absolute path: + + - Run the following commands to start a session and submit a job in the session: + + .. code-block:: + + yarn-session.sh -nm "session-name" + flink run /opt/Bigdata/client/Flink/flink/examples/streaming/WordCount.jar + + - Run the following command to submit a single job on Yarn: + + .. code-block:: + + flink run -m yarn-cluster /opt/Bigdata/client/Flink/flink/examples/streaming/WordCount.jar + + - If the **flink.keystore** and **flink.truststore** file are stored in the relative path: + + - In the same directory of SSL, run the following command to start a session and submit jobs in the session. The SSL directory is a relative path. For example, if the SSL directory is **opt/Bigdata/client/Flink/flink/conf/**, then run the following command in this directory: + + .. code-block:: + + yarn-session.sh -t ssl/ -nm "session-name" + flink run /opt/Bigdata/client/Flink/flink/examples/streaming/WordCount.jar + + - Run the following command to submit a single job on Yarn: + + .. code-block:: + + flink run -m yarn-cluster -yt ssl/ /opt/Bigdata/client/Flink/flink/examples/streaming/WordCount.jar + +.. |image1| image:: /_static/images/en-us_image_0000001349137577.png diff --git a/umn/source/managing_clusters/job_management/running_a_hivesql_job.rst b/umn/source/managing_clusters/job_management/running_a_hivesql_job.rst new file mode 100644 index 0000000..19f7525 --- /dev/null +++ b/umn/source/managing_clusters/job_management/running_a_hivesql_job.rst @@ -0,0 +1,221 @@ +:original_name: mrs_01_0525.html + +.. _mrs_01_0525: + +Running a HiveSQL Job +===================== + +You can submit programs developed by yourself to MRS to execute them, and obtain the results. This section describes how to submit a HiveSQL job on the MRS management console. HiveSQL jobs are used to submit SQL statements and script files for data query and analysis. Both SQL statements and scripts are supported. If SQL statements contain sensitive information, use Script to submit them. + +Prerequisites +------------- + +You have uploaded the program packages and data files required for running jobs to OBS or HDFS. + +Submitting a Job on the GUI +--------------------------- + +#. Log in to the MRS console. + +#. Choose **Clusters > Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. + +#. If Kerberos authentication is enabled for the cluster, perform the following steps. If Kerberos authentication is not enabled for the cluster, skip this step. + + In the **Basic Information** area on the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users. For details, see :ref:`Synchronizing IAM Users to MRS `. + + .. note:: + + - In MRS 1.7.2 or earlier, the job management function is unavailable in a cluster with Kerberos authentication enabled. You need to submit a job in the background. 
+ - When the policy of the user group to which the IAM user belongs changes from MRS ReadOnlyAccess to MRS CommonOperations, MRS FullAccess, or MRS Administrator, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** (System Security Services Daemon) cache of cluster nodes needs time to be updated. Then, submit a job. Otherwise, the job may fail to be submitted. + - When the policy of the user group to which the IAM user belongs changes from MRS CommonOperations, MRS FullAccess, or MRS Administrator to MRS ReadOnlyAccess, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** cache of cluster nodes needs time to be updated. + +#. Click the **Jobs** tab. + +#. Click **Create**. The **Create Job** page is displayed. + +#. Configure job information. + + - Set **Type** to **HiveSQL**\ if the cluster version is MRS 1.9.2 or later. Configure other parameters of the HiveSQL job by referring to :ref:`Table 1 `. + - Set **Type** to **Hive Script**\ if the cluster version is earlier than MRS 1.9.2. Configure other parameters of the Hive Script job by referring to :ref:`Table 4 `. + + .. _mrs_01_0525__tf38a01bf69f34c29a25317555fc32b92: + + .. table:: **Table 1** Job configuration information + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+================================================================================================================================================================================================================================================================================================+ + | Name | Job name. It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SQL Type | Submission type of the SQL statement | + | | | + | | - SQL | + | | - Script | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SQL Statement | This parameter is valid only when **SQL Type** is set to **SQL**. Enter the SQL statement to be executed, and then click **Check** to check whether the SQL statement is correct. If you want to submit and execute multiple statements at the same time, use semicolons (;) to separate them. 
| + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SQL File | This parameter is valid only when **SQL Type** is set to **Script**. The path of the SQL file to be executed must meet the following requirements: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **obs://**. Example: **obs://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | - For SparkScript and HiveScript, the path must end with **.sql**. For MapReduce, the path must end with **.jar**. For Flink and SparkSubmit, the path must end with **.jar** or **.py**. The **.sql**, **.jar**, and **.py** are case-insensitive. | + | | | + | | .. note:: | + | | | + | | For MRS 1.9.2 or later: A file path on OBS can start with **obs://**. To submit jobs in this format, you need to configure permissions for accessing OBS. | + | | | + | | - If the OBS permission control function is enabled during cluster creation, you can use the **obs://** directory without extra configuration. | + | | - If the OBS permission control function is not enabled or is not supported when you create a cluster, configure the function by following instructions in :ref:`Accessing OBS `. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Parameter | (Optional) Used to configure optimization parameters such as threads, memory, and vCPUs for the job to optimize resource usage and improve job execution performance. | + | | | + | | :ref:`Table 2 ` describes the common parameters of a running program. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service Parameter | (Optional) It is used to modify service parameters for the job. The parameter modification applies only to the current job. To make the modification take effect permanently for the cluster, follow instructions in :ref:`Configuring Service Parameters `. | + | | | + | | To add multiple parameters, click |image1| on the right. To delete a parameter, click **Delete** on the right. | + | | | + | | :ref:`Table 3 ` lists the common service configuration parameters. 
| + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Command Reference | Command submitted to the background for execution when a job is submitted. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. _mrs_01_0525__table15713101071912: + + .. table:: **Table 2** Program parameters + + +------------+---------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +============+=================================================================================+==============================================================================================+ + | --hiveconf | Hive service configuration, for example, set the execution engine to MapReduce. | Setting the execution engine to MR: **--hiveconf "hive.execution.engine=mr"** | + +------------+---------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+ + | --hivevar | Custom variable, for example, variable ID. | Setting the variable ID: **--hivevar id="123" select \* from test where id = ${hivevar:id}** | + +------------+---------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+ + + .. _mrs_01_0525__table1583911183234: + + .. table:: **Table 3** Service parameters + + +-----------------------+----------------------------------------------------+-----------------------+ + | Parameter | Description | Example Value | + +=======================+====================================================+=======================+ + | fs.obs.access.key | Key ID for accessing OBS. | ``-`` | + +-----------------------+----------------------------------------------------+-----------------------+ + | fs.obs.secret.key | Key corresponding to the key ID for accessing OBS. | ``-`` | + +-----------------------+----------------------------------------------------+-----------------------+ + | hive.execution.engine | Engine for running a job. | - mr | + | | | - tez | + +-----------------------+----------------------------------------------------+-----------------------+ + + .. _mrs_01_0525__table19597131417176: + + .. 
table:: **Table 4** Job configuration information + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=============================================================================================================================================================================================================================================================================================================================================================================+ + | Name | Job name. It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Path | Path of the program package to be executed. The following requirements must be met: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. Example: **s3a://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | - For SparkScript, the path must end with **.sql**. For MapReduce and Spark, the path must end with **.jar**. The **.sql** and **.jar** are case-insensitive. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameters | Key parameter for program execution. The parameter is specified by the function of the user's program. MRS is only responsible for loading the parameter. Multiple parameters are separated by space. | + | | | + | | Configuration method: *Package name*.\ *Class name* | + | | | + | | The parameter contains a maximum of 150,000 characters. It cannot contain special characters ``;|&><'$,`` but can be left blank. | + | | | + | | .. note:: | + | | | + | | When entering a parameter containing sensitive information (for example, login password), you can add an at sign (@) before the parameter name to encrypt the parameter value. This prevents the sensitive information from being persisted in plaintext. 
When you view job information on the MRS management console, the sensitive information is displayed as **\***. | + | | | + | | Example: **username=admin @password=admin_123** | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Import From | Path for inputting data | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Export To | Path for outputting data | + | | | + | | .. note:: | + | | | + | | - When setting this parameter, select **OBS** or **HDFS**. Select a file directory or manually enter a file directory, and click **OK**. | + | | - If you add the **hadoop-mapreduce-examples-x.x.x.jar** sample program or a program similar to **hadoop-mapreduce-examples-x.x.x.jar**, enter a directory that does not exist. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Path | Path for storing job logs that record job running status. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Confirm job configuration information and click **OK**. 
+ + After the job is created, you can manage it. + +Submitting a Job in the Background +---------------------------------- + +#. Log in to a Master node. For details, see :ref:`Logging In to an ECS `. + +#. Run the following command to initialize environment variables: + + **source /opt/Bigdata/client/bigdata_env** + + .. note:: + + - In MRS 3.x and later versions, the default installation path of the client is /opt/Bigdata/client. In versions earlier than MRS 3.x, the default installation path is /opt/client. For details, see the actual situation. + + - If you use the client to connect to a specific Hive instance in a scenario where multiple Hive instances are installed, run the following command to load the environment variables of the instance. Otherwise, skip this step. For example, load the environment variables of the Hive2 instance. + + **source /opt/Bigdata/client/Hive2/component_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster (normal mode), skip this step. + + **kinit** *MRS cluster user* (The user must belong to the **hive** user group.) + +#. Run the **beeline** command to connect to HiveServer and run tasks. + + **beeline** + + For clusters in normal mode, run the following commands. If no component service user is specified, the current OS user is used to log in to HiveServer. + + **beeline -n** *Component service user* + + **beeline -f** *SQL file* (executes the SQL statements in the file) + +.. |image1| image:: /_static/images/en-us_image_0000001349137577.png diff --git a/umn/source/managing_clusters/job_management/running_a_kafka_job.rst b/umn/source/managing_clusters/job_management/running_a_kafka_job.rst new file mode 100644 index 0000000..9cdb284 --- /dev/null +++ b/umn/source/managing_clusters/job_management/running_a_kafka_job.rst @@ -0,0 +1,68 @@ +:original_name: mrs_01_0494.html + +.. _mrs_01_0494: + +Running a Kafka Job +=================== + +You can submit programs developed by yourself to MRS to execute them, and obtain the results. This topic describes how to generate and consume messages in a Kafka topic. + +Currently, Kafka jobs cannot be submitted on the GUI. You can submit them in the background. + +Submitting a Job in the Background +---------------------------------- + +Query the instance addresses of ZooKeeper and Kafka, and then run the Kafka job. + +**Querying the Instance Address (MRS 3.x or Later)** + +#. Log in to the MRS console. +#. Choose **Clusters** > **Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. +#. Go to the FusionInsight Manager page. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. On FusionInsight Manager, choose **Services** > **ZooKeeper** > **Instance** to query the IP addresses of ZooKeeper instances. Record any IP address of a ZooKeeper instance. +#. Choose **Services** > **Kafka** > **Instance** to query the IP addresses of Kafka instances. Record any IP address of a Kafka instance. + +**Querying the Instance Address (Versions Earlier Than MRS 3.x)** + +#. Log in to the MRS console. +#. Choose **Clusters** > **Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. +#. On the MRS cluster details page, choose **Components > ZooKeeper > Instance** to query the IP addresses of ZooKeeper instances. Record any IP address of a ZooKeeper instance. + + .. 
note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. Choose **Services** > **ZooKeeper** > **Instance** to query the IP addresses of ZooKeeper instances. Record any IP address of a ZooKeeper instance. + +#. Choose **Components > Kafka > Instance** to query the IP addresses of Kafka instances. Record any IP address of a Kafka instance. + +**Running a Kafka Job** + +In MRS 3.x and later versions, the default installation path of the client is /opt/Bigdata/client. In MRS 3.x and earlier versions, the default installation path is /opt/client. For details, see the actual situation. + +#. Log in to the Master2 node. For details, see :ref:`Logging In to an ECS `. + +#. Run the following command to initialize environment variables: + + **source /opt/Bigdata/client/bigdata_env** + +#. If the Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If the Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** **MRS cluster user** + + Example: **kinit admin** + +#. Run the following command to create a Kafka topic: + + **kafka-topics.sh --create --zookeeper --partitions 2 --replication-factor 2 --topic ** + +#. Produce messages in a topic test. + + Run the following command: **kafka-console-producer.sh --broker-list --topic --producer.config /opt/Bigdata/client/Kafka/kafka/config/producer.properties**. + + Input specified information as the messages produced by the producer and then press **Enter** to send the messages. To end message production, press **Ctrl+C** to exit. + +#. Consume messages in the topic test. + + **kafka-console-consumer.sh --topic --bootstrap-server --consumer.config /opt/Bigdata/client/Kafka/kafka/config/consumer.properties** + + .. note:: + + If Kerberos authentication is enabled in the cluster, change the port number 9092 to 21007 when running the preceding two commands. For details, see :ref:`List of Open Source Component Ports `. diff --git a/umn/source/managing_clusters/job_management/running_a_mapreduce_job.rst b/umn/source/managing_clusters/job_management/running_a_mapreduce_job.rst new file mode 100644 index 0000000..ea94250 --- /dev/null +++ b/umn/source/managing_clusters/job_management/running_a_mapreduce_job.rst @@ -0,0 +1,215 @@ +:original_name: mrs_01_0052.html + +.. _mrs_01_0052: + +Running a MapReduce Job +======================= + +You can submit programs developed by yourself to MRS to execute them, and obtain the results. This section describes how to submit a MapReduce job on the MRS management console. MapReduce jobs are used to submit JAR programs to quickly process massive amounts of data in parallel and create a distributed data processing and execution environment. + +If the job and file management functions are not supported on the cluster details page, submit the jobs in the background. + +Prerequisites +------------- + +You have uploaded the program packages and data files required for running jobs to OBS or HDFS. + +Before you upload the program packages and data files to OBS, you need to create an OBS agency and bind it to the MRS cluster. For details, see :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) `. + +Submitting a Job on the GUI +--------------------------- + +#. Log in to the MRS console. + +#. Choose **Clusters > Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. + +#. 
If Kerberos authentication is enabled for the cluster, perform the following steps. If Kerberos authentication is not enabled for the cluster, skip this step. + + In the **Basic Information** area on the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users. For details, see :ref:`Synchronizing IAM Users to MRS `. + + .. note:: + + - In MRS 1.7.2 or earlier, the job management function is unavailable in a cluster with Kerberos authentication enabled. You need to submit a job in the background. + - When the policy of the user group to which the IAM user belongs changes from MRS ReadOnlyAccess to MRS CommonOperations, MRS FullAccess, or MRS Administrator, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** (System Security Services Daemon) cache of cluster nodes needs time to be updated. Then, submit a job. Otherwise, the job may fail to be submitted. + - When the policy of the user group to which the IAM user belongs changes from MRS CommonOperations, MRS FullAccess, or MRS Administrator to MRS ReadOnlyAccess, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** cache of cluster nodes needs time to be updated. + +#. Click the **Jobs** tab. + +#. Click **Create**. The **Create Job** page is displayed. + + .. note:: + + If the IAM username contains spaces (for example, **admin 01**), a job cannot be created. + +#. In **Type**, select **MapReduce**. Configure other job information. + + - Configure MapReduce job information by referring to :ref:`Table 1 `\ if the cluster version is MRS 1.9.2 or later. + - Configure MapReduce job information by referring to :ref:`Table 3 ` if the cluster version is earlier than MRS 1.9.2. + + .. _mrs_01_0052__table2037463920278: + + .. table:: **Table 1** Job configuration information + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===========================================================================================================================================================================================================================================================================+ + | Name | Job name. It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Path | Path of the program package to be executed. The following requirements must be met: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **obs://**. 
Example: **obs://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | - For SparkScript and HiveScript, the path must end with **.sql**. For MapReduce, the path must end with **.jar**. For Flink and SparkSubmit, the path must end with **.jar** or **.py**. The **.sql**, **.jar**, and **.py** are case-insensitive. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameters | (Optional) It is the key parameter for program execution. Multiple parameters are separated by space. | + | | | + | | Configuration method: *Program class name* *Data input path* *Data output path* | + | | | + | | - Program class name: It is specified by a function in your program. MRS is responsible for transferring parameters only. | + | | | + | | - Data input path: Click **HDFS** or **OBS** to select a path or manually enter a correct path. | + | | | + | | - Data output path: Enter a directory that does not exist. | + | | | + | | The parameter contains a maximum of 150,000 characters. It cannot contain special characters ``;|&><'$,`` but can be left blank. | + | | | + | | .. caution:: | + | | | + | | CAUTION: | + | | If you enter a parameter with sensitive information (such as the login password), the parameter may be exposed in the job details display and log printing. Exercise caution when performing this operation. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service Parameter | (Optional) It is used to modify service parameters for the job. The parameter modification applies only to the current job. To make the modification take effect permanently for the cluster, follow instructions in :ref:`Configuring Service Parameters `. | + | | | + | | To add multiple parameters, click |image1| on the right. To delete a parameter, click **Delete** on the right. | + | | | + | | :ref:`Table 2 ` lists the common service configuration parameters. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Command Reference | Command submitted to the background for execution when a job is submitted. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. _mrs_01_0052__table12538926589: + + .. 
table:: **Table 2** **Service Parameter** parameters + + +-------------------+----------------------------------------------------+---------------+ + | Parameter | Description | Example Value | + +===================+====================================================+===============+ + | fs.obs.access.key | Key ID for accessing OBS. | ``-`` | + +-------------------+----------------------------------------------------+---------------+ + | fs.obs.secret.key | Key corresponding to the key ID for accessing OBS. | ``-`` | + +-------------------+----------------------------------------------------+---------------+ + + .. _mrs_01_0052__table13750103511814: + + .. table:: **Table 3** Job configuration information + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=============================================================================================================================================================================================================================================================================================================================================================================+ + | Name | Job name. It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Path | Path of the program package to be executed. The following requirements must be met: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. Example: **s3a://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | - For SparkScript, the path must end with **.sql**. For MapReduce and Spark, the path must end with **.jar**. The **.sql** and **.jar** are case-insensitive. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameters | Key parameter for program execution. The parameter is specified by the function of the user's program. 
MRS is only responsible for loading the parameter. Multiple parameters are separated by space. | + | | | + | | Configuration method: *Package name*.\ *Class name* | + | | | + | | The parameter contains a maximum of 150,000 characters. It cannot contain special characters ``;|&><'$,`` but can be left blank. | + | | | + | | .. note:: | + | | | + | | When entering a parameter containing sensitive information (for example, login password), you can add an at sign (@) before the parameter name to encrypt the parameter value. This prevents the sensitive information from being persisted in plaintext. When you view job information on the MRS management console, the sensitive information is displayed as **\***. | + | | | + | | Example: **username=admin @password=admin_123** | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Import From | Path for inputting data | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Export To | Path for outputting data | + | | | + | | .. note:: | + | | | + | | - When setting this parameter, select **OBS** or **HDFS**. Select a file directory or manually enter a file directory, and click **OK**. | + | | - If you add the **hadoop-mapreduce-examples-x.x.x.jar** sample program or a program similar to **hadoop-mapreduce-examples-x.x.x.jar**, enter a directory that does not exist. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. (Supported only in MRS 1.8.10 and earlier versions) | + | | - HDFS: The path must start with **/user**. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Path | Path for storing job logs that record job running status. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. 
| + | | - HDFS: The path must start with **/user**. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Confirm job configuration information and click **OK**. + + After the job is created, you can manage it. + +Submitting a Job in the Background +---------------------------------- + +In MRS 3.x and later versions, the default installation path of the client is /opt/Bigdata/client. In MRS 3.x and earlier versions, the default installation path is /opt/client. For details, see the actual situation. + +#. Log in to a Master node. For details, see :ref:`Logging In to an ECS `. + +#. Run the following command to initialize environment variables: + + **source /opt/Bigdata/client/bigdata_env** + +#. If the Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If the Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** **MRS cluster user** + + Example: **kinit admin** + +#. Run the following command to copy the program in the OBS file system to the Master node in the cluster: + + **hadoop fs -Dfs.obs.access.key=AK -Dfs.obs.secret.key=SK -copyToLocal source_path.jar target_path.jar** + + Example: **hadoop fs -Dfs.obs.access.key=XXXX -Dfs.obs.secret.key=XXXX -copyToLocal "obs://mrs-word/program/hadoop-mapreduce-examples-XXX.jar" "/home/omm/hadoop-mapreduce-examples-XXX.jar"** + + You can log in to OBS Console using AK/SK. To obtain AK/SK information, click the username in the upper right corner of the management console and choose **My Credentials** > **Access Keys**. + +#. Run the following command to submit a wordcount job. If data needs to be read from OBS or outputted to OBS, the AK/SK parameters need to be added. + + **source /opt/Bigdata/client/bigdata_env;hadoop jar execute_jar wordcount input_path output_path** + + Example: **source /opt/Bigdata/client/bigdata_env;hadoop jar /home/omm/hadoop-mapreduce-examples-XXX.jar wordcount -Dfs.obs.access.key=XXXX -Dfs.obs.secret.key=XXXX "obs://mrs-word/input/*" "obs://mrs-word/output/"** + + In the preceding command, **input_path** indicates a path for storing job input files on OBS. **output_path** indicates a path for storing job output files on OBS and needs to be set to a directory that does not exist + +.. |image1| image:: /_static/images/en-us_image_0000001349137577.png diff --git a/umn/source/managing_clusters/job_management/running_a_sparksql_job.rst b/umn/source/managing_clusters/job_management/running_a_sparksql_job.rst new file mode 100644 index 0000000..637f118 --- /dev/null +++ b/umn/source/managing_clusters/job_management/running_a_sparksql_job.rst @@ -0,0 +1,232 @@ +:original_name: mrs_01_0526.html + +.. _mrs_01_0526: + +Running a SparkSql Job +====================== + +You can submit programs developed by yourself to MRS to execute them, and obtain the results. This section describes how to submit a SparkSQL job on the MRS console. SparkSQL jobs are used for data query and analysis. Both SQL statements and scripts are supported. 
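+For example, the **SQL Statement** field described below might contain statements such as the following. This is an illustrative sketch only; the table and column names are placeholders and are not shipped with MRS:
+
+.. code-block::
+
+   CREATE TABLE IF NOT EXISTS demo_src (id INT, name STRING);
+   SELECT name, COUNT(*) AS cnt FROM demo_src GROUP BY name;
+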
If SQL statements contain sensitive information, use Spark Script to submit them. + +Prerequisites +------------- + +You have uploaded the program packages and data files required for running jobs to OBS or HDFS. + +Submitting a Job on the GUI +--------------------------- + +#. Log in to the MRS console. + +#. Choose **Clusters > Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. + +#. If Kerberos authentication is enabled for the cluster, perform the following steps. If Kerberos authentication is not enabled for the cluster, skip this step. + + In the **Basic Information** area on the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users. For details, see :ref:`Synchronizing IAM Users to MRS `. + + .. note:: + + - In MRS 1.7.2 or earlier, the job management function is unavailable in a cluster with Kerberos authentication enabled. You need to submit a job in the background. + - When the policy of the user group to which the IAM user belongs changes from MRS ReadOnlyAccess to MRS CommonOperations, MRS FullAccess, or MRS Administrator, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** (System Security Services Daemon) cache of cluster nodes needs time to be updated. Then, submit a job. Otherwise, the job may fail to be submitted. + - When the policy of the user group to which the IAM user belongs changes from MRS CommonOperations, MRS FullAccess, or MRS Administrator to MRS ReadOnlyAccess, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** cache of cluster nodes needs time to be updated. + +#. Click the **Jobs** tab. + +#. For clusters of MRS 1.9.2 or later: Click **Create**. On the displayed **Create Job** page, set **Type** to **SparkSql** and configure SparkSql job information by referring to :ref:`Table 1 `. + + .. _mrs_01_0526__table063345817183: + + .. table:: **Table 1** Job configuration information + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+================================================================================================================================================================================================================================================================================================+ + | Name | Job name. It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. 
| + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SQL Type | Submission type of the SQL statement | + | | | + | | - SQL | + | | - Script | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SQL Statement | This parameter is valid only when **SQL Type** is set to **SQL**. Enter the SQL statement to be executed, and then click **Check** to check whether the SQL statement is correct. If you want to submit and execute multiple statements at the same time, use semicolons (;) to separate them. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SQL File | This parameter is valid only when **SQL Type** is set to **Script**. The path of the SQL file to be executed must meet the following requirements: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **obs://**. Example: **obs://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | - For SparkScript and HiveScript, the path must end with **.sql**. For MapReduce, the path must end with **.jar**. For Flink and SparkSubmit, the path must end with **.jar** or **.py**. The **.sql**, **.jar**, and **.py** are case-insensitive. | + | | | + | | .. note:: | + | | | + | | For clusters of MRS 1.9.2or later: A file path on OBS can start with **obs://**. To submit jobs in this format, you need to configure permissions for accessing OBS. | + | | | + | | - If the OBS permission control function is enabled during cluster creation, you can use the **obs://** directory without extra configuration. | + | | - If the OBS permission control function is not enabled or is not supported when you create a cluster, configure the function by following instructions in :ref:`Accessing OBS `. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Parameter | (Optional) Used to configure optimization parameters such as threads, memory, and vCPUs for the job to optimize resource usage and improve job execution performance. | + | | | + | | :ref:`Table 2 ` describes the common parameters of a running program. 
| + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service Parameter | (Optional) It is used to modify service parameters for the job. The parameter modification applies only to the current job. To make the modification take effect permanently for the cluster, follow instructions in :ref:`Configuring Service Parameters `. | + | | | + | | To add multiple parameters, click |image1| on the right. To delete a parameter, click **Delete** on the right. | + | | | + | | :ref:`Table 3 ` lists the common service configuration parameters. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Command Reference | Command submitted to the background for execution when a job is submitted. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. _mrs_01_0526__table15713101071912: + + .. table:: **Table 2** Program parameters + + +-------------------+--------------------------------------------------------------------------------------------------------------+--------------------------+ + | Parameter | Description | Example Value | + +===================+==============================================================================================================+==========================+ + | --conf | Task configuration items to be added. | spark.executor.memory=2G | + +-------------------+--------------------------------------------------------------------------------------------------------------+--------------------------+ + | --driver-memory | Running memory of a driver. | 2G | + +-------------------+--------------------------------------------------------------------------------------------------------------+--------------------------+ + | --num-executors | Number of executors to be started. | 5 | + +-------------------+--------------------------------------------------------------------------------------------------------------+--------------------------+ + | --executor-cores | Number of executor cores. | 2 | + +-------------------+--------------------------------------------------------------------------------------------------------------+--------------------------+ + | --jars | Additional dependency packages of a task, which is used to add the external dependency packages to the task. | ``-`` | + +-------------------+--------------------------------------------------------------------------------------------------------------+--------------------------+ + | --executor-memory | Executor memory. | 2G | + +-------------------+--------------------------------------------------------------------------------------------------------------+--------------------------+ + + .. _mrs_01_0526__table1583911183234: + + .. 
table:: **Table 3** Service parameters + + +-------------------+----------------------------------------------------+---------------+ + | Parameter | Description | Example Value | + +===================+====================================================+===============+ + | fs.obs.access.key | Key ID for accessing OBS. | ``-`` | + +-------------------+----------------------------------------------------+---------------+ + | fs.obs.secret.key | Key corresponding to the key ID for accessing OBS. | ``-`` | + +-------------------+----------------------------------------------------+---------------+ + +#. For clusters of MRS 1.9.2 or earlier: Perform the following operations to create SparkScript and Spark SQL jobs. + + - SparkScript job: On the **Jobs** tab page, click **Create**. On the displayed **Create Job** page, set **Type** to **SparkScript** and configure job information by referring to :ref:`Table 4 `. + + - Spark SQL job: Click the **Spark SQL** tab, add SQL statements, and submit the SQL statements after a check. + + .. _mrs_01_0526__table1444481515202: + + .. table:: **Table 4** Job configuration information + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+==================================================================================================================================================================================================================================================================================================================================================================+ + | Name | Job name. It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Path | Path of the program package to be executed. The following requirements must be met: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. Example: **s3a://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | - For SparkScript, the path must end with **.sql**. For MapReduce and Spark, the path must end with **.jar**. The **.sql** and **.jar** are case-insensitive. 
| + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameters | Key parameter for program execution. The parameter is specified by the function of the user's program. MRS is only responsible for loading the parameter. Multiple parameters are separated by space. | + | | | + | | Configuration method: *Package name*.\ *Class name* | + | | | + | | The parameter contains a maximum of 150,000 characters. It cannot contain special characters ``;|&><'$,`` but can be left blank. | + | | | + | | .. note:: | + | | | + | | When entering a parameter containing sensitive information (for example, login password), you can add an at sign (@) before the parameter name to encrypt the parameter value. This prevents the sensitive information from being persisted in plaintext. When you view job information on the MRS console, the sensitive information is displayed as **\***. | + | | | + | | Example: **username=admin @password=admin_123** | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Import From | Path for inputting data | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Export To | Path for outputting data | + | | | + | | .. note:: | + | | | + | | - When setting this parameter, select **OBS** or **HDFS**. Select a file directory or manually enter a file directory, and click **OK**. | + | | - If you add the **hadoop-mapreduce-examples-x.x.x.jar** sample program or a program similar to **hadoop-mapreduce-examples-x.x.x.jar**, enter a directory that does not exist. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. 
| + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Path | Path for storing job logs that record job running status. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Confirm job configuration information and click **OK**. + + After the job is created, you can manage it. + +Submitting a Job in the Background +---------------------------------- + +In MRS 3.x and later versions, the default installation path of the client is /opt/Bigdata/client. In MRS 3.x and earlier versions, the default installation path is /opt/client. For details, see the actual situation. + +#. Create a user for submitting jobs. For details, see :ref:`Creating a User `. + + In this example, a machine-machine user used in the user development scenario has been created, and user groups (**hadoop** and **supergroup**), the primary group (**supergroup**), and role permissions (**System_administrator** and **default**) have been correctly assigned to the user. + +#. .. _mrs_01_0526__li145131249134814: + + Download the authentication credential. + + - For clusters of MRS 3.\ *x* or later, log in to FusionInsight Manager and choose **System** > **Permission** > **User**. In the **Operation** column of the newly created user, choose **More** > **Download Authentication Credential**. + - For clusters whose version is earlier than MRS 3.\ *x*, log in to MRS Manager and choose **System** > **Manage User**. In the **Operation** column of the newly created user, choose **More** > **Download Authentication Credential**. + +#. Log in to the node where the Spark client is located, upload the user authentication credential created in :ref:`2 ` to the **/opt** directory of the cluster, and run the following command to decompress the package: + + **tar -xvf MRSTest \_**\ *xxxxxx*\ **\_keytab.tar** + + After the decompression, you obtain the **user.keytab** and **krb5.conf** files. + +#. Before performing operations on the cluster, run the following commands: + + **source /opt/Bigdata/client/bigdata_env** + + **cd $SPARK_HOME** + +#. Open the **spark-sql** CLI and run the following SQL statement: + + **./bin/spark-sql --conf spark.yarn.principal=MRSTest --conf spark.yarn.keytab=/opt/user.keytab** + + To execute the SQL file, you need to upload the SQL file (for example, to the **/opt/** directory). After the file is uploaded, run the following command: + + **./bin/spark-sql --conf spark.yarn.principal=MRSTest --conf spark.yarn.keytab=/opt/user.keytab -f /opt/script.sql** + +.. 
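+   For example, the SQL file could be prepared and executed as follows. This is a sketch only: it assumes the **MRSTest** user created above, the default client installation path, and a purely illustrative table name; adapt the paths and statements to your environment.
+
+   .. code-block::
+
+      # Create a sample SQL file (table and column names are placeholders)
+      cat > /opt/script.sql <<'EOF'
+      CREATE TABLE IF NOT EXISTS demo_src (id INT, name STRING);
+      SELECT name, COUNT(*) AS cnt FROM demo_src GROUP BY name;
+      EOF
+
+      # Execute the file with the keytab-based credentials decompressed earlier
+      ./bin/spark-sql --conf spark.yarn.principal=MRSTest --conf spark.yarn.keytab=/opt/user.keytab -f /opt/script.sql
+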
|image1| image:: /_static/images/en-us_image_0000001349137577.png diff --git a/umn/source/managing_clusters/job_management/running_a_sparksubmit_job.rst b/umn/source/managing_clusters/job_management/running_a_sparksubmit_job.rst new file mode 100644 index 0000000..433d457 --- /dev/null +++ b/umn/source/managing_clusters/job_management/running_a_sparksubmit_job.rst @@ -0,0 +1,242 @@ +:original_name: mrs_01_0524.html + +.. _mrs_01_0524: + +Running a SparkSubmit Job +========================= + +You can submit programs developed by yourself to MRS to execute them, and obtain the results. This section describes how to submit a Spark job on the MRS console. + +Prerequisites +------------- + +You have uploaded the program packages and data files required for running jobs to OBS or HDFS. + +Submitting a Job on the GUI +--------------------------- + +#. Log in to the MRS console. + +#. Choose **Clusters > Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. + +#. If Kerberos authentication is enabled for the cluster, perform the following steps. If Kerberos authentication is not enabled for the cluster, skip this step. + + In the **Basic Information** area on the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users. For details, see :ref:`Synchronizing IAM Users to MRS `. + + .. note:: + + - In MRS 1.7.2 or earlier, the job management function is unavailable in a cluster with Kerberos authentication enabled. You need to submit a job in the background. + - When the policy of the user group to which the IAM user belongs changes from MRS ReadOnlyAccess to MRS CommonOperations, MRS FullAccess, or MRS Administrator, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** (System Security Services Daemon) cache of cluster nodes needs time to be updated. Then, submit a job. Otherwise, the job may fail to be submitted. + - When the policy of the user group to which the IAM user belongs changes from MRS CommonOperations, MRS FullAccess, or MRS Administrator to MRS ReadOnlyAccess, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** cache of cluster nodes needs time to be updated. + +#. Click the **Jobs** tab. + +#. Click **Create**. The **Create Job** page is displayed. + +#. Configure job information. + + - Set **Type** to **SparkSubmit** if the cluster version is MRS 1.9.2 or later. Configure other parameters of the SparkSubmit job by referring to :ref:`Table 1 `. + - Set **Type** to **Spark**\ if the cluster version is earlier than MRS 1.9.2. Configure other parameters of the Spark job by referring to :ref:`Table 4 `. + + .. _mrs_01_0524__tf38a01bf69f34c29a25317555fc32b92: + + .. table:: **Table 1** Job configuration information + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===========================================================================================================================================================================================================================================================================+ + | Name | Job name. 
It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Path | Path of the program package to be executed. The following requirements must be met: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **obs://**. Example: **obs://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | - For SparkScript and HiveScript, the path must end with **.sql**. For MapReduce, the path must end with **.jar**. For Flink and SparkSubmit, the path must end with **.jar** or **.py**. The **.sql**, **.jar**, and **.py** are case-insensitive. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Parameter | (Optional) Used to configure optimization parameters such as threads, memory, and vCPUs for the job to optimize resource usage and improve job execution performance. | + | | | + | | :ref:`Table 2 ` describes the common parameters of a running program. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameters | (Optional) Key parameter for program execution. The parameter is specified by the function of the user's program. MRS is only responsible for loading the parameter. Multiple parameters are separated by space. | + | | | + | | The parameter contains a maximum of 150,000 characters. It cannot contain special characters ``;|&><'$,`` but can be left blank. | + | | | + | | .. caution:: | + | | | + | | CAUTION: | + | | If you enter a parameter with sensitive information (such as the login password), the parameter may be exposed in the job details display and log printing. Exercise caution when performing this operation. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service Parameter | (Optional) It is used to modify service parameters for the job. The parameter modification applies only to the current job. To make the modification take effect permanently for the cluster, follow instructions in :ref:`Configuring Service Parameters `. 
| + | | | + | | To add multiple parameters, click |image1| on the right. To delete a parameter, click **Delete** on the right. | + | | | + | | :ref:`Table 3 ` lists the common service configuration parameters. | + | | | + | | .. note:: | + | | | + | | If you need to run a long-term job, such as SparkStreaming, and access OBS, you need to use **Service Parameter** to import the AK/SK for accessing OBS. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Command Reference | Command submitted to the background for execution when a job is submitted. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. _mrs_01_0524__table4154959181519: + + .. table:: **Table 2** Program parameters + + +----------------------------------+----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +==================================+==========================================================================================================+===================================================================================================================+ + | --conf | Add the task configuration items. | spark.executor.memory=2G | + +----------------------------------+----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | --driver-memory | Set the running memory of driver. | 2G | + +----------------------------------+----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | --num-executors | Set the number of executors to be started. | 5 | + +----------------------------------+----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | --executor-cores | Set the number of executor cores. | 2 | + +----------------------------------+----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | --class | Set the main class of a task. | org.apache.spark.examples.SparkPi | + +----------------------------------+----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | --files | Upload files to a task. 
The files can be custom configuration files or some data files from OBS or HDFS. | ``-`` | + +----------------------------------+----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | --jars | Upload additional dependency packages of a task to add the external dependency packages to the task. | ``-`` | + +----------------------------------+----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | --executor-memory | Set executor memory. | 2G | + +----------------------------------+----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | --conf spark-yarn.maxAppAttempts | Control the number of AM retries. | If this parameter is set to **0**, retry is not allowed. If this parameter is set to **1**, one retry is allowed. | + +----------------------------------+----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + + .. _mrs_01_0524__table14155459121519: + + .. table:: **Table 3** **Service Parameter** parameters + + +-------------------+----------------------------------------------------+---------------+ + | Parameter | Description | Example Value | + +===================+====================================================+===============+ + | fs.obs.access.key | Key ID for accessing OBS. | ``-`` | + +-------------------+----------------------------------------------------+---------------+ + | fs.obs.secret.key | Key corresponding to the key ID for accessing OBS. | ``-`` | + +-------------------+----------------------------------------------------+---------------+ + + .. _mrs_01_0524__table14159155912157: + + .. table:: **Table 4** Job configuration information + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+==================================================================================================================================================================================================================================================================================================================================================================+ + | Name | Job name. It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. 
| + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Path | Path of the program package to be executed. The following requirements must be met: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. Example: **s3a://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | - For SparkScript, the path must end with **.sql**. For MapReduce and Spark, the path must end with **.jar**. The **.sql** and **.jar** are case-insensitive. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameters | Key parameter for program execution. The parameter is specified by the function of the user's program. MRS is only responsible for loading the parameter. Multiple parameters are separated by space. | + | | | + | | Configuration method: *Package name*.\ *Class name* | + | | | + | | The parameter contains a maximum of 150,000 characters. It cannot contain special characters ``;|&><'$,`` but can be left blank. | + | | | + | | .. note:: | + | | | + | | When entering a parameter containing sensitive information (for example, login password), you can add an at sign (@) before the parameter name to encrypt the parameter value. This prevents the sensitive information from being persisted in plaintext. When you view job information on the MRS console, the sensitive information is displayed as **\***. | + | | | + | | Example: **username=admin @password=admin_123** | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Import From | Path for inputting data | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. 
| + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Export To | Path for outputting data | + | | | + | | .. note:: | + | | | + | | - When setting this parameter, select **OBS** or **HDFS**. Select a file directory or manually enter a file directory, and click **OK**. | + | | - If you add the **hadoop-mapreduce-examples-x.x.x.jar** sample program or a program similar to **hadoop-mapreduce-examples-x.x.x.jar**, enter a directory that does not exist. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Path | Path for storing job logs that record job running status. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Confirm job configuration information and click **OK**. + + After the job is created, you can manage it. + +Submitting a Job in the Background +---------------------------------- + +In MRS 3.x and later versions, the default installation path of the client is /opt/Bigdata/client. In MRS 3.x and earlier versions, the default installation path is /opt/client. For details, see the actual situation. + +#. Create a user for submitting jobs. For details, see :ref:`Creating a User `. + + In this example, a machine-machine user used in the user development scenario has been created, and user groups (**hadoop** and **supergroup**), the primary group (**supergroup**), and role permissions (**System_administrator** and **default**) have been correctly assigned to the user. + +#. .. _mrs_01_0524__li145131249134814: + + Download the authentication credential. + + - For clusters of MRS 3.\ *x* or later, log in to FusionInsight Manager and choose **System** > **Permission** > **User**. In the **Operation** column of the newly created user, choose **More** > **Download Authentication Credential**. 
+ - For clusters whose version is earlier than MRS 3.\ *x*, log in to MRS Manager and choose **System** > **Manage User**. In the **Operation** column of the newly created user, choose **More** > **Download Authentication Credential**. + +#. Upload JAR files related to the job to the cluster. In this example, the sample JAR file built in Spark is used. It is stored in **$SPARK_HOME/examples/jars**. + +#. Upload the authentication credential of the user created in :ref:`2 ` to the **/opt** directory of the cluster and run the following command to decompress the credential: + + **tar -xvf MRSTest\_\ xxxxxx\ \_keytab.tar** + + You will obtain two files: **user.keytab** and **krb5.conf**. + +#. Before performing operations on the cluster, run the following commands: + + **source /opt/Bigdata/client/bigdata_env** + + **cd $SPARK_HOME** + +#. Run the following command to submit the Spark job: + + **./bin/spark-submit --master yarn --deploy-mode client --conf spark.yarn.principal=MRSTest --conf spark.yarn.keytab=/opt/user.keytab --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.3.2-mrs-2.0.jar 10** + + Parameter description: + + a. **--master yarn --deploy-mode client**: runs the job on Yarn and submits it in client mode. + b. **--conf**: configuration items of the Spark job. The authentication file and username are passed in here. + c. **spark.yarn.principal**: the user created in step 1. + d. **spark.yarn.keytab**: the keytab file used for authentication. + e. *xx*\ **.jar**: the JAR file used by the job. + +.. |image1| image:: /_static/images/en-us_image_0000001349137577.png diff --git a/umn/source/managing_clusters/job_management/stopping_a_job.rst b/umn/source/managing_clusters/job_management/stopping_a_job.rst new file mode 100644 index 0000000..3709a42 --- /dev/null +++ b/umn/source/managing_clusters/job_management/stopping_a_job.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_0056.html + +.. _mrs_01_0056: + +Stopping a Job +============== + +This section describes how to stop running MRS jobs. + +Background +---------- + +You cannot stop Spark SQL jobs. After a job is stopped, its status changes to **Terminated** and the job cannot be executed again. + +Procedure +--------- + +#. Log in to the MRS management console. + +#. Choose **Clusters** > **Active Clusters**, select a running cluster, and click its name. + + The cluster details page is displayed. + +#. Click **Jobs**. + +#. Select a running job, and choose **More > Stop** in the **Operation** column. + + The job status changes from **Running** to **Terminated**. diff --git a/umn/source/managing_clusters/job_management/viewing_job_configuration_and_logs.rst b/umn/source/managing_clusters/job_management/viewing_job_configuration_and_logs.rst new file mode 100644 index 0000000..510897a --- /dev/null +++ b/umn/source/managing_clusters/job_management/viewing_job_configuration_and_logs.rst @@ -0,0 +1,40 @@ +:original_name: mrs_01_0055.html + +.. _mrs_01_0055: + +Viewing Job Configuration and Logs +================================== + +This section describes how to view job configuration and logs. + +Background +---------- + +- You can view configuration information of all jobs. + +- You can only view logs of running jobs. + + Because Spark SQL and DistCp jobs do not generate logs in the background, you cannot view logs of running Spark SQL and DistCp jobs. + +Procedure +--------- + +#. Log in to the MRS management console. + +#. Click |image1| in the upper-left corner on the management console and select a region and project. + +#. 
Choose **Clusters** > **Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. + +#. Click **Jobs**. + +#. In the **Operation** column of the job to be viewed, click **View Details**. + + In the **View Details** window that is displayed, configuration of the selected job is displayed. + +#. Select a running job, and click **View Log** in the **Operation** column. + + In the new page that is displayed, real-time log information of the job is displayed. + + Each tenant can submit and view 10 jobs concurrently. + +.. |image1| image:: /_static/images/en-us_image_0000001295738508.png diff --git a/umn/source/managing_clusters/logging_in_to_a_cluster/determining_active_and_standby_management_nodes_of_manager.rst b/umn/source/managing_clusters/logging_in_to_a_cluster/determining_active_and_standby_management_nodes_of_manager.rst new file mode 100644 index 0000000..ff2427e --- /dev/null +++ b/umn/source/managing_clusters/logging_in_to_a_cluster/determining_active_and_standby_management_nodes_of_manager.rst @@ -0,0 +1,55 @@ +:original_name: mrs_01_0086.html + +.. _mrs_01_0086: + +Determining Active and Standby Management Nodes of Manager +========================================================== + +This section describes how to determine the active and standby management nodes of Manager on the Master1 node. + +Background +---------- + +You can log in to other nodes in the cluster from the Master node. After logging in to the Master node, you can determine the active and standby management nodes of Manager and run commands on corresponding management nodes. + +In active/standby mode, a switchover can be implemented between Master1 and Master2. For this reason, Master1 may not be the active management node for Manager. + +Procedure +--------- + +#. Confirm the Master nodes of an MRS cluster. + + a. In the navigation tree of the MRS management console, choose **Clusters > Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. View basic information of the specified cluster. + b. On the **Nodes** tab page, view the node name. The node that contains **master1** in its name is the Master1 node. The node that contains **master2** in its name is the Master2 node. + +#. Determine the active and standby Manager management nodes. + + a. Remotely log in to the Master1 node. For details, see :ref:`Logging In to an ECS `. + + b. Run the following commands to switch the user: + + **sudo su - root** + + **su - omm** + + c. Run the following command to identify the active and standby management nodes: + + For versions earlier than MRS 3.\ *x*, run the **sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh** command. + + For MRS 3.\ *x* or later: Run the **sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh** command. + + In the command output, the node whose **HAActive** is **active** is the active management node (mgtomsdat-sh-3-01-1 in the following example), and the node whose **HAActive** is **standby** is the standby management node (mgtomsdat-sh-3-01-2 in the following example). + + .. code-block:: + + Ha mode + double + NodeName HostName HAVersion StartTime HAActive HAAllResOK HARunPhase + 192-168-0-30 mgtomsdat-sh-3-01-1 V100R001C01 2014-11-18 23:43:02 active normal Actived + 192-168-0-24 mgtomsdat-sh-3-01-2 V100R001C01 2014-11-21 07:14:02 standby normal Deactived + + .. 
note:: + + If the Master1 node to which you have logged in is the standby management node and you need to log in to the active management node, run the following command: + + **ssh** *IP address of Master2 node* diff --git a/umn/source/managing_clusters/logging_in_to_a_cluster/index.rst b/umn/source/managing_clusters/logging_in_to_a_cluster/index.rst new file mode 100644 index 0000000..1a9ce31 --- /dev/null +++ b/umn/source/managing_clusters/logging_in_to_a_cluster/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0082.html + +.. _mrs_01_0082: + +Logging In to a Cluster +======================= + +- :ref:`MRS Cluster Node Overview ` +- :ref:`Logging In to an ECS ` +- :ref:`Determining Active and Standby Management Nodes of Manager ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + mrs_cluster_node_overview + logging_in_to_an_ecs + determining_active_and_standby_management_nodes_of_manager diff --git a/umn/source/managing_clusters/logging_in_to_a_cluster/logging_in_to_an_ecs.rst b/umn/source/managing_clusters/logging_in_to_a_cluster/logging_in_to_an_ecs.rst new file mode 100644 index 0000000..3a3fc1a --- /dev/null +++ b/umn/source/managing_clusters/logging_in_to_a_cluster/logging_in_to_an_ecs.rst @@ -0,0 +1,127 @@ +:original_name: mrs_01_0083.html + +.. _mrs_01_0083: + +Logging In to an ECS +==================== + +This section describes how to remotely log in to an ECS in an MRS cluster using the remote login (VNC mode) function provided on the ECS management console or a key or password (SSH mode). Remote login (VNC mode) is mainly used for emergency O&M. In other scenarios, it is recommended that you log in to the ECS using SSH. + +.. note:: + + To log in to a cluster node using SSH, you need to manually add an inbound rule in the security group of the cluster. The source address is **Client IPv4 address/32** (or **Client IPv6 address/128**) and the port number is **22**. For details, see **Virtual Private Cloud** > **User Guide** > ****Security > Security Group** > **Adding a Security Group Rule****. + +Logging In to an ECS Using VNC +------------------------------ + +#. Log in to the MRS management console. +#. Click |image1| in the upper-left corner on the management console and select a region and project. +#. Choose **Clusters > Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. +#. On the **Nodes** tab page, click the name of a Master node in the Master node group to log in to the ECS management console. +#. In the upper right corner, click **Remote Login**. +#. Perform subsequent operations by referring to `Login Using VNC `__. + +.. _mrs_01_0083__section5513107114: + +Logging In to an ECS Using a Key Pair (SSH) +------------------------------------------- + +**Logging In to the ECS from Local Windows** + +To log in to the Linux ECS from local Windows, perform the operations described in this section. The following procedure uses PuTTY as an example to log in to the ECS. + +#. Log in to the MRS management console. + +#. Click |image2| in the upper-left corner on the management console and select a region and project. + +#. Choose **Clusters** > **Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. + +#. On the **Nodes** tab page, click the name of a Master node in the Master node group to log in to the ECS management console. + +#. Click the **EIPs** tab, click **Bind EIP** to bind an EIP to the ECS, and record the EIP. If an EIP has been bound to the ECS, skip this step. + +#. 
Check whether the private key file has been converted to **.ppk** format. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`7 `. + +#. .. _mrs_01_0083__li1090865924810: + + Run PuTTY. + +#. In the **Actions** area, click **Load** and import the private key file you used during ECS creation. + + Ensure that the private key file is in the format of **All files (*.*)**. + +#. Click **Save private key**. + +#. .. _mrs_01_0083__li499810490191: + + Save the converted private key, for example, **kp-123.ppk**, to a local directory. + +#. .. _mrs_01_0083__li99981049191918: + + Run PuTTY. + +#. Choose **Connection** > **Data**. Enter the image username in **Auto-login username**. + + .. note:: + + The image username for cluster nodes is **root**. + +#. Choose **Connection** > **SSH** > **Auth**. In the last configuration item **Private key file for authentication**, click **Browse** and select the private key converted in :ref:`10 `. + +#. Click **Session**. + + a. **Host Name (or IP address)**: Enter the EIP bound to the ECS. + + b. **Port**: Enter **22**. + + c. **Connection Type**: Select **SSH**. + + d. **Saved Sessions**: Task name, which can be clicked for remote connection when you use PuTTY next time + + + .. figure:: /_static/images/en-us_image_0000001295898104.png + :alt: **Figure 1** Clicking **Session** + + **Figure 1** Clicking **Session** + +#. Click **Open** to log in to the ECS. + + If you log in to the ECS for the first time, PuTTY displays a security warning dialog box, asking you whether to accept the ECS security certificate. Click **Yes** to save the certificate to your local registry. + +**Logging In to the ECS from Local Linux** + +To log in to the Linux ECS from local Linux, perform the operations described in this section. The following procedure uses private key file **kp-123.pem** as an example to log in to the ECS. The name of your private key file may differ. + +#. On the Linux CLI, run the following command to change operation permissions: + + **chmod 400 /path/kp-123.pem** + + .. note:: + + In the preceding command, *path* refers to the path where the key file is saved. + +#. Run the following command to log in to the ECS: + + **ssh -i /path/kp-123.pem**\ *Default username*\ @\ *EIP* + + For example, if the default username is **root** and the EIP is **123.123.123.123**, run the following command: + + ssh -i /*path*/kp-123.pem root@123.123.123.123 + + .. note:: + + - *path* indicates the path where the key file is saved. + - *EIP* indicates the EIP bound to the ECS. + - For cluster nodes of versions earlier than MRS 1.6.2, the image username is **Linux**. + - The image username is **root** for cluster nodes of MRS 1.6.2 or later. + +Changing the OS Keyboard Language +--------------------------------- + +All nodes in the MRS cluster run the Linux OS. For details about how to change the OS keyboard language, see **Getting Started** > **Logging In to an ECS** > **Logging In to an ECS Using VNC** in the *Elastic Cloud Server User Guide*. + +.. |image1| image:: /_static/images/en-us_image_0000001349257245.png +.. |image2| image:: /_static/images/en-us_image_0000001349137661.png diff --git a/umn/source/managing_clusters/logging_in_to_a_cluster/mrs_cluster_node_overview.rst b/umn/source/managing_clusters/logging_in_to_a_cluster/mrs_cluster_node_overview.rst new file mode 100644 index 0000000..46efbf4 --- /dev/null +++ b/umn/source/managing_clusters/logging_in_to_a_cluster/mrs_cluster_node_overview.rst @@ -0,0 +1,46 @@ +:original_name: mrs_01_0081.html + +.. 
_mrs_01_0081: + +MRS Cluster Node Overview +========================= + +This section describes remote login, MRS cluster node types, and node functions. + +MRS cluster nodes support remote login. The following remote login methods are available: + +- GUI login: Use the remote login function provided by the ECS management console to log in to the Linux interface of the Master node in the cluster. For details, see :ref:`Logging In to an ECS `. + +- SSH login: Applies to Linux ECSs only. You can use a remote login tool (such as PuTTY) to log in to an ECS. The ECS must have a bound EIP. + + For details about how to apply for and bind EIP for the Master node, see **Virtual Private Cloud** > **User Guide** > **Elastic IP** > **Assigning an EIP and Binding It to an ECS**. + + You can log in to a Linux ECS using either a key pair or password. + + .. important:: + + If you use a key pair to access a node in a cluster of MRS 1.6.2 earlier, you need to log in to the node as a Linux user. For details, see :ref:`Logging In to an ECS Using a Key Pair (SSH) `. + + If you use a key pair to access a node in a cluster of MRS 1.6.2 or later, you need to log in to the node as user **root**. For details, see :ref:`Logging In to an ECS Using a Key Pair (SSH) `. + +In an MRS cluster, a node is an ECS. :ref:`Table 1 ` describes the node types and node functions. + +.. _mrs_01_0081__table1615555733016: + +.. table:: **Table 1** Cluster node types + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Node Type | Functions | + +===================================+====================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Master node | Management node of an MRS cluster. It manages and monitors the cluster. In the navigation tree of the MRS management console, choose **Clusters** > **Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. On the **Nodes** tab page, view the **Name**. The node that contains **master1** in its name is the Master1 node. The node that contains **master2** in its name is the Master2 node. | + | | | + | | You can log in to a Master node either using VNC on the ECS management console or using SSH. After logging in to the Master node, you can access Core nodes without entering passwords. | + | | | + | | The system automatically deploys the Master nodes in active/standby mode and supports the high availability (HA) feature for MRS cluster management. If the active management node fails, the standby management node switches to the active state and takes over services. | + | | | + | | To determine whether the Master1 node is the active management node, see :ref:`Determining Active and Standby Management Nodes of Manager `. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Core node | Work node of an MRS cluster. It processes and analyzes data and stores process data. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Task node | Compute node. It is used for auto scaling when the computing resources in a cluster are insufficient. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/managing_clusters/managing_nodes/canceling_host_isolation.rst b/umn/source/managing_clusters/managing_nodes/canceling_host_isolation.rst new file mode 100644 index 0000000..234ea20 --- /dev/null +++ b/umn/source/managing_clusters/managing_nodes/canceling_host_isolation.rst @@ -0,0 +1,39 @@ +:original_name: mrs_01_0213.html + +.. _mrs_01_0213: + +Canceling Host Isolation +======================== + +Scenario +-------- + +After the exception or fault of a host is handled, you must cancel the isolation of the host for proper usage. + +You can cancel the isolation of a host on MRS. + +Prerequisites +------------- + +- The host is in the **Isolated** state. +- The exception or fault of the host has been rectified. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Nodes**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Canceling Host Isolation `. + +#. Unfold the node group information and select the check box of the target host that you want to cancel its isolation. + +#. Choose **Node Operation** > **Cancel Host Isolation**. + +#. Confirm the information about the host for which the isolation is to be cancelled and click **OK**. + + When **Operation successful** is displayed, click **Finish**. The host is de-isolated successfully, and the value of **Operating Status** becomes **Normal**. + +#. Select the host that has been de-isolated and choose **Node Operation** > **Start All Roles**. 
diff --git a/umn/source/managing_clusters/managing_nodes/index.rst b/umn/source/managing_clusters/managing_nodes/index.rst new file mode 100644 index 0000000..db93dc8 --- /dev/null +++ b/umn/source/managing_clusters/managing_nodes/index.rst @@ -0,0 +1,22 @@ +:original_name: mrs_01_24296.html + +.. _mrs_01_24296: + +Managing Nodes +============== + +- :ref:`Manually Scaling Out a Cluster ` +- :ref:`Manually Scaling In a Cluster ` +- :ref:`Managing a Host (Node) ` +- :ref:`Isolating a Host ` +- :ref:`Canceling Host Isolation ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + manually_scaling_out_a_cluster + manually_scaling_in_a_cluster + managing_a_host_node + isolating_a_host + canceling_host_isolation diff --git a/umn/source/managing_clusters/managing_nodes/isolating_a_host.rst b/umn/source/managing_clusters/managing_nodes/isolating_a_host.rst new file mode 100644 index 0000000..eca5157 --- /dev/null +++ b/umn/source/managing_clusters/managing_nodes/isolating_a_host.rst @@ -0,0 +1,45 @@ +:original_name: mrs_01_0212.html + +.. _mrs_01_0212: + +Isolating a Host +================ + +Scenario +-------- + +If a host is found to be abnormal or faulty, affecting cluster performance or preventing services from being provided, you can temporarily exclude that host from the available nodes in the cluster. In this way, the client can access other available nodes. In scenarios where patches are to be installed in a cluster, you can also exclude a specified node from patch installation. + +You can isolate a host manually on MRS based on the actual service requirements or O&M plan. Only non-management nodes can be isolated. + +Impact on the System +-------------------- + +- After a host is isolated, all role instances on the host will be stopped. You cannot start, stop, or configure the host and any instances on the host. +- After a host is isolated, statistics of the monitoring status and indicator data of the host hardware and instances cannot be collected or displayed. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Nodes**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Isolating a Host `. + +#. Unfold the node group information and select the check box of the target host. + +#. Choose **Node Operation** > **Isolate Host**. + +#. Confirm the information about the host to be isolated and click **OK**. + + When **Operation successful** is displayed, click **Finish**. The host is isolated successfully, and the value of **Operating Status** becomes **Isolated**. + + .. note:: + + For isolated hosts, you can cancel the isolation and add them to the cluster again. For details, see :ref:`Canceling Host Isolation `. diff --git a/umn/source/managing_clusters/managing_nodes/managing_a_host_node.rst b/umn/source/managing_clusters/managing_nodes/managing_a_host_node.rst new file mode 100644 index 0000000..50e8985 --- /dev/null +++ b/umn/source/managing_clusters/managing_nodes/managing_a_host_node.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_0211.html + +.. _mrs_01_0211: + +Managing a Host (Node) +====================== + +Scenario +-------- + +To check an abnormal or faulty host (node), you need to stop all host roles on MRS. To recover host services after the host fault is rectified, restart all roles. + +Prerequisites +------------- + +You have synchronized IAM users. 
(On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Nodes**. + + .. note:: + + For versions earlier than MRS 1.7.2, see :ref:`Managing a Host `. + +#. Unfold the node group information and select the check box of the target node. +#. Choose **Node Operation** > **Start All Roles** or **Stop All Roles** to perform the required operation. diff --git a/umn/source/managing_clusters/managing_nodes/manually_scaling_in_a_cluster.rst b/umn/source/managing_clusters/managing_nodes/manually_scaling_in_a_cluster.rst new file mode 100644 index 0000000..8785953 --- /dev/null +++ b/umn/source/managing_clusters/managing_nodes/manually_scaling_in_a_cluster.rst @@ -0,0 +1,90 @@ +:original_name: mrs_01_0060.html + +.. _mrs_01_0060: + +Manually Scaling In a Cluster +============================= + +You can reduce the number of core or task nodes to scale in a cluster based on service requirements so that MRS delivers better storage and computing capabilities at lower O&M costs. + +The scale-in operation is not allowed for a cluster that is performing active/standby synchronization. + +Background +---------- + +A cluster can have three types of nodes, master, core, and task nodes. Currently, only core and task nodes can be removed. To scale in a cluster, you only need to adjust the number of nodes on the MRS console. MRS then automatically selects the nodes to be removed. + +The policies for MRS to automatically select nodes are as follows: + +- MRS does not select the nodes with basic components installed, such as ZooKeeper, DBService, KrbServer, and LdapServer, because these basic components are the basis for the cluster to run. + +- Core nodes store cluster service data. When scaling in a cluster, ensure that all data on the core nodes to be removed has been migrated to other nodes. You can perform follow-up scale-in operations only after all component services are decommissioned, for example, removing nodes from Manager and deleting ECSs. When selecting core nodes, MRS preferentially selects the nodes with a small amount of data and healthy instances to be decommissioned to prevent decommissioning failures. For example, if DataNodes are installed on core nodes in an analysis cluster, MRS preferentially selects the nodes with small data volume and good health status during scale-in. + + When core nodes are removed, their data is migrated to other nodes. If the user business has cached the data storage path, the client will automatically update the path, which may increase the service processing latency temporarily. Cluster scale-in may slow the response of the first access to some HBase on HDFS data. You can restart HBase or disable or enable related tables to resolve this issue. + +- Task nodes are computing nodes and do not store cluster data. Data migration is not involved in removing task nodes. Therefore, when selecting task nodes, MRS preferentially selects nodes whose health status is faulty, unknown, or subhealthy. On the **Components** tab of the MRS console, click a service and then the **Instances** tab to view the health status of the node instances. + +Scale-In Verification Policy +---------------------------- + +To prevent component decommissioning failures, components provide different decommissioning constraints. Scale-in is allowed only when the constraints of all installed components are met. :ref:`Table 1 ` describes the scale-in verification policies. + +.. 
_mrs_01_0060__table53894796105039: + +.. table:: **Table 1** Decommissioning constraints + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Component | Constraint | + +===================================+===============================================================================================================================================================================================+ + | HDFS/DataNode | The number of available nodes after the scale-in is greater than or equal to the number of HDFS copies and the total HDFS data volume does not exceed 80% of the total HDFS cluster capacity. | + | | | + | | This ensures that the remaining space is sufficient for storing existing data after the scale-in and reserves some space for future use. | + | | | + | | .. note:: | + | | | + | | To ensure data reliability, one backup is automatically generated for each file saved in HDFS, that is, two copies are generated in total. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | HBase/RegionServer | The total available memory of RegionServers on all nodes except the nodes to be removed is greater than 1.2 times of the memory which is currently used by RegionServers on these nodes. | + | | | + | | This ensures that the node to which the region on a decommissioned node is migrated has sufficient memory to bear the region of the decommissioned node. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storm/ Supervisor | After the scale-in, ensure that the number of slots in the cluster is sufficient for running the submitted tasks. | + | | | + | | This prevents no sufficient resources being available for running the stream processing tasks after the scale-in. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Flume/FlumeServer | If FlumeServer is installed on a node and Flume tasks have been configured for the node, the node cannot be deleted. | + | | | + | | This prevents the deployed service program from being deleted by mistake. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Scaling In a Cluster by Specifying the Node Quantity +---------------------------------------------------- + +#. Log in to the MRS console. + +#. Click |image1| in the upper-left corner on the console and select a region and project. + +#. Choose **Clusters** > **Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. + +#. Click the **Nodes** tab. In the **Operation** column of the node group, click **Scale In** to go to the **Scale In** page. + + This operation can be performed only when the cluster and all nodes in it are running. + +#. 
Set **Scale-In Nodes** and click **OK**. + + .. note:: + + - Before scaling in the cluster, check whether its security group configuration is correct. Ensure that an inbound security group rule contains a rule in which **Protocol & Port** is set to **All**, and **Source** is set to a trusted accessible IP address range. + - If damaged data blocks exist in HDFS, the cluster may fail to be scaled in. Contact technical support. + +#. A dialog box displayed in the upper right corner of the page indicates that the node removal task has been submitted successfully. + + The cluster scale-in process is explained as follows: + + - During scale-in: The cluster status is **Scaling In**. The submitted jobs will be executed, and you can submit new jobs. You are not allowed to continue to scale in or terminate the cluster. You are advised not to restart the cluster or modify the cluster configuration. + - Successful scale-in: The cluster status is **Running**. + - Failed scale-in: The cluster status is **Running**. You can execute jobs or scale in the cluster again. + + After the cluster is scaled in, you can view the node information of the cluster on the **Nodes** page. + +.. |image1| image:: /_static/images/en-us_image_0000001349137813.png diff --git a/umn/source/managing_clusters/managing_nodes/manually_scaling_out_a_cluster.rst b/umn/source/managing_clusters/managing_nodes/manually_scaling_out_a_cluster.rst new file mode 100644 index 0000000..40f5725 --- /dev/null +++ b/umn/source/managing_clusters/managing_nodes/manually_scaling_out_a_cluster.rst @@ -0,0 +1,101 @@ +:original_name: mrs_01_0041.html + +.. _mrs_01_0041: + +Manually Scaling Out a Cluster +============================== + +The storage and computing capabilities of MRS can be improved simply by adding Core or Task nodes, without modifying the system architecture, which reduces O&M costs. Core nodes can process and store data. You can add Core nodes to increase the number of nodes for handling peak loads. Task nodes are used for computing and do not store persistent data. + +Background +---------- + +The MRS cluster supports a maximum of 500 Core and Task nodes. If more than 500 Core/Task nodes are required, contact technical support engineers or invoke a background interface to modify the database. + +Core and Task nodes can be added; Master nodes cannot. The maximum number of Core/Task nodes that can be added is 500 minus the current number of Core/Task nodes. For example, if the cluster currently has 3 Core nodes, at most 497 Core nodes can be added. If the cluster scale-out fails, you can add nodes to the cluster again. + +If no node is added during cluster creation, you can specify the number of nodes to be added during scale-out. However, you cannot specify the nodes to be added. + +The operations for scaling out a cluster vary depending on the selected version. + +Procedure +--------- + +#. Log in to the MRS console. + +#. Click |image1| in the upper-left corner on the management console and select a region and project. + +#. Choose **Clusters** > **Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. + +#. Click the **Nodes** tab. In the **Operation** column of the node group, click **Scale Out**. The **Scale Out** page is displayed. + + The scale-out operation can be performed only on running clusters. + +#. Set **Scaled Out Nodes** and **Run Bootstrap Action**, and click **OK**. + + ..
note:: + + - If the Task node group does not exist in the cluster, configure the Task node by referring to :ref:`Adding a Task Node `. + - If a bootstrap action is added during cluster creation, the **Run Bootstrap Action** parameter is valid. If this function is enabled, the bootstrap actions added during cluster creation will be run on all the scaled out nodes. + - If the **New Specifications** parameter is available, the specifications that are the same as those of the original nodes have been sold out or discontinued. Nodes with new specifications will be added. + - Before scaling out the cluster, check whether its security group configuration is correct. Ensure that an inbound security group rule contains a rule in which **Protocol & Port** is set to **All**, and **Source** is set to a trusted accessible IP address range. + +#. In the **Scale Out Node** dialog box, click **OK**. + +#. A dialog box is displayed, indicating that the scale-out task is submitted successfully. + + The following parameters explain the cluster scale-out process: + + - Expanding: If a cluster is being expanded, its status is **Scaling out**. The submitted jobs will be executed and you can submit new jobs. You are not allowed to continue to scale out, or delete the cluster. You are advised not to restart the cluster or modify the cluster configuration. + - Expansion succeeded: If a cluster is expanded successfully, its status is **Running**. + - Failed scale-out: The cluster status is **Running** when the cluster scale-out failed. You can execute jobs and scale out the cluster again. + + After the cluster is scaled out, you can view the node information of the cluster on the **Nodes** page. + +.. _mrs_01_0041__section1077318341361: + +Adding a Task Node +------------------ + +To add a task node to a custom cluster, perform the following steps: + +#. On the cluster details page, click the **Nodes** tab and click **Add Node Group**. The **Add Node Group** page is displayed. +#. Select **NM** for **Deploy Roles** and set other parameters as required. + +To add a task node to a non-custom cluster, perform the following steps: + +#. On the cluster details page, click the **Nodes** tab and click **Configure Task Node**. The **Configure Task Node** page is displayed. +#. On the **Configure Task Node** page, set **Node Type**, **Instance Specifications**, **Nodes**, **System Disk**. In addition, if **Add Data Disk** is enabled, configure the storage type, size, and number of data disks. +#. Click **OK**. + +.. _mrs_01_0041__section8614439391: + +Adding a Node Group +------------------- + +.. note:: + + Used to add node groups and applies to customized clusters of MRS 3.\ *x*. + +#. On the cluster details page, click the **Nodes** tab and click **Add Node Group**. The **Add Node Group** page is displayed. +#. Set the parameters as needed. + + .. table:: **Table 1** Parameters for adding a node group + + +-------------------------+------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=========================+================================================================================================+ + | Instance Specifications | Select the flavor type of the hosts in the node group. | + +-------------------------+------------------------------------------------------------------------------------------------+ + | Nodes | Set the number of nodes in the node group. 
| + +-------------------------+------------------------------------------------------------------------------------------------+ + | System Disk | Set the specifications and capacity of the system disk on the new node. | + +-------------------------+------------------------------------------------------------------------------------------------+ + | Data Disk (GB) | Set the specifications, capacity, and number of data disks of the new node. | + +-------------------------+------------------------------------------------------------------------------------------------+ + | Deploy Roles | Deploy the instances of each node in the new node group. The setting can be manually adjusted. | + +-------------------------+------------------------------------------------------------------------------------------------+ + +#. Click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0000001349257269.png diff --git a/umn/source/managing_clusters/patch_management/index.rst b/umn/source/managing_clusters/patch_management/index.rst new file mode 100644 index 0000000..11e0545 --- /dev/null +++ b/umn/source/managing_clusters/patch_management/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_0409.html + +.. _mrs_01_0409: + +Patch Management +================ + +- :ref:`Patch Operation Guide for Versions Earlier than MRS 1.7.0 ` +- :ref:`Patch Operation Guide for Versions from MRS 1.7.0 to MRS 2.1.0 ` +- :ref:`Rolling Patches ` +- :ref:`Restoring Patches for the Isolated Hosts ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + patch_operation_guide_for_versions_earlier_than_mrs_1.7.0 + patch_operation_guide_for_versions_from_mrs_1.7.0_to_mrs_2.1.0 + rolling_patches + restoring_patches_for_the_isolated_hosts diff --git a/umn/source/managing_clusters/patch_management/patch_operation_guide_for_versions_earlier_than_mrs_1.7.0.rst b/umn/source/managing_clusters/patch_management/patch_operation_guide_for_versions_earlier_than_mrs_1.7.0.rst new file mode 100644 index 0000000..567c588 --- /dev/null +++ b/umn/source/managing_clusters/patch_management/patch_operation_guide_for_versions_earlier_than_mrs_1.7.0.rst @@ -0,0 +1,63 @@ +:original_name: mrs_01_0410.html + +.. _mrs_01_0410: + +Patch Operation Guide for Versions Earlier than MRS 1.7.0 +========================================================= + +If you obtain patch information from the following sources, upgrade the patch according to actual requirements. + +- You obtain information about the patch released by MRS from a message pushed by the message center service. +- You obtain information about the patch by accessing the cluster and viewing patch information. + +Preparing for Patch Installation +-------------------------------- + +- Follow instructions in :ref:`Performing a Health Check ` to check cluster status. If the cluster health status is normal, install a patch. +- The administrator has uploaded the cluster patch package to the server. For details, see :ref:`Uploading a Patch Package `. +- You need to confirm the target patch to be installed according to the patch information in the patch content. + +.. _mrs_01_0410__section63677183610: + +Uploading a Patch Package +------------------------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. +#. Choose **System > Manage Patch**. The **Manage Patch** page is displayed. +#. Click **Upload Patch** and set the following parameters. 
+ + - **Patch File Path**: folder created in the OBS file system where the patch package is stored, for example, **MRS_1.6.2/MRS_1_6_2_11.tar.gz** + - **Parallel File System Name**: name of the OBS file system that stores patch packages, for example, **mrs_patch**. + + .. note:: + + You can obtain the file system name and patch file path on the **Patch Information** tab page. The value of the **Patch Path** is in the following format: *[File system name]*\ **/**\ *[Patch file path]*. + + - **AK**: For details, see **My Credential** > **Access Keys**. + - **SK**: For details, see **My Credential** > **Access Keys**. + +#. Click **OK** to upload the patch. + +Installing a Patch +------------------ + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. +#. Choose **System > Manage Patch**. The **Manage Patch** page is displayed. +#. In the **Operation** column, click **Install**. +#. In the displayed dialog box, click **OK** to install the patch. +#. After the patch is installed, you can view the installation status in the progress bar. If the installation fails, contact the administrator. + + .. note:: + + For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. + +Uninstalling a Patch +-------------------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. +#. Choose **System > Manage Patch**. The **Manage Patch** page is displayed. +#. In the **Operation** column, click **Uninstall**. + + .. note:: + + For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. diff --git a/umn/source/managing_clusters/patch_management/patch_operation_guide_for_versions_from_mrs_1.7.0_to_mrs_2.1.0.rst b/umn/source/managing_clusters/patch_management/patch_operation_guide_for_versions_from_mrs_1.7.0_to_mrs_2.1.0.rst new file mode 100644 index 0000000..0efc79c --- /dev/null +++ b/umn/source/managing_clusters/patch_management/patch_operation_guide_for_versions_from_mrs_1.7.0_to_mrs_2.1.0.rst @@ -0,0 +1,41 @@ +:original_name: mrs_01_0411.html + +.. _mrs_01_0411: + +Patch Operation Guide for Versions from MRS 1.7.0 to MRS 2.1.0 +============================================================== + +If you obtain patch information from the following sources, upgrade the patch according to actual requirements. + +- You obtain information about the patch released by MRS from a message pushed by the message center service. +- You obtain information about the patch by accessing the cluster and viewing patch information. + +Preparing for Patch Installation +-------------------------------- + +- Follow instructions in :ref:`Performing a Health Check ` to check cluster status. If the cluster health status is normal, install a patch. +- You need to confirm the target patch to be installed according to the patch information in the patch content. + +Installing a Patch +------------------ + +#. Log in to the MRS console. +#. Choose **Clusters > Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's basic information. +#. On the **Patches** tab page, click **Install** in the **Operation** column to install the target patch. + + .. note:: + + - Clusters of versions later than MRS 1.7.2 support rolling patch installation. For details, see :ref:`Rolling Patches `. 
+ - For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. + +Uninstalling a Patch +-------------------- + +#. Log in to the MRS console. +#. Choose **Clusters > Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's basic information. +#. On the **Patches** page, click **Uninstall** in the **Operation** column to uninstall the target patch. + + .. note:: + + - Clusters of versions later than MRS 1.7.2 support rolling patch installation. For details, see :ref:`Rolling Patches `. + - For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. diff --git a/umn/source/managing_clusters/patch_management/restoring_patches_for_the_isolated_hosts.rst b/umn/source/managing_clusters/patch_management/restoring_patches_for_the_isolated_hosts.rst new file mode 100644 index 0000000..2cfcc8c --- /dev/null +++ b/umn/source/managing_clusters/patch_management/restoring_patches_for_the_isolated_hosts.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0412.html + +.. _mrs_01_0412: + +Restoring Patches for the Isolated Hosts +======================================== + +If some hosts in a cluster are isolated, perform the following operations to restore patches on these isolated hosts after the patch has been installed on the other hosts in the cluster. After the restoration, the patch versions of the isolated hosts are consistent with those of the hosts that are not isolated. + +.. note:: + + In **MRS 3.x**, you cannot perform operations in this section on the management console. + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager (MRS 2.1.0 or Earlier) `. +#. Choose **System > Manage Patch**. The **Manage Patch** page is displayed. +#. In the **Operation** column, click **View Details**. +#. On the patch details page, select host nodes whose **Status** is **Isolated**. +#. Click **Select and Restore** to restore the isolated host nodes. diff --git a/umn/source/managing_clusters/patch_management/rolling_patches.rst b/umn/source/managing_clusters/patch_management/rolling_patches.rst new file mode 100644 index 0000000..3742de7 --- /dev/null +++ b/umn/source/managing_clusters/patch_management/rolling_patches.rst @@ -0,0 +1,105 @@ +:original_name: mrs_01_0431.html + +.. _mrs_01_0431: + +Rolling Patches +=============== + +With the rolling patch function, patches are installed or uninstalled for one or more services in a cluster through a rolling service restart (restarting services or instances in batches), without interrupting services or with only a minimal service interruption window. Services in a cluster fall into the following three types based on whether they support rolling patching: + +- Services supporting rolling patch installation or uninstallation: All or some workloads of the services (depending on the service) are not interrupted during patch installation or uninstallation. + - Services not supporting rolling patch installation or uninstallation: Workloads of the services are interrupted during patch installation or uninstallation. + - Services with some roles supporting rolling patch installation or uninstallation: Some workloads of the services are not interrupted during patch installation or uninstallation. + +.. note:: + + In **MRS 3.x**, you cannot perform operations in this section on the management console. 
+ +:ref:`Table 1 ` provides services and instances that support or do not support rolling restart in the MRS cluster. + +.. _mrs_01_0431__table054720341161: + +.. table:: **Table 1** Services and instances that support or do not support rolling restart + + ========= ================ ================================== + Service Instance Whether to Support Rolling Restart + ========= ================ ================================== + HDFS NameNode Yes + \ Zkfc + \ JournalNode + \ HttpFS + \ DataNode + Yarn ResourceManager Yes + \ NodeManager + Hive MetaStore Yes + \ WebHCat + \ HiveServer + MapReduce JobHistoryServer Yes + HBase HMaster Yes + \ RegionServer + \ ThriftServer + \ RESTServer + Spark JobHistory Yes + \ JDBCServer + \ SparkResource No + Hue Hue No + Tez TezUI No + Loader Sqoop No + Zookeeper Quorumpeer Yes + Kafka Broker Yes + \ MirrorMaker No + Flume Flume Yes + \ MonitorServer + Storm Nimbus Yes + \ UI + \ Supervisor + \ LogViewer + ========= ================ ================================== + +Installing a Patch +------------------ + +#. Log in to the MRS console. +#. Choose **Clusters** > **Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's basic information. +#. On the **Patches** page, click **Install** in the **Operation** column. +#. On the **Warning** page, enable or disable **Rolling Patch**. + + .. note:: + + - Enabling the rolling patch installation function: Services are not stopped before patch installation, and rolling service restart is performed after the patch installation. This minimizes the impact on cluster services but takes more time than common patch installation. + - Disabling the rolling patch uninstallation function: All services are stopped before patch uninstallation, and all services are restarted after the patch uninstallation. This temporarily interrupts the cluster and the services but takes less time than rolling patch uninstallation. + - The rolling patch installation function is not available in clusters with less than two Master nodes and three Core nodes. + +#. Click **Yes** to install the target patch. +#. View the patch installation progress. + + a. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + b. Choose **System** > **Manage Patch**. On the **Manage Patch** page, you can view the patch installation progress. + + .. note:: + + For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. + +Uninstalling a Patch +-------------------- + +#. Log in to the MRS console. +#. Choose **Clusters** > **Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's basic information. +#. On the **Patches** page, click **Uninstall** in the **Operation** column. +#. On the **Warning** page, enable or disable **Rolling Patch**. + + .. note:: + + - Enabling the rolling patch uninstallation function: Services are not stopped before patch uninstallation, and rolling service restart is performed after the patch uninstallation. This minimizes the impact on cluster services but takes more time than common patch uninstallation. + - Disabling the rolling patch uninstallation function: All services are stopped before patch uninstallation, and all services are restarted after the patch uninstallation. This temporarily interrupts the cluster and the services but takes less time than rolling patch uninstallation. 
+ - Only patches that are installed in rolling mode can be uninstalled in the same mode. + +#. Click **Yes** to uninstall the target patch. +#. View the patch uninstallation progress. + + a. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + b. Choose **System** > **Manage Patch**. On the **Manage Patch** page, you can view the patch uninstallation progress. + + .. note:: + + For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. diff --git a/umn/source/managing_clusters/tenant_management/before_you_start.rst b/umn/source/managing_clusters/tenant_management/before_you_start.rst new file mode 100644 index 0000000..1cdc0b3 --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/before_you_start.rst @@ -0,0 +1,12 @@ +:original_name: mrs_01_0604.html + +.. _mrs_01_0604: + +Before You Start +================ + +This section describes how to manage tenants on the MRS console. + +Tenant management operations on the console apply only to clusters of versions earlier than MRS 3.x. + +Tenant management operations on FusionInsight Manager apply to all versions. For MRS 3.x and later versions, see :ref:`Overview `. For versions earlier than MRS 3.x, see :ref:`Overview `. diff --git a/umn/source/managing_clusters/tenant_management/clearing_configuration_of_a_queue.rst b/umn/source/managing_clusters/tenant_management/clearing_configuration_of_a_queue.rst new file mode 100644 index 0000000..f9febf1 --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/clearing_configuration_of_a_queue.rst @@ -0,0 +1,38 @@ +:original_name: mrs_01_0315.html + +.. _mrs_01_0315: + +Clearing Configuration of a Queue +================================= + +Scenario +-------- + +Users can clear the configuration of a queue on MRS Manager when the queue does not need resources from a resource pool or if a resource pool needs to be disassociated from the queue. Clearing queue configurations means that the resource capacity policy of the queue is canceled. + +Prerequisites +------------- + +- If a queue is to be unbound from a resource pool, this resource pool cannot serve as the default resource pool of the queue. Therefore, you must first change the default resource pool of the queue to another one. For details, see :ref:`Configuring a Queue `. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Clearing Configuration of a Queue `. For MRS 3.x or later, see :ref:`Overview `. + +#. Click the **Resource Distribution Policies** tab. + +#. In **Resource Pools**, select a specified resource pool. + +#. Locate the specified queue in the **Resource Allocation** table, and click **Clear** in the **Operation** column + + In the **Clear Queue Configuration** dialog box, click **OK** to clear the queue configuration in the current resource pool. + + .. note:: + + If no resource capacity policy is configured for a queue, the clearing function is unavailable for the queue by default. 
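+ +    After the queue configuration is cleared, you can optionally check the effective queue settings from a cluster client by using the YARN command line. The commands below are a minimal sketch; they assume that the client is installed in **/opt/Bigdata/client** and that the tenant queue is named **ta1** (both are example values you must replace with your own), and that you have run **kinit** first in a security-enabled cluster: + +    .. code-block:: + +       source /opt/Bigdata/client/bigdata_env +       # Show the state, capacity, and maximum capacity of the queue +       yarn queue -status ta1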
diff --git a/umn/source/managing_clusters/tenant_management/configuring_a_queue.rst b/umn/source/managing_clusters/tenant_management/configuring_a_queue.rst new file mode 100644 index 0000000..16254d3 --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/configuring_a_queue.rst @@ -0,0 +1,93 @@ +:original_name: mrs_01_0313.html + +.. _mrs_01_0313: + +Configuring a Queue +=================== + +Scenario +-------- + +You can modify the queue configuration of a specified tenant on MRS based on service requirements. + +Prerequisites +------------- + +- A tenant associated with Yarn and allocated dynamic resources has been added. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Configuring a Queue `. For MRS 3.x or later, see :ref:`Overview `. + +#. Click the **Queue Configuration** tab. + +#. In the tenant queue table, click **Modify** in the **Operation** column of the specified tenant queue. + + .. note:: + + - In the tenant list on the left of the **Tenant Management** tab, click the target tenant. In the window that is displayed, choose **Resource**. On the page that is displayed, click |image1| to open the queue modification page. + - A queue can be bound to only one non-default resource pool. + + Versions earlier than MRS 3.x: + + .. table:: **Table 1** Queue configuration parameters + + +-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=======================================================+=================================================================================================================================================================================================================================================================+ + | Maximum Applications | Specifies the maximum number of applications. The value ranges from 1 to 2147483647. | + +-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Maximum AM Resource Percent | Specifies the maximum percentage of resources that can be used to run the ApplicationMaster in a cluster. The value ranges from 0 to 1. | + +-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Minimum User Limit Percent (%) | Specifies the minimum percentage of resources consumed by a user. The value ranges from 0 to 100. 
| + +-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | User Limit Factor | Specifies the limit factor of the maximum user resource usage. The maximum user resource usage percentage can be obtained by multiplying the limit factor with the percentage of the tenant's actual resource usage in the cluster. The minimum value is **0**. | + +-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Specifies the current status of a resource plan. The values are **Running** and **Stopped**. | + +-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool (Default Node Label Expression) | Specifies the resource pool used by a queue. The default value is **default**. If you want to change the resource pool, configure the queue capacity first. For details, see :ref:`Configuring the Queue Capacity Policy of a Resource Pool `. | + +-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + MRS 3.x or later: + + .. table:: **Table 2** Queue configuration parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=======================================================================================================================================================================================================================================================================================================================================+ + | Max Master Shares (%) | Indicates the maximum percentage of resources occupied by all ApplicationMasters in the current queue. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Max Allocated vCores | Indicates the maximum number of cores that can be allocated to a single YARN container in the current queue. The default value is **-1**, indicating that the number of cores is not limited within the value range. 
| + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Max Allocated Memory (MB) | Indicates the maximum memory that can be allocated to a single Yarn container in the current queue. The default value is **-1**, indicating that the memory is not limited within the value range. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Max Running Apps | Maximum number of tasks that can be executed at the same time in the current queue. The default value is **-1**, indicating that the number is not limited within the value range (the meaning is the same if the value is empty). The value 0 indicates that the task cannot be executed. The value ranges from -1 to 2147483647. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Max Running Apps per User | Maximum number of tasks that can be executed by each user in the current queue at the same time. The default value is **-1**, indicating that the number is not limited within the value range. If the value is **0**, the task cannot be executed. The value ranges from -1 to 2147483647. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Max Pending Apps | Maximum number of tasks that can be suspended at the same time in the current queue. The default value is **-1**, indicating that the number is not limited within the value range (the meaning is the same if the value is empty). The value **0** indicates that tasks cannot be suspended. The value ranges from -1 to 2147483647. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Resource Allocation Rule | Indicates the rule for allocating resources to different tasks of a user. The rule can be FIFO or FAIR. | + | | | + | | If a user submits multiple tasks in the current queue and the rule is FIFO, the tasks are executed one by one in sequential order. If the rule is FAIR, resources are evenly allocated to all tasks. 
| + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Label | Indicates that tasks are executed on a node with a specified resource label. | + | | | + | | .. note:: | + | | | + | | If you need to use a new resource pool, change the default label to the new resource pool label. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Active | - **ACTIVE**: indicates that the current queue can receive and execute tasks. | + | | - **INACTIVE**: indicates that the current queue can receive but cannot execute tasks. Tasks submitted to the queue are suspended. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Open | - **OPEN**: indicates that the current queue is opened. | + | | - **CLOSED**: indicates that the current queue is closed. Tasks submitted to the queue are rejected. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. |image1| image:: /_static/images/en-us_image_0000001296057872.png diff --git a/umn/source/managing_clusters/tenant_management/configuring_the_queue_capacity_policy_of_a_resource_pool.rst b/umn/source/managing_clusters/tenant_management/configuring_the_queue_capacity_policy_of_a_resource_pool.rst new file mode 100644 index 0000000..b4fdd42 --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/configuring_the_queue_capacity_policy_of_a_resource_pool.rst @@ -0,0 +1,44 @@ +:original_name: mrs_01_0314.html + +.. _mrs_01_0314: + +Configuring the Queue Capacity Policy of a Resource Pool +======================================================== + +Scenario +-------- + +After a resource pool is added, the capacity policies of available resources need to be configured for Yarn task queues. This ensures that tasks in the resource pool are running properly. Each queue can be configured with the queue capacity policy of only one resource pool. Users can view the queues in any resource pool and configure queue capacity policies. After the queue policies are configured, Yarn task queues and resource pools are associated. + +You can configure queue policies on MRS. + +Prerequisites +------------- + +- A resource pool has been added. +- The task queues are not associated with other resource pools. By default, all queues are associated with the **default** resource pool. 
+- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Configuring the Queue Capacity Policy of a Resource Pool `. For MRS 3.x or later, see :ref:`Overview `. + +#. Click the **Resource Distribution Policies** tab. + +#. In **Resource Pools**, select a specified resource pool. + + **Available Resource Quota**: indicates that all resources in each resource pool are available for queues by default. + +#. Locate the specified queue in the **Resource Allocation** table, and click **Modify** in the **Operation** column. + +#. In **Modify Resource Allocation**, configure the resource capacity policy of the task queue in the resource pool. + + - **Capacity (%)**: specifies the percentage of the current tenant's computing resource usage. + - **Maximum Capacity (%)**: specifies the percentage of the current tenant's maximum computing resource usage. + +#. Click **OK** to save the settings. diff --git a/umn/source/managing_clusters/tenant_management/creating_a_resource_pool.rst b/umn/source/managing_clusters/tenant_management/creating_a_resource_pool.rst new file mode 100644 index 0000000..d89d67d --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/creating_a_resource_pool.rst @@ -0,0 +1,40 @@ +:original_name: mrs_01_0310.html + +.. _mrs_01_0310: + +Creating a Resource Pool +======================== + +Scenario +-------- + +In an MRS cluster, users can logically divide Yarn cluster nodes to combine multiple NodeManagers into a Yarn resource pool. Each NodeManager belongs to one resource pool only. The system contains a **default** resource pool by default. All NodeManagers that are not added to customized resource pools belong to this resource pool. + +You can create a customized resource pool on MRS and add hosts that have not been added to other customized resource pools to it. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Creating a Resource Pool `. For MRS 3.x or later, see :ref:`Overview `. + +#. Click the **Resource Pools** tab. +#. Click **Create Resource Pool**. +#. In **Create Resource Pool**, set the properties of the resource pool. + + - **Name**: Enter a name for the resource pool. The name of the newly created resource pool cannot be **default**. + + The name consists of 1 to 20 characters and can contain digits, letters, and underscores (_) but cannot start with an underscore (_). + + - **Available Hosts**: In the host list on the left, select a specified host name and add it to the resource pool. Only hosts in the cluster can be selected. The host list of a resource pool can be left blank. + +#. Click **OK**. +#. After a resource pool is created, users can view the **Name**, **Members**, **Type**, **vCore** and **Memory** in the resource pool list. Hosts that are added to the customized resource pool are no longer members of the **default** resource pool. 
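+
+The naming rules above can also be checked before the form is submitted. The following minimal Python sketch is illustrative only (the function name and the case-insensitive handling of **default** are assumptions, not MRS code); it encodes the constraints listed in this section:
+
+.. code-block:: python
+
+   import re
+
+   # Documented rules: 1 to 20 characters, only letters, digits, and
+   # underscores (_), must not start with an underscore, and the name
+   # "default" is reserved for the built-in resource pool.
+   _POOL_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_]{0,19}$")
+
+   def is_valid_pool_name(name: str) -> bool:
+       """Return True if the name satisfies the documented constraints."""
+       # Treating the reserved name as case-insensitive is an assumption.
+       return name.lower() != "default" and bool(_POOL_NAME.match(name))
+
+   print(is_valid_pool_name("pool_1"))   # True
+   print(is_valid_pool_name("_pool1"))   # False: starts with an underscore
+   print(is_valid_pool_name("default"))  # False: reserved name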
diff --git a/umn/source/managing_clusters/tenant_management/creating_a_sub-tenant.rst b/umn/source/managing_clusters/tenant_management/creating_a_sub-tenant.rst new file mode 100644 index 0000000..b4a4257 --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/creating_a_sub-tenant.rst @@ -0,0 +1,73 @@ +:original_name: mrs_01_0306.html + +.. _mrs_01_0306: + +Creating a Sub-tenant +===================== + +Scenario +-------- + +You can create a sub-tenant on MRS if the resources of the current tenant need to be further allocated. + +Prerequisites +------------- + +- A parent tenant has been added. +- A tenant name has been planned. The name must not be the same as that of a role or Yarn queue that exists in the current cluster. +- If a sub-tenant requires storage resources, a storage directory has been planned based on service requirements, and the planned directory does not exist under the storage directory of the parent tenant. +- The resources that can be allocated to the current tenant have been planned and the sum of the resource percentages of direct sub-tenants under the parent tenant at every level does not exceed 100%. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Creating a Sub-tenant `. For MRS 3.x or later, see :ref:`Overview `. + +#. In the tenant list on the left, move the cursor to the tenant node to which a sub-tenant is to be added. Click **Create sub-tenant**. On the displayed page, configure the sub-tenant attributes according to the following table: + + .. table:: **Table 1** Sub-tenant parameters + + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=========================================+===========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Parent tenant | Specifies the name of the parent tenant. 
| + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | Specifies the name of the current tenant. The value consists of 3 to 20 characters, and can contain letters, digits, and underscores (_). | + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant Type | The options include **Leaf** and **Non-leaf**. If **Leaf** is selected, the current tenant is a leaf tenant and no sub-tenant can be added. If **Non-leaf** is selected, sub-tenants can be added to the current tenant. | + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Dynamic Resource | Specifies the dynamic computing resources for the current tenant. The system automatically creates a task queue named after the sub-tenant name in the Yarn parent queue. When dynamic resources are not **Yarn**, the system does not automatically create a task queue. If the parent tenant does not have dynamic resources, the sub-tenant cannot use dynamic resources. | + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Capacity (%) | Specifies the percentage of the resources used by the current tenant. The base value is the total resources of the parent tenant. 
| + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Max. Capacity (%) | Specifies the maximum percentage of the computing resources used by the current tenant. The base value is the total resources of the parent tenant. | + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Resource | Specifies storage resources for the current tenant. The system automatically creates a file in the HDFS parent tenant directory. The file is named the same as the name of the sub-tenant. If storage resources are not **HDFS**, the system does not create a storage directory under the root directory of HDFS. If the parent tenant does not have storage resources, the sub-tenant cannot use storage resources. | + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Space Quota (MB) | Specifies the quota for HDFS storage space used by the current tenant. The minimum value is 1, and the maximum value is the total storage quota of the parent tenant. The unit is MB. This parameter indicates the maximum HDFS storage space that can be used by a tenant, but does not indicate the actual space used. If the value is greater than the size of the HDFS physical disk, the maximum space available is the full space of the HDFS physical disk. If the quota is greater than the quota of the parent tenant, the actual storage capacity is subject to the quota of the parent tenant. | + | | | + | | .. note:: | + | | | + | | To ensure data reliability, one backup is automatically generated for each file saved in HDFS, that is, two copies are generated in total. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if the value is set to **500**, the actual space for storing files is about 250 MB (500/2 = 250). 
| + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Path | Specifies the tenant's HDFS storage directory. The system automatically creates a file folder named after the sub-tenant name in the directory of the parent tenant by default. For example, if the sub-tenant is **ta1s** and the parent directory is **tenant/ta1**, the system sets this parameter for the sub-tenant to **tenant/ta1/ta1s**. The storage path is customizable in the parent directory. The parent directory for the storage path must be the storage directory of the parent tenant. | + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Specifies other service resources associated with the current tenant. HBase is supported. To configure this parameter, click **Associate Services**. In the dialog box that is displayed, set **Service** to **HBase**. If **Association Mode** is set to **Exclusive**, service resources are occupied exclusively. If **share** is selected, service resources are shared. | + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Specifies the description of the current tenant. | + +-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK** to save the settings. 
+ + It takes a few minutes to save the settings. If the message **Tenant created successfully** is displayed in the upper-right corner, the tenant is created successfully. + + .. note:: + + - Roles, computing resources, and storage resources are automatically created when tenants are created. + - The new role has permissions on the computing and storage resources. The role and its permissions are controlled by the system automatically and cannot be controlled manually under **Manage Role**. + - Before using this tenant, create a system user and assign the user a related tenant role. For details, see :ref:`Creating a User `. diff --git a/umn/source/managing_clusters/tenant_management/creating_a_tenant.rst b/umn/source/managing_clusters/tenant_management/creating_a_tenant.rst new file mode 100644 index 0000000..e66456e --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/creating_a_tenant.rst @@ -0,0 +1,85 @@ +:original_name: mrs_01_0305.html + +.. _mrs_01_0305: + +Creating a Tenant +================= + +Scenario +-------- + +You can create a tenant on MRS Manager to specify the resource usage. + +Prerequisites +------------- + +- A tenant name has been planned. The name must not be the same as that of a role or Yarn queue that exists in the current cluster. +- If a tenant requires storage resources, a storage directory has been planned based on service requirements, and the planned directory does not exist under the HDFS directory. +- The resources that can be allocated to the current tenant have been planned and the sum of the resource percentages of direct sub-tenants under the parent tenant at every level does not exceed 100%. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Creating a Tenant `. For MRS 3.x or later, see :ref:`Overview `. + +#. Click **Create Tenant**. On the page that is displayed, configure tenant properties. + + .. table:: **Table 1** Tenant parameters + + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter                               | Description                                                                                                                                                                                                                                                                                                                                                                                                              | + +=========================================+==========================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Name                                    | Specifies the name of the current tenant. The value consists of 3 to 50 characters, and can contain letters, digits, and underscores (_). 
| + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant Type | The options include **Leaf** and **Non-leaf**. If **Leaf** is selected, the current tenant is a leaf tenant and no sub-tenant can be added. If **Non-leaf** is selected, sub-tenants can be added to the current tenant. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Dynamic Resource | Specifies the dynamic computing resources for the current tenant. The system automatically creates a task queue named after the tenant name in Yarn. When dynamic resources are not **Yarn**, the system does not automatically create a task queue. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Capacity (%) | Specifies the percentage of the computing resources used by the current tenant in the **default** resource pool. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Max. Capacity (%) | Specifies the maximum percentage of the computing resources used by the current tenant in the **default** resource pool. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Resource | Specifies storage resources for the current tenant. The system automatically creates a file folder named after the tenant name in the **/tenant** directory. When a tenant is created for the first time, the system automatically creates the **/tenant** directory in the HDFS root directory. If storage resources are not **HDFS**, the system does not create a storage directory under the root directory of HDFS. 
| + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Space Quota (MB) | Specifies the quota for HDFS storage space used by the current tenant. The value ranges from **1** to **8796093022208**. The unit is MB. This parameter indicates the maximum HDFS storage space that can be used by a tenant, but does not indicate the actual space used. If the value is greater than the size of the HDFS physical disk, the maximum space available is the full space of the HDFS physical disk. | + | | | + | | .. note:: | + | | | + | | To ensure data reliability, one backup is automatically generated for each file saved in HDFS, that is, two copies are generated in total. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if the value is set to **500**, the actual space for storing files is about 250 MB (500/2 = 250). | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Path | Specifies the tenant's HDFS storage directory. The system automatically creates a file folder named after the tenant name in the **/tenant** directory by default. For example, the default HDFS storage directory for **ta1** is **tenant/ta1**. When a tenant is created for the first time, the system automatically creates the **/tenant** directory in the HDFS root directory. The storage path is customizable. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Specifies other service resources associated with the current tenant. HBase is supported. To configure this parameter, click **Associate Services**. In the dialog box that is displayed, set **Service** to **HBase**. If **Association Mode** is set to **Exclusive**, service resources are occupied exclusively. If **share** is selected, service resources are shared. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Specifies the description of the current tenant. 
| + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK** to save the settings. + + It takes a few minutes to save the settings. If the **Tenant created successfully** is displayed in the upper-right corner, the tenant is added successfully. The tenant is created successfully. + + .. note:: + + - Roles, computing resources, and storage resources are automatically created when tenants are created. + - The new role has permissions on the computing and storage resources. The role and its permissions are controlled by the system automatically and cannot be controlled manually under **Manage Role**. + - If you want to use the tenant, create a system user and assign the Manager_tenant role and the role corresponding to the tenant to the user. For details, see :ref:`Creating a User `. + +Related Tasks +------------- + +**View an added tenant.** + +#. On the MRS details page, click **Tenants**. + +#. In the tenant list on the left, click the name of the added tenant. + + The **Summary** tab is displayed on the right by default. + +#. View **Basic Information**, **Resource Quota**, and **Statistics** of the tenant. + + If HDFS is in the **Stopped** state, **Available** and **Used** of **Space** in **Resource Quota** are **unknown.** diff --git a/umn/source/managing_clusters/tenant_management/deleting_a_resource_pool.rst b/umn/source/managing_clusters/tenant_management/deleting_a_resource_pool.rst new file mode 100644 index 0000000..3b28a0a --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/deleting_a_resource_pool.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_0312.html + +.. _mrs_01_0312: + +Deleting a Resource Pool +======================== + +Scenario +-------- + +You can delete an existing resource pool on MRS. + +Prerequisites +------------- + +- Any queue in a cluster cannot use the resource pool to be deleted as the default resource pool. Before deleting the resource pool, cancel the default resource pool. For details, see :ref:`Configuring a Queue `. +- Resource distribution policies of all queues have been cleared from the resource pool being deleted. For details, see :ref:`Clearing Configuration of a Queue `. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Tenant**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Deleting a Resource Pool `. For MRS 3.x or later, see :ref:`Overview `. + +#. Click the **Resource Pools** tab. + +#. Locate the row that contains the specified resource pool, and click **Delete** in the **Operation** column. + + In the displayed dialog box, click **OK**. diff --git a/umn/source/managing_clusters/tenant_management/deleting_a_tenant.rst b/umn/source/managing_clusters/tenant_management/deleting_a_tenant.rst new file mode 100644 index 0000000..8b5d623 --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/deleting_a_tenant.rst @@ -0,0 +1,41 @@ +:original_name: mrs_01_0307.html + +.. 
_mrs_01_0307: + +Deleting a Tenant +================= + +Scenario +-------- + +You can delete a tenant that is not required on MRS. + +Prerequisites +------------- + +- A tenant has been added. +- You have checked whether the tenant to be deleted has sub-tenants. If the tenant has sub-tenants, delete them; otherwise, you cannot delete the tenant. +- The role of the tenant to be deleted cannot be associated with any user or user group. For details about how to cancel the binding between a role and a user, see :ref:`Modifying User Information `. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Deleting a tenant `. For MRS 3.x or later, see :ref:`Overview `. + +#. In the tenant list on the left, move the cursor to the tenant node to be deleted and click **Delete**. + + The **Delete Tenant** dialog box is displayed. If you want to save the tenant data, select **Reserve the data of this tenant**. Otherwise, the tenant's storage space will be deleted. + +#. Click **OK**. + + It takes a few minutes to save the configuration. After the tenant is deleted successfully, the role and storage space of the tenant are also deleted. + + .. note:: + + - After the tenant is deleted, the task queue of the tenant still exists in Yarn. + - If you choose not to reserve data when deleting the parent tenant, data of sub-tenants is also deleted if the sub-tenants use storage resources. diff --git a/umn/source/managing_clusters/tenant_management/index.rst b/umn/source/managing_clusters/tenant_management/index.rst new file mode 100644 index 0000000..b104978 --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/index.rst @@ -0,0 +1,38 @@ +:original_name: mrs_01_0303.html + +.. _mrs_01_0303: + +Tenant Management +================= + +- :ref:`Before You Start ` +- :ref:`Overview ` +- :ref:`Creating a Tenant ` +- :ref:`Creating a Sub-tenant ` +- :ref:`Deleting a Tenant ` +- :ref:`Managing a Tenant Directory ` +- :ref:`Restoring Tenant Data ` +- :ref:`Creating a Resource Pool ` +- :ref:`Modifying a Resource Pool ` +- :ref:`Deleting a Resource Pool ` +- :ref:`Configuring a Queue ` +- :ref:`Configuring the Queue Capacity Policy of a Resource Pool ` +- :ref:`Clearing Configuration of a Queue ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + before_you_start + overview + creating_a_tenant + creating_a_sub-tenant + deleting_a_tenant + managing_a_tenant_directory + restoring_tenant_data + creating_a_resource_pool + modifying_a_resource_pool + deleting_a_resource_pool + configuring_a_queue + configuring_the_queue_capacity_policy_of_a_resource_pool + clearing_configuration_of_a_queue diff --git a/umn/source/managing_clusters/tenant_management/managing_a_tenant_directory.rst b/umn/source/managing_clusters/tenant_management/managing_a_tenant_directory.rst new file mode 100644 index 0000000..bb5df02 --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/managing_a_tenant_directory.rst @@ -0,0 +1,114 @@ +:original_name: mrs_01_0308.html + +.. _mrs_01_0308: + +Managing a Tenant Directory +=========================== + +Scenario +-------- + +You can manage the HDFS storage directory used by a specific tenant on MRS. The management operations include adding a tenant directory, modifying the directory file quota, modifying the storage space, and deleting a directory. 
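+
+When planning **Maximum Number of Files/Directories** and **Storage Space Quota** in the procedure below, note that the space quota counts every copy of a file. The following minimal Python sketch is illustrative only (the function name and the fixed copy count of 2 are assumptions based on the note repeated in this section, not an MRS API); it shows the arithmetic used in the examples:
+
+.. code-block:: python
+
+   def usable_file_space_mb(space_quota_mb: int, copies: int = 2) -> float:
+       """Approximate space left for file content under an HDFS space quota.
+
+       The quota counts the disk space of all copies. With one automatic
+       backup (two copies in total), a quota of 500 MB leaves roughly
+       250 MB for actual file data, as in the example in this section.
+       """
+       return space_quota_mb / copies
+
+   print(usable_file_space_mb(500))  # 250.0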
+ +Prerequisites +------------- + +- A tenant associated with HDFS storage resources has been added. +- You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +- View a tenant directory. + + #. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Managing a Tenant Directory `. For MRS 3.x or later, see :ref:`Overview `. + + #. In the tenant list on the left, click the target tenant. + #. Click the **Resources** tab. + #. View the **HDFS Storage** table. + + - The **Maximum Number of Files/Directories** column indicates the quotas for the file and directory quantity of the tenant directory. + - The **Space Quota** column indicates storage space size of tenant directories. + +- Add a tenant directory. + + #. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Managing a Tenant Directory `. For MRS 3.x or later, see :ref:`Overview `. + + #. In the tenant list on the left, click the tenant whose HDFS storage directory needs to be added. + #. Click the **Resources** tab. + #. In the **HDFS Storage** table, click **Create Directory**. + + - Set **Path** to a tenant directory path. + + .. note:: + + - If the current tenant is not a sub-tenant, the new path is created in the HDFS root directory. + - If the current tenant is a sub-tenant, the new path is created in the specified directory. + + A complete HDFS storage directory can contain a maximum of 1,023 characters. An HDFS directory name contains digits, letters, spaces, and underscores (_). The name cannot start or end with a space. + + - Set **Maximum Number of Files/Directories** to the quotas of file and directory quantity. + + **Maximum Number of Files/Directories** is optional. Its value ranges from **1** to **9223372036854775806**. + + - Set **Storage Space Quota** to the storage space size of the tenant directory. + + The value of **Storage Space Quota** ranges from **1** to **8796093022208**. + + .. note:: + + To ensure data reliability, one backup is automatically generated for each file saved in HDFS, that is, two copies are generated in total. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if the value of **Storage Space Quota** is set to **500**, the actual space for storing files is about 250 MB (500/2 = 250). + + #. Click **OK**. The system creates tenant directories in the HDFS root directory. + +- Modify a tenant directory. + + #. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Managing a Tenant Directory `. For MRS 3.x or later, see :ref:`Overview `. + + #. In the tenant list on the left, click the tenant whose HDFS storage directory needs to be modified. + #. Click the **Resources** tab. + #. In the **HDFS Storage** table, click **Modify** in the **Operation** column of the specified tenant directory. + + - Set **Maximum Number of Files/Directories** to the quotas of file and directory quantity. + + **Maximum Number of Files/Directories** is optional. Its value ranges from **1** to **9223372036854775806**. + + - Set **Storage Space Quota** to the storage space size of the tenant directory. + + The value of **Storage Space Quota** ranges from **1** to **8796093022208**. + + .. 
note:: + + To ensure data reliability, one backup is automatically generated for each file saved in HDFS, that is, two copies are generated in total. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if the value of **Storage Space Quota** is set to **500**, the actual space for storing files is about 250 MB (500/2 = 250). + + #. Click **OK**. + +- Delete a tenant directory. + + #. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Managing a Tenant Directory `. For MRS 3.x or later, see :ref:`Overview `. + + #. In the tenant list on the left, click the tenant whose HDFS storage directory needs to be deleted. + + #. Click the **Resources** tab. + + #. In the **HDFS Storage** table, click **Delete** in the **Operation** column of the specified tenant directory. + + The default HDFS storage directory set during tenant creation cannot be deleted. Only the newly added HDFS storage directory can be deleted. + + #. Click **OK**. The tenant directory is deleted. diff --git a/umn/source/managing_clusters/tenant_management/modifying_a_resource_pool.rst b/umn/source/managing_clusters/tenant_management/modifying_a_resource_pool.rst new file mode 100644 index 0000000..bf075ea --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/modifying_a_resource_pool.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_0311.html + +.. _mrs_01_0311: + +Modifying a Resource Pool +========================= + +Scenario +-------- + +You can modify members of an existing resource pool on MRS. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Modifying a Resource Pool `. For MRS 3.x or later, see :ref:`Overview `. + +#. Click the **Resource Pools** tab. +#. Locate the row that contains the specified resource pool, and click **Modify** in the **Operation** column. +#. In **Modify Resource Pool**, modify **Added Hosts**. + + - Adding a host: In the host list on the left, select the specified host name and add it to the resource pool. + - Deleting a host: In the host list on the right, click |image1| next to a host to remove the host from the resource pool. The host list of a resource pool can be left blank. + +#. Click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0000001349057681.png diff --git a/umn/source/managing_clusters/tenant_management/overview.rst b/umn/source/managing_clusters/tenant_management/overview.rst new file mode 100644 index 0000000..44bcbd0 --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/overview.rst @@ -0,0 +1,37 @@ +:original_name: mrs_01_0304.html + +.. _mrs_01_0304: + +Overview +======== + +Definition +---------- + +An MRS cluster provides various resources and services for multiple organizations, departments, or applications to share. The cluster provides tenants as a logical entity to use these resources and services. A mode involving different tenants is called multi-tenant mode. Currently, only the analysis cluster supports tenant management. + +Principles +---------- + +The MRS cluster provides the multi-tenant function. It supports a layered tenant model and allows dynamic adding or deleting of tenants to isolate resources. It dynamically manages and configures tenants' computing and storage resources. 
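+
+As a rough illustration of this layered model, the following Python sketch (illustrative only; the class and field names are assumptions, not MRS code) models a tenant tree and the planning rule from the tenant creation prerequisites that the capacities of a parent's direct sub-tenants must not add up to more than 100%:
+
+.. code-block:: python
+
+   from dataclasses import dataclass, field
+   from typing import List
+
+   @dataclass
+   class Tenant:
+       """Simplified model of one node in the layered tenant hierarchy."""
+       name: str
+       capacity_percent: float                 # share of the parent's resources
+       children: List["Tenant"] = field(default_factory=list)
+
+       def violations(self) -> List[str]:
+           """List tenants whose direct sub-tenants claim more than 100%."""
+           problems = []
+           total = sum(c.capacity_percent for c in self.children)
+           if total > 100:
+               problems.append(f"{self.name}: sub-tenants claim {total}%")
+           for child in self.children:
+               problems.extend(child.violations())
+           return problems
+
+   ta1 = Tenant("ta1", 60, [Tenant("ta1s", 70), Tenant("ta1t", 50)])
+   print(ta1.violations())  # ['ta1: sub-tenants claim 120%']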
+ +The computing resources indicate tenants' Yarn task queue resources. The task queue quota can be modified, and the task queue usage status and statistics can be viewed. + +Storage resources are provided by HDFS. You can add and delete the HDFS storage directories of tenants, and set the quotas for file quantity and storage space of the directories. + +You can create and manage tenants in a cluster based on service requirements. + +- Roles, computing resources, and storage resources are automatically created when tenants are created. By default, all permissions of the new computing resources and storage resources are allocated to a tenant's roles. +- Permissions to view the current tenant's resources, add a sub-tenant, and manage the sub-tenant's resources are granted to the tenant's roles by default. +- After you have modified the tenant's computing or storage resources, permissions of the tenant's roles are automatically updated. + +MRS supports a maximum of 512 tenants. The system automatically creates the **default** tenant. Tenants at the topmost layer, at the same level as the **default** tenant, are called level-1 tenants. + +Resource Pools -------------- + +Yarn task queues support only the label-based scheduling policy. This policy enables Yarn task queues to associate NodeManagers that have specific node labels. In this way, Yarn tasks run on specified nodes so that they are scheduled onto specific hardware resources. For example, Yarn tasks requiring a large memory capacity can run on nodes with a large memory capacity by means of label association, preventing poor service performance. + +In an MRS cluster, you can logically divide Yarn cluster nodes to combine multiple NodeManagers into a resource pool. Yarn task queues can be associated with specified resource pools by configuring queue capacity policies, ensuring efficient and independent resource utilization in the resource pools. + +MRS supports a maximum of 50 resource pools. By default, the system contains a **default** resource pool. diff --git a/umn/source/managing_clusters/tenant_management/restoring_tenant_data.rst b/umn/source/managing_clusters/tenant_management/restoring_tenant_data.rst new file mode 100644 index 0000000..884f0e0 --- /dev/null +++ b/umn/source/managing_clusters/tenant_management/restoring_tenant_data.rst @@ -0,0 +1,40 @@ +:original_name: mrs_01_0309.html + +.. _mrs_01_0309: + +Restoring Tenant Data +===================== + +Scenario +-------- + +Tenant data is stored on Manager and in cluster components by default. When components are restored from faults or reinstalled, some tenant configuration data may be abnormal. In this case, you can manually restore the tenant data. + +Prerequisites +------------- + +You have synchronized IAM users. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +Procedure +--------- + +#. On the MRS details page, click **Tenants**. + + .. note:: + + For MRS 1.7.2 or earlier, see :ref:`Restoring Tenant Data `. For MRS 3.x or later, see :ref:`Overview `. + +#. In the tenant list on the left, click a tenant node. + +#. Check the status of the tenant data. + + a. In **Summary**, check the color of the circle on the left of **Basic Information**. Green indicates that the tenant is available and gray indicates that the tenant is unavailable. + b. Click **Resources** and check the circle on the left of **Yarn** or **HDFS Storage**. 
Green indicates that the resource is available, and gray indicates that the resource is unavailable. + c. Click **Service Association** and check the **Status** column of the associated service table. **Good** indicates that the component can provide services for the associated tenant. **Bad** indicates that the component cannot provide services for the tenant. + d. If any check result is abnormal, go to :ref:`4 ` to restore tenant data. + +#. .. _mrs_01_0309__li10849798195335: + + Click **Restore Tenant Data**. + +#. In the **Restore Tenant Data** window, select one or more components whose data needs to be restored. Click **OK**. The system automatically restores the tenant data. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/configuring_an_alarm_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/configuring_an_alarm_threshold.rst new file mode 100644 index 0000000..f6f53d2 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/configuring_an_alarm_threshold.rst @@ -0,0 +1,56 @@ +:original_name: mrs_01_0238.html + +.. _mrs_01_0238: + +Configuring an Alarm Threshold +============================== + +Scenario +-------- + +You can configure an alarm threshold to learn the metric health status. After **Send Alarm** is selected, the system sends an alarm message when the monitored data reaches the alarm threshold. You can view the alarm information in **Alarms**. + +Procedure +--------- + +#. On MRS Manager, click **System**. + +#. In **Configuration**, click **Configure Alarm Threshold** under **Monitoring and Alarm**, select monitoring metrics as planned, and set their baselines. + +#. Click a metric, for example, **CPU Usage**, and click **Create Rule**. + +#. Set the monitoring metric rule parameters on the displayed configuration page. + + .. table:: **Table 1** Monitoring metric rule parameters + + +-----------------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Value | Description | + +=======================+===============================+===============================================================================================================================================================================================================================================================================================================================================+ + | Rule Name | CPU_MAX (example value) | Specifies the rule name. | + +-----------------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Reference Date | 2014/11/06 (example) | Specifies the date on which the reference indicator history is generated. 
| + +-----------------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Threshold Type | - Max. value | Specifies the maximum or minimum value of a metric. If this parameter is set to **Max. Value**, the system generates an alarm when the actual value of the metric is greater than the threshold. If this parameter is set to **Min. Value**, the system generates an alarm when the actual value of the metric is smaller than the threshold. | + | | - Min. value | | + +-----------------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Alarm Severity | - Critical | Alarm Severity | + | | - Major | | + | | - Minor | | + | | - Suggestion | | + +-----------------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Time Range | From 00:00 to 23:59 (example) | Specifies the period in which the rule takes effect. | + +-----------------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Threshold | 80 (example) | Specifies the threshold of the rule monitoring metrics. | + +-----------------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Date | - Workday | Specifies the type of date when the rule takes effect. | + | | - Weekend | | + | | - Other | | + +-----------------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Add Date | 11/06 (example) | This parameter is valid only when **Date** is set to **Other**. You can select multiple dates. 
| + +-----------------------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**. A message is displayed in the upper right corner of the page, indicating that the template is saved successfully. + + **Send alarm** is selected by default. MRS Manager checks whether the value of each monitored metric reaches the threshold. If the number of consecutive check times is equal to the value of **Trigger Count**, and the threshold is not reached in these checks, the system sends an alarm. The value can be customized. **Check Period (s)** indicates the interval at which MRS Manager checks monitoring metrics. + +#. Locate the row that contains the newly added rule, and click **Apply** in the **Operation** column. A message is displayed in the upper right corner, indicating that the rule *xx* is successfully added. Click **Cancel** in the **Operation** column. A message is displayed in the upper right corner, indicating that the rule *xx* is successfully canceled. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/configuring_snmp_northbound_interface_parameters.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/configuring_snmp_northbound_interface_parameters.rst new file mode 100644 index 0000000..e6b880e --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/configuring_snmp_northbound_interface_parameters.rst @@ -0,0 +1,79 @@ +:original_name: mrs_01_0240.html + +.. _mrs_01_0240: + +Configuring SNMP Northbound Interface Parameters +================================================ + +Scenario +-------- + +You can configure the northbound interface so that alarms and monitoring metrics on MRS Manager can be integrated to the network management platform using SNMP. + +Prerequisites +------------- + +The ECS corresponding to the server must be in the same VPC as the Master node of the MRS cluster, and the Master node can access the IP address and specified port of the server. + +Procedure +--------- + +#. On MRS Manager, click **System**. + +#. In **Configuration**, click **Configure SNMP** under **Monitoring and Alarm**. + + The **SNMP Service** is disabled by default. Click the switch to enable the SNMP service. + +#. Set the interconnection parameters listed in :ref:`Table 1 `. + + .. _mrs_01_0240__en-us_topic_0035209607_table981749184027: + + .. table:: **Table 1** Syslog parameters + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+================================================================================================================================================================================+ + | Version | Specifies the version of the SNMP, which can be: | + | | | + | | - v2c: an earlier version with low security | + | | - v3: the latest version of SNMP with higher security than SNMPv2c | + | | | + | | The SNMP v3 version is recommended. 
| + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Local Port | Specifies the local port. The default value is **20000**. The value ranges from **1025** to **65535**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Read Community Name | Specifies the read-only community name. This parameter is valid only when **Version** is set to **v2c**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Write Community Name | Specifies the write community name. This parameter is valid only when **Version** is set to **v2c**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Security Username | Specifies the SNMP security username. This parameter is valid only when **Version** is set to **v3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Authentication Protocol | Specifies the authentication protocol. You are advised to set this parameter to set this parameter to **SHA**. This parameter is valid only when **Version** is set to **v3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Authentication Password | Specifies the authentication key. This parameter is valid only when **Version** is set to **v3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Confirm Password | Used to confirm the authentication key. This parameter is valid only when **Version** is set to **v3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Encryption Protocol | Specifies the encryption protocol. You are advised to set this parameter to **AES256**. This parameter is valid only when **Version** is set to **v3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Encryption Password | Specifies the encryption key. This parameter is valid only when **Version** is set to **v3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Confirm Password | Used to confirm the encryption key. 
This parameter is valid only when **Version** is set to **v3**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + - The **Authentication Password** and **Encryption Password** must contain 8 to 16 characters, including at least three types of the following characters: uppercase letters, lowercase letters, digits, and special characters. The two passwords must be different. The two passwords cannot be the same as the security username or the reverse of the security username. + - For security purposes, periodically change the authentication password and encryption password when the SNMP protocol is used. + - If SNMPv3 is used, a security user will be locked after five consecutive authentication failures within 5 minutes. The user will be automatically unlocked 5 minutes later. + +#. Click **Create Trap Target** in the **Trap Target** area. In the displayed dialog box, set the following parameters: + + - **Target Symbol** specifies the trap target ID, which is the ID of the NMS or host that receives traps. The value consists of 1 to 255 characters, including letters or digits. + - **Target IP Address** specifies the IP address of the target trap. IP addresses of class A, B, and C can be used to communicate with the IP address of the management plane of the management node. + - **Target Port** specifies the port receiving traps. The port number must be consistent with the peer end and ranges from 0 to 65535. + - **Trap Community Name** is valid only when **Version** is set to **v2c**. + + Click **OK**. The **Create Trap Target** dialog box is closed. + +#. Click **OK** to complete the settings. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/configuring_syslog_northbound_interface_parameters.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/configuring_syslog_northbound_interface_parameters.rst new file mode 100644 index 0000000..7670727 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/configuring_syslog_northbound_interface_parameters.rst @@ -0,0 +1,89 @@ +:original_name: mrs_01_0239.html + +.. _mrs_01_0239: + +Configuring Syslog Northbound Interface Parameters +================================================== + +Scenario +-------- + +You can configure the northbound interface so that alarms generated on MRS Manager can be reported to your monitoring O&M system using Syslog. + +.. important:: + + If the Syslog protocol is not encrypted, data may be stolen. + +Prerequisites +------------- + +The ECS corresponding to the server must be in the same VPC as the Master node of the MRS cluster, and the Master node can access the IP address and specified port of the server. + +Procedure +--------- + +#. On MRS Manager, click **System**. + +#. In **Configuration**, click **Configure Syslog** under **Monitoring and Alarm**. + + The **Syslog Service** is disabled by default. Click the switch to enable the Syslog service. + +#. Set the interconnection parameters listed in :ref:`Table 1 `. + + .. _mrs_01_0239__en-us_topic_0035209606_table27202707183556: + + .. 
table:: **Table 1** Syslog parameters + + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Area | Parameter | Description | + +===========================+=================================+===========================================================================================================================================================================================================================================================================================================+ + | Syslog Protocol | Service IP Address | Specifies the IP address of the interconnection server. | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Server Port | Specifies the port number for interconnection. | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Protocol | Specifies the protocol type. The options are as follows: | + | | | | + | | | - **TCP** | + | | | - **UDP** | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Severity | Specifies the severity of the reported message. The options are as follows: | + | | | | + | | | - **Informational** | + | | | - **Emergency** | + | | | - **Alert** | + | | | - **Critical** | + | | | - **Error** | + | | | - **Warning** | + | | | - **Notice** | + | | | - **Debug** | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Facility | Specifies the module where the log is generated. | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Identifier | Specifies the product ID. The default value is **MRS Manager**. 
| + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Report Message | Report Format | Specifies the message format of the alarm report. For details, see help information on the web page. | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Alarm Status | Specifies the type of the alarm to be reported. | + | | | | + | | | - **Fault**: indicates that the Syslog alarm message is reported when MRS Manager generates an alarm. | + | | | - **Clear**: indicates that a Syslog alarm message is reported when an alarm on MRS Manager is cleared. | + | | | - **Event**: indicates that the Syslog alarm message is reported when MRS Manager generates an event. | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Report Alarm Severity | Specifies the level of the alarm to be reported. The value can be **Suggestion**, **Minor**, **Major**, and **Critical**. | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Uncleared Alarm Reporting | Periodic Uncleared Alarm Report | Specifies whether uncleared alarms are reported periodically. By default, the switch of **Periodic Uncleared Alarm Reporting** is disabled. You can click the switch to enable it. | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Report Interval (min) | Specifies the interval for periodically reporting uncleared alarms to the remote Syslog service. This parameter is valid only when **Periodic Uncleared Alarm Reporting** switch is enabled. The unit is minute. The default value is **15**. The value ranges from 5 minutes to one day (1,440 minutes). 
| + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Heartbeat Settings | Heartbeat Report | Specifies whether to periodically report Syslog heartbeat messages. By default, the switch of **Periodic Uncleared Alarm Reporting** is disabled. You can click the switch to enable it. | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Heartbeat Period (min) | Specifies the interval for periodically reporting heartbeat messages. This parameter is valid only when **Heartbeat Report** switch is enabled. The unit is minute. The default value is **15**. The value ranges from 1 to 60. | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Heartbeat Packet | Specifies the content of the reported heartbeat message. This parameter is enabled when **Heartbeat Report** is enabled. The value can contain a maximum of 256 characters, including digits, letters, underscores (_), vertical bars (|), colons (:), spaces, commas (,), and periods (.). | + +---------------------------+---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + After the periodic heartbeat packet function is enabled, packets may be interrupted during automatic recovery of some cluster error tolerance (for example, active/standby management node switchover). In this case, wait for automatic recovery. + +#. Click **OK** to complete the settings. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/index.rst new file mode 100644 index 0000000..de7fa5b --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_0236.html + +.. _mrs_01_0236: + +Alarm Management +================ + +- :ref:`Viewing and Manually Clearing an Alarm ` +- :ref:`Configuring an Alarm Threshold ` +- :ref:`Configuring Syslog Northbound Interface Parameters ` +- :ref:`Configuring SNMP Northbound Interface Parameters ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + viewing_and_manually_clearing_an_alarm + configuring_an_alarm_threshold + configuring_syslog_northbound_interface_parameters + configuring_snmp_northbound_interface_parameters diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/viewing_and_manually_clearing_an_alarm.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/viewing_and_manually_clearing_an_alarm.rst new file mode 100644 index 0000000..47e5ee3 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_management/viewing_and_manually_clearing_an_alarm.rst @@ -0,0 +1,55 @@ +:original_name: mrs_01_0237.html + +.. _mrs_01_0237: + +Viewing and Manually Clearing an Alarm +====================================== + +Scenario +-------- + +You can view and clear alarms on MRS Manager. + +Generally, the system automatically clears an alarm when the fault is rectified. If the fault has been rectified and the alarm cannot be automatically cleared, you can manually clear the alarm. + +You can view the latest 100,000 alarms (including uncleared, manually cleared, and automatically cleared alarms) on MRS Manager. If the number of cleared alarms exceeds 100,000 and is about to reach 110,000, the system automatically dumps the earliest 10,000 cleared alarms to **${BIGDATA_HOME}/OMSV100R001C00x8664/workspace/data** on the active management node. A directory is automatically generated when alarms are dumped for the first time. + +.. note:: + + Set an automatic refresh interval or click |image1| for an immediate refresh. + + The following refresh interval options are supported: + + - Refresh every 30 seconds + - Refresh every 60 seconds + - Stop refreshing + +Procedure +--------- + +#. On MRS Manager, click **Alarms** to view the alarm information in the alarm list. + + - By default, the alarm list page displays the latest 10 alarms. + - By default, alarms are displayed in descending order by **Generated**. You can click **Alarm ID**, **Alarm Name**, **Severity**, **Generated**, **Location**, **Operation** to change the display mode. + - You can filter all alarms of the same severity in **Severity**, including cleared and uncleared alarms. + - You can click |image2|, |image3|, |image4|, or |image5| to filter out **Critical**, **Major**, **Minor**, or **Warning** alarms. + +2. Click **Advanced Search**. In the displayed alarm search area, set search criteria and click **Search** to view the information about specified alarms. Click **Reset** to clear the search criteria. + + .. note:: + + You can set the **Start Time** and **End Time** to specify the time range. You can search for alarms generated within the time range. + + Handle the alarm by referring to **Alarm Reference**. If the alarms in some scenarios are generated due to other cloud services that MRS depends on, you need to contact maintenance personnel of the corresponding cloud services. + +3. If the alarm needs to be manually cleared after errors are rectified, click **Clear Alarm**. + + .. note:: + + If multiple alarms have been handled, you can select one or more alarms to be cleared and click **Clear Alarm** to clear the alarms in batches. A maximum of 300 alarms can be cleared in each batch. + +.. |image1| image:: /_static/images/en-us_image_0000001348737925.png +.. |image2| image:: /_static/images/en-us_image_0000001348738141.jpg +.. 
|image3| image:: /_static/images/en-us_image_0000001295898280.jpg +.. |image4| image:: /_static/images/en-us_image_0000001349137829.jpg +.. |image5| image:: /_static/images/en-us_image_0000001349257417.jpg diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12001_audit_log_dump_failure.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12001_audit_log_dump_failure.rst new file mode 100644 index 0000000..3609d76 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12001_audit_log_dump_failure.rst @@ -0,0 +1,85 @@ +:original_name: alm_12001.html + +.. _alm_12001: + +ALM-12001 Audit Log Dump Failure +================================ + +Description +----------- + +Cluster audit logs need to be dumped on a third-party server due to the local historical data backup policy. Audit logs can be successfully dumped if the dump server meets the configuration conditions. This alarm is generated when the audit log dump fails because the disk space of the dump directory on the third-party server is insufficient or a user changes the username, password, or dump directory of the dump server. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12001 Minor Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The system can only store a maximum of 50 dump files locally. If the fault persists on the dump server, the local audit log may be lost. + +Possible Causes +--------------- + +- The network connection is abnormal. +- The username, password, or dump directory of the dump server does not meet the configuration conditions. +- The disk space of the dump directory is insufficient. + +Procedure +--------- + +#. Check whether the username, password, and dump directory are correct. + + a. Check on the dump configuration page of MRS Manager to see if they are correct. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`1.b `. + + b. .. _alm_12001__en-us_topic_0191813935_li56668375121446: + + Change the username, password, or dump directory, and click **OK**. + + c. Wait 2 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12001__en-us_topic_0191813935_li5748176314415: + + Reset the dump rule. + + a. On MRS Manager, choose **System** > **Dump Audit Log**. + b. Reset dump rules, set the parameters properly, and click **OK**. + c. Wait 2 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_12001__en-us_topic_0191813935_li2924012813025: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. 
For details, see `technical support `__. + +Reference +--------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12002_ha_resource_is_abnormal.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12002_ha_resource_is_abnormal.rst new file mode 100644 index 0000000..6c50588 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12002_ha_resource_is_abnormal.rst @@ -0,0 +1,148 @@ +:original_name: alm_12002.html + +.. _alm_12002: + +ALM-12002 HA Resource Is Abnormal +================================= + +Description +----------- + +The high availability (HA) software periodically checks the WebService floating IP addresses and databases of Manager. This alarm is generated when the HA software detects that the WebService floating IP addresses or databases are abnormal. + +This alarm is cleared when the HA software detects that the floating IP addresses or databases are normal. + +**Attribute** +------------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12002 Major Yes +======== ============== ========== + +Parameter +--------- + +=========== ======================================================== +Parameter Description +=========== ======================================================== +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +RESName Specifies the resource for which the alarm is generated. +=========== ======================================================== + +Impact on the System +-------------------- + +If the WebService floating IP addresses of Manager are abnormal, users cannot log in to or use Manager. If databases of Manager are abnormal, all core services and related service processes, such as alarms and monitoring functions, are affected. + +Possible Causes +--------------- + +- The floating IP address is abnormal. +- The database is abnormal. + +Procedure +--------- + +#. Check the floating IP address status of the active management node. + + a. Go to the MRS cluster details page. In the alarm list on the alarm management tab page, click the row that contains the alarm. In the alarm details, view the host address and resource name of the alarm. + + b. Log in to the active management node. Run the following commands to switch the user: + + **sudo su - root** + + **su - omm** + + c. Go to the **${BIGDATA_HOME}/om-0.0.1/sbin/** directory, run the **status-oms.sh** script to check whether the floating IP address of the active Manager is normal. View the command output, locate the row where **ResName** is **floatip**, and check whether the following information is displayed. + + Example: + + .. code-block:: + + 10-10-10-160 floatip Normal Normal Single_active + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`1.d `. + + d. .. _alm_12002__en-us_topic_0191813914_li41799423131631: + + Contact the O&M personnel to check whether the floating IP NIC exists. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`1.e `. + + e. .. _alm_12002__en-us_topic_0191813914_li6978622131725: + + Contact O&M personnel to rectify the NIC fault. 
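+
+      For example, O&M personnel can verify whether the floating IP NIC exists by running generic Linux commands on the active management node. This is only a sketch; replace *<floating IP address>* with the WebService floating IP address of Manager shown in the **status-oms.sh** output:
+
+      .. code-block::
+
+         # Check whether any NIC currently carries the floating IP address.
+         ip addr | grep -B 2 "<floating IP address>"
+
+         # Alternatively, list all NICs and look for the floating IP manually.
+         /sbin/ifconfig -a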
+ + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12002__en-us_topic_0191813914_li50663096131636: + + Check the database status of the active and standby management nodes. + + a. Log in to the active and standby management nodes, run the **sudo su - root** and **su - ommdba** commands to switch to user **ommdba**, and run the **gs_ctl query** command to check whether the following information is displayed in the command output. + + Command output of the active management node: + + .. code-block:: + + Ha state: + LOCAL_ROLE: Primary + STATIC_CONNECTIONS: 1 + DB_STATE: Normal + DETAIL_INFORMATION: user/password invalid + Senders info: + No information + Receiver info: + No information + + Command output of the standby management node: + + .. code-block:: + + Ha state: + LOCAL_ROLE: Standby + STATIC_CONNECTIONS: 1 + DB_STATE : Normal + DETAIL_INFORMATION: user/password invalid + Senders info: + No information + Receiver info: + No information + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`2.b `. + + b. .. _alm_12002__en-us_topic_0191813914_li40232703142216: + + Contact the O&M personnel to check whether a network fault occurs and rectify the fault. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`3 `. + + c. .. _alm_12002__en-us_topic_0191813914_li55696398142240: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_12002__en-us_topic_0191813935_li2924012813025: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12004_oldap_resource_is_abnormal.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12004_oldap_resource_is_abnormal.rst new file mode 100644 index 0000000..ab143d4 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12004_oldap_resource_is_abnormal.rst @@ -0,0 +1,79 @@ +:original_name: alm_12004.html + +.. _alm_12004: + +ALM-12004 OLdap Resource Is Abnormal +==================================== + +Description +----------- + +This alarm is generated when the Ldap resource in Manager is abnormal. + +This alarm is cleared when the Ldap resource in Manager recovers and the alarm handling is complete. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12004 Major Yes +======== ============== ========== + +Parameter +--------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. 
+=========== ======================================================= + +Impact on the System +-------------------- + +The Manager authentication services are unavailable and cannot provide security authentication and user management functions for web upper-layer services. Users may be unable to log in to Manager. + +Possible Causes +--------------- + +The LdapServer process in Manager is abnormal. + +Procedure +--------- + +#. Check whether the LdapServer process in Manager is normal. + + a. Log in to the active management node. + + b. Run **ps -ef \| grep slapd** to check whether the LdapServer resource process in the **${BIGDATA_HOME}/om-0.0.1/** directory of the configuration file is running properly. + + You can determine that the resource is normal as follows: + + #. Run **sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh** and find that **ResHAStatus** of the OLdap process is **Normal**. + #. Run **ps -ef \| grep slapd** and find that the slapd process occupies port 21750. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`3 `. + +#. .. _alm_12004__en-us_topic_0191813880_li15577384153414: + + Run **kill -2** *PID of the LdapServer process* and wait 20 seconds. The HA starts the OLdap process automatically. Check whether the status of the OLdap resource is normal. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_12004__en-us_topic_0191813935_li2924012813025: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12005_okerberos_resource_is_abnormal.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12005_okerberos_resource_is_abnormal.rst new file mode 100644 index 0000000..69b5103 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12005_okerberos_resource_is_abnormal.rst @@ -0,0 +1,78 @@ +:original_name: alm_12005.html + +.. _alm_12005: + +ALM-12005 OKerberos Resource Is Abnormal +======================================== + +Description +----------- + +The alarm module monitors the status of the Kerberos resource in Manager. This alarm is generated when the Kerberos resource is abnormal. + +This alarm is cleared when the alarm handling is complete and the Kerberos resource status recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12005 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The authentication services are unavailable and cannot provide security authentication functions for web upper-layer services. Users may be unable to log in to MRS Manager. 
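+
+A quick way to confirm this impact is to attempt a Kerberos authentication from a node on which the cluster client is installed. The following commands are only a sketch; the client directory **/opt/client** and the user name are example values that depend on your environment:
+
+.. code-block::
+
+   # Load the client environment variables (the client path is an example).
+   source /opt/client/bigdata_env
+
+   # Request a ticket-granting ticket. A failure here indicates that the
+   # Kerberos authentication service is currently unavailable.
+   kinit <username>
+
+   # If kinit succeeds, list the cached tickets.
+   klist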
+ +Possible Causes +--------------- + +The OLdap resource on which OKerberos depends is abnormal. + +Procedure +--------- + +#. Check whether the OLdap resource on which OKerberos depends is abnormal in Manager. + + a. Log in to the active management node. + + b. Run the following command to check whether the OLdap resource managed by HA is normal: + + **sh ${BIGDATA_HOME}/OMSV100R001C00x8664/workspace0/ha/module/hacom/script/status_ha.sh** + + The OLdap resource is normal when the OLdap resource is in the **Active_normal** state on the active node and in the **Standby_normal** state on the standby node. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`2 `. + +#. .. _alm_12005__en-us_topic_0191813966_li29509559161240: + + Resolve the problem by following the instructions in :ref:`ALM-12004 OLdap Resource Is Abnormal `. After the OLdap resource status recovers, check whether the OKerberos resource is normal. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_12005__en-us_topic_0191813966_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12006_node_fault.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12006_node_fault.rst new file mode 100644 index 0000000..7f785b6 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12006_node_fault.rst @@ -0,0 +1,97 @@ +:original_name: alm_12006.html + +.. _alm_12006: + +ALM-12006 Node Fault +==================== + +Description +----------- + +Controller checks the NodeAgent status every 30 seconds. This alarm is generated when Controller fails to receive the status report of a NodeAgent for three consecutive times. + +This alarm is cleared when Controller can properly receive the status report of the NodeAgent. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12006 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Services on the node are unavailable. + +Possible Causes +--------------- + +The network is disconnected, or the hardware is faulty. + +Procedure +--------- + +#. Check whether the network is disconnected or the hardware is faulty. + + a. Go to the MRS cluster details page. In the alarm list on the alarm management tab page, click the row that contains the alarm. In the alarm details, view the host address of the alarm. + + b. Log in to the active management node. + + c. Run the following command to check whether the faulty node is reachable: + + **ping** *IP address of the faulty host* + + #. 
If yes, go to :ref:`2 `. + #. If no, go to :ref:`1.d `. + + d. .. _alm_12006__en-us_topic_0191813934_li65085062161917: + + Contact the O&M personnel to check whether the network is faulty. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`1.f `. + + e. Rectify the network fault and check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`1.f `. + + f. .. _alm_12006__en-us_topic_0191813934_li25618036162125: + + Contact the O&M personnel to check whether a hardware fault (for example, a CPU or memory fault) occurs on the node. + + - If yes, go to :ref:`1.g `. + - If no, go to :ref:`2 `. + + g. .. _alm_12006__en-us_topic_0191813934_li8903046162132: + + Repair the faulty components and restart the node. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12006__en-us_topic_0191813934_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12007_process_fault.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12007_process_fault.rst new file mode 100644 index 0000000..5ce6d2c --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12007_process_fault.rst @@ -0,0 +1,122 @@ +:original_name: alm_12007.html + +.. _alm_12007: + +ALM-12007 Process Fault +======================= + +Description +----------- + +The process health check module checks the process status every 5 seconds. This alarm is generated when the process health check module detects that the process connection status is Bad for three consecutive times. + +This alarm is cleared when the process can be connected. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12007 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The service provided by the process is unavailable. + +Possible Causes +--------------- + +- The instance process is abnormal. +- The drive space is insufficient. + +Procedure +--------- + +#. Check whether the instance process is abnormal. + + a. Go to the MRS cluster details page. In the alarm list on the alarm management tab page, click the row that contains the alarm. In the alarm details, view the host name and service name of the alarm. + + b. On the **Alarms** page, check whether the alarm :ref:`ALM-12006 Node Fault ` is generated. + + If yes, go to :ref:`1.c `. + + If no, go to :ref:`1.d `. + + c. .. 
_alm_12007__en-us_topic_0191813896_li2911734163437: + + Handle the alarm by following the instructions in :ref:`ALM-12006 Node Fault `. + + d. .. _alm_12007__en-us_topic_0191813896_li13866005163437: + + Check whether the installation directory user, user group, and permission of the alarm role are correct. The correct user, user group, and the permission are **omm**, **ficommon**, and **750**, respectively. + + - If yes, go to :ref:`1.f `. + - If no, go to :ref:`1.e `. + + e. .. _alm_12007__en-us_topic_0191813896_li56651933164749: + + Run the following commands to set the permission to **750** and **User:Group** to **omm:ficommon**: + + **chmod 750** ** + + **chown omm:ficommon** ** + + f. .. _alm_12007__en-us_topic_0191813896_li46518721164818: + + Wait 5 minutes and check whether the ALM-12007 Process Fault alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Check whether the disk space is insufficient. + + a. .. _alm_12007__en-us_topic_0191813896_li1779806016495: + + On the MRS cluster details page, click the alarm management tab and check whether ALM-12017 Insufficient Disk Capacity is generated in the alarm list. + + - If yes, go to :ref:`2.b `. + - If no, go to :ref:`3 `. + + b. .. _alm_12007__en-us_topic_0191813896_li41496976164852: + + Handle the alarm by following the instructions in :ref:`ALM-12017 Insufficient Disk Capacity `. + + c. Wait 5 minutes and check whether the ALM-12017 Insufficient Disk Capacity alarm is cleared. + + If yes, go to :ref:`2.d `. + + If no, go to :ref:`3 `. + + d. .. _alm_12007__en-us_topic_0191813896_li33899481164916: + + Wait 5 minutes and check whether the alarm is cleared. + + If yes, no further action is required. + + If no, go to :ref:`3 `. + +#. .. _alm_12007__en-us_topic_0191813896_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12010_manager_heartbeat_interruption_between_the_active_and_standby_nodes.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12010_manager_heartbeat_interruption_between_the_active_and_standby_nodes.rst new file mode 100644 index 0000000..29157bb --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12010_manager_heartbeat_interruption_between_the_active_and_standby_nodes.rst @@ -0,0 +1,97 @@ +:original_name: alm_12010.html + +.. _alm_12010: + +ALM-12010 Manager Heartbeat Interruption Between the Active and Standby Nodes +============================================================================= + +Description +----------- + +This alarm is generated when the active Manager does not receive any heartbeat signal from the standby Manager within 7 seconds. + +This alarm is cleared when the active Manager receives heartbeat signals from the standby Manager. 
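+
+As a quick check on the heartbeat link, you can query the OMS status on the active management node. This is only a sketch; it reuses the **status-oms.sh** script referenced for other Manager alarms in this guide and assumes that you have switched to user **omm**:
+
+.. code-block::
+
+   # Run on the active management node.
+   sudo su - root
+   su - omm
+
+   # Query the OMS status. The output indicates whether the standby Manager
+   # is reachable and whether any HA resource is abnormal.
+   sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh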
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12010 Major Yes +======== ============== ========== + +Parameters +---------- + ++-----------------------+---------------------------------------------------------+ +| Parameter | Description | ++=======================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-----------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-----------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-----------------------+---------------------------------------------------------+ +| Local Manager HA Name | Specifies a local Manager HA. | ++-----------------------+---------------------------------------------------------+ +| Peer Manager HA Name | Specifies a peer Manager HA. | ++-----------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +When the active Manager process is abnormal, an active/standby failover cannot be performed, and services are affected. + +Possible Causes +--------------- + +The link between the active and standby Manager servers is abnormal. + +Procedure +--------- + +#. Check whether the network between the active and standby Manager servers is normal. + + a. Go to the MRS cluster details page. In the alarm list on the alarm management tab page, click the row that contains the alarm. In the alarm details, view the address of the standby Manager server. + + b. Log in to the active management node. + + c. Run the following command to check whether the standby Manager is reachable: + + **ping** *heartbeat IP address of the standby Manager* + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`1.d `. + + d. .. _alm_12010__en-us_topic_0191813932_li233941717940: + + Contact the O&M personnel to check whether the network is faulty. + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_12010__en-us_topic_0191813932_li4279289717106: + + Rectify the network fault and check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12010__li7265151273516: + + Log in to all master nodes in the cluster and run the following commands to find all **sed**\ *xxx* files and delete them: + + **find /srv/BigData/ -name "sed*"** + + **find /opt -name "sed*"** + +#. Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
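+
+The **sed**\ *xxx* files removed in the preceding procedure are typically temporary files that **sed** leaves behind when an in-place edit is interrupted. The commands below are only a sketch of how to review and then delete them in one pass; check the matches before deleting, because the exact paths can differ between clusters:
+
+.. code-block::
+
+   # List the leftover temporary files first and review the output.
+   find /srv/BigData/ /opt -name "sed*" -type f
+
+   # Remove them only after confirming that every match is a leftover file.
+   find /srv/BigData/ /opt -name "sed*" -type f -delete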
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12011_data_synchronization_exception_between_the_active_and_standby_manager_nodes.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12011_data_synchronization_exception_between_the_active_and_standby_manager_nodes.rst new file mode 100644 index 0000000..24bfc1d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12011_data_synchronization_exception_between_the_active_and_standby_manager_nodes.rst @@ -0,0 +1,89 @@ +:original_name: alm_12011.html + +.. _alm_12011: + +ALM-12011 Data Synchronization Exception Between the Active and Standby Manager Nodes +===================================================================================== + +Description +----------- + +This alarm is generated when the standby Manager fails to synchronize files with the active Manager. + +This alarm is cleared when the standby Manager synchronizes files with the active Manager. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12011 Critical Yes +======== ============== ========== + +Parameters +---------- + ++-----------------------+---------------------------------------------------------+ +| Parameter | Description | ++=======================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-----------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-----------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-----------------------+---------------------------------------------------------+ +| Local Manager HA Name | Specifies a local Manager HA. | ++-----------------------+---------------------------------------------------------+ +| Peer Manager HA Name | Specifies a peer Manager HA. | ++-----------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Because the configuration files on the standby Manager are not updated, some configurations will be lost after an active/standby switchover. Manager and some components may not run properly. + +Possible Causes +--------------- + +The link between the active and standby Manager nodes is interrupted. + +Procedure +--------- + +#. Check whether the network between the active and standby Manager servers is normal. + + a. Go to the MRS cluster details page. In the alarm list on the alarm management tab page, click the row that contains the alarm. In the alarm details, view the address of the standby Manager server. + + b. Log in to the active management node. Run the following command to check whether the standby Manager is reachable: + + **ping** *IP address of the standby Manager* + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`1.c `. + + c. .. _alm_12011__en-us_topic_0191813887_li47267615172220: + + Contact the O&M personnel to check whether the network is faulty. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`2 `. + + d. .. 
_alm_12011__en-us_topic_0191813887_li37136917172238: + + Rectify the network fault and check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12011__en-us_topic_0191813887_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12012_ntp_service_is_abnormal.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12012_ntp_service_is_abnormal.rst new file mode 100644 index 0000000..4da9416 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12012_ntp_service_is_abnormal.rst @@ -0,0 +1,158 @@ +:original_name: alm_12012.html + +.. _alm_12012: + +ALM-12012 NTP Service Is Abnormal +================================= + +Description +----------- + +This alarm is generated when the NTP service on the current node fails to synchronize time with the NTP service on the active OMS node. + +This alarm is cleared when the NTP service on the current node synchronizes time properly with the NTP service on the active OMS node. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12012 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The time on the node is inconsistent with that on other nodes in the cluster. Therefore, some MRS applications on the node may not run properly. + +Possible Causes +--------------- + +- The NTP service on the current node cannot start properly. +- The current node fails to synchronize time with the NTP service on the active OMS node. +- The key value authenticated by the NTP service on the current node is inconsistent with that on the active OMS node. +- The time offset between the node and the NTP service on the active OMS node is large. + +Procedure +--------- + +#. Check the NTP service on the current node. + + a. Check whether the ntpd process is running on the node using the following method. Log in to the node for which the alarm is generated and run the **sudo su - root** command to switch to user **root**. Then run the following command to check whether the command output contains the ntpd process: + + **ps -ef \| grep ntpd \| grep -v grep** + + - If yes, go to :ref:`2.a `. + - If no, go to :ref:`1.b `. + + b. .. _alm_12012__en-us_topic_0191813955_li6445073917350: + + Run **service ntp start** to start the NTP service. + + c. Wait 10 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. 
+
+#. Check whether the current node can synchronize time properly with the NTP service on the active OMS node.
+
+   a. .. _alm_12012__en-us_topic_0191813955_li64213271174322:
+
+      Check whether the node can synchronize time with the NTP service on the active OMS node based on the additional information of the alarm.
+
+      - If yes, go to :ref:`2.b `.
+      - If no, go to :ref:`3 `.
+
+   b. .. _alm_12012__en-us_topic_0191813955_li14178567173544:
+
+      Check whether the synchronization with the NTP service on the active OMS node is faulty.
+
+      Log in to the node for which the alarm is generated, run the **sudo su - root** command to switch to user **root**, and run the **ntpq -np** command.
+
+      If an asterisk (``*``) exists before the IP address of the NTP service on the active OMS node in the command output, the synchronization is in the normal state. The command output is as follows:
+
+      .. code-block::
+
+         remote refid st t when poll reach delay offset jitter
+         ==============================================================================
+         *10.10.10.162 .LOCL. 1 u 1 16 377 0.270 -1.562 0.014
+
+      If there is no asterisk (``*``) before the IP address of the NTP service on the active OMS node, as shown in the following command output, and the value of **refid** is **.INIT.**, the synchronization is abnormal.
+
+      .. code-block::
+
+         remote refid st t when poll reach delay offset jitter
+         ==============================================================================
+         10.10.10.162 .INIT. 1 u 1 16 377 0.270 -1.562 0.014
+
+      - If yes, go to :ref:`2.c `.
+      - If no, go to :ref:`3 `.
+
+   c. .. _alm_12012__en-us_topic_0191813955_li25713785173557:
+
+      Rectify the fault, wait 10 minutes, and then check whether the alarm is cleared.
+
+      An NTP synchronization failure is usually related to the system firewall. If the firewall can be disabled, disable it and then check whether the fault is rectified. If the firewall cannot be disabled, check the firewall configuration policies and ensure that port **UDP 123** is enabled (follow the specific firewall configuration policies of each system).
+
+      - If yes, no further action is required.
+      - If no, go to :ref:`3 `.
+
+#. .. _alm_12012__en-us_topic_0191813955_li65673991173316:
+
+   Check whether the key value authenticated by the NTP service on the current node is consistent with that on the active OMS node.
+
+   Run **cat /etc/ntp.keys** (or **cat /etc/ntp/ntpkeys**, depending on the operating system) to check whether the authentication code whose key value index is 1 is the same as that of the NTP service on the active OMS node.
+
+   - If yes, go to :ref:`4.a `.
+   - If no, go to :ref:`5 `.
+
+#. Check whether the time offset between the node and the NTP service on the active OMS node is large.
+
+   a. .. _alm_12012__en-us_topic_0191813955_li50308011174636:
+
+      Check whether the time offset is large based on the additional information of the alarm.
+
+      - If yes, go to :ref:`4.b `.
+      - If no, go to :ref:`5 `.
+
+   b. .. _alm_12012__en-us_topic_0191813955_li25272675173633:
+
+      On the **Hosts** page, select the host of the node, and choose **More** > **Stop All Roles** to stop all the services on the node.
+
+      If the time on the alarm node is later than that on the NTP service of the active OMS node, adjust the time of the alarm node. After adjusting the time, choose **More** > **Start All Roles** to start the services on the node.
+
+      If the time on the alarm node is earlier than that on the NTP service of the active OMS node, wait until the time offset has elapsed and then adjust the time of the alarm node.
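+
+      For example, the current offset (in milliseconds) of the active OMS NTP server, as reported by **ntpq -np**, can be read as follows. This is a minimal sketch that assumes the NTP server IP address is 10.10.10.162 (replace it with the actual address):
+
+      .. code-block::
+
+         # Print the offset column (ms) for the active OMS NTP server entry.
+         ntpq -np | grep '10.10.10.162' | awk '{print $9}'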
After adjusting the time, choose **More** > **Start All Roles** to start the services on the node. + + .. note:: + + If you do not wait, data loss may occur. + + c. Wait 10 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. .. _alm_12012__en-us_topic_0191813955_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12014_device_partition_lost.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12014_device_partition_lost.rst new file mode 100644 index 0000000..a6afa2a --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12014_device_partition_lost.rst @@ -0,0 +1,100 @@ +:original_name: alm_12014.html + +.. _alm_12014: + +ALM-12014 Device Partition Lost +=============================== + +Description +----------- + +This alarm is generated when the system detects that a partition to which service directories are mounted is lost (because the device is removed or goes offline, or the partition is deleted). The system checks the partition status periodically. + +This alarm needs to be cleared manually. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12014 Major No +======== ============== ========== + +Parameters +---------- + ++---------------+------------------------------------------------------------------+ +| Parameter | Description | ++===============+==================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++---------------+------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++---------------+------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++---------------+------------------------------------------------------------------+ +| DirName | Specifies the directory for which the alarm is generated. | ++---------------+------------------------------------------------------------------+ +| PartitionName | Specifies the device partition for which the alarm is generated. | ++---------------+------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Service data fails to be written into the partition, and the service system runs abnormally. + +Possible Causes +--------------- + +- The disk is removed. +- The disk is offline, or a bad sector exists on the disk. + +Procedure +--------- + +#. Go to the MRS cluster details page and choose **Alarms**. + +#. In the real-time alarm list, click the row that contains the alarm. + +#. In the **Alarm Details** area, obtain the values of **HostName**, **PartitionName**, and **DirName** from **Location**. + +#. Check whether the disk corresponding to **PartitionName** on **HostName** is inserted to the correct server slot. + + - If yes, go to :ref:`5 `. 
+ - If no, go to :ref:`6 `. + +#. .. _alm_12014__en-us_topic_0191813950_li1456274715359: + + Contact hardware engineers to remove the faulty disk. + +#. .. _alm_12014__en-us_topic_0191813950_li6395586615359: + + Use PuTTY to log in to the **HostName** node where an alarm is reported and check whether there is a line containing **DirName** in the **/etc/fstab** file. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`8 `. + +#. .. _alm_12014__en-us_topic_0191813950_li921471615359: + + Run the **vi /etc/fstab** command to edit the file and delete the line containing **DirName**. + +#. .. _alm_12014__en-us_topic_0191813950_li819455015359: + + Contact hardware engineers to insert a new disk. For details, see the hardware product document of the relevant model. If the faulty disk is in a RAID group, configure the RAID group. For details, see the configuration methods of the relevant RAID controller card. + +#. Wait 20 to 30 minutes (The disk size determines the waiting time), and run the **mount** command to check whether the disk has been mounted to the **DirName** directory. + + - If yes, manually clear the alarm. No further operation is required. + - If no, go to :ref:`10 `. + +#. .. _alm_12014__en-us_topic_0191813950_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12015_device_partition_file_system_read-only.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12015_device_partition_file_system_read-only.rst new file mode 100644 index 0000000..b808086 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12015_device_partition_file_system_read-only.rst @@ -0,0 +1,63 @@ +:original_name: alm_12015.html + +.. _alm_12015: + +ALM-12015 Device Partition File System Read-Only +================================================ + +Description +----------- + +This alarm is generated when the system detects that a partition to which service directories are mounted enters the read-only mode (due to a bad sector or a faulty file system). The system checks the partition status periodically. + +This alarm is cleared when the system detects that the partition to which service directories are mounted exits from the read-only mode (because the file system is restored to read/write mode, the device is removed, or the device is formatted). + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12015 Major Yes +======== ============== ========== + +Parameters +---------- + ++---------------+------------------------------------------------------------------+ +| Parameter | Description | ++===============+==================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++---------------+------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++---------------+------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++---------------+------------------------------------------------------------------+ +| DirName | Specifies the directory for which the alarm is generated. | ++---------------+------------------------------------------------------------------+ +| PartitionName | Specifies the device partition for which the alarm is generated. | ++---------------+------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Service data fails to be written into the partition, and the service system runs abnormally. + +Possible Causes +--------------- + +The disk is faulty, for example, a bad sector exists. + +Procedure +--------- + +#. Go to the MRS cluster details page and choose **Alarms**. +#. In the real-time alarm list, click the row that contains the alarm. +#. In the **Alarm Details** area, obtain **HostName** and **PartitionName** from **Location**. **HostName** indicates the node for which the alarm is generated, and **PartitionName** indicates the partition of the faulty disk. +#. Contact hardware engineers to check whether the disk is faulty. If the disk is faulty, remove it from the server. +#. After the disk is removed, the system reports ALM-12014 Partition Lost. Handle the alarm by following the instructions in :ref:`ALM-12014 Device Partition Lost `. After the handling, the alarm is automatically cleared. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12016_cpu_usage_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12016_cpu_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..8dd89cd --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12016_cpu_usage_exceeds_the_threshold.rst @@ -0,0 +1,91 @@ +:original_name: alm_12016.html + +.. _alm_12016: + +ALM-12016 CPU Usage Exceeds the Threshold +========================================= + +Description +----------- + +The system checks the CPU usage every 30 seconds and compares the check result with the default threshold. The CPU usage has a default threshold. This alarm is generated when the CPU usage exceeds the threshold for several times (configurable, 10 times by default) consecutively. + +This alarm is cleared when the average CPU usage is less than or equal to 90% of the threshold. + +**Attribute** +------------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12016 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Processes respond slowly or do not work. + +Possible Causes +--------------- + +- The alarm threshold or alarm hit number is improperly configured. +- The CPU configuration cannot meet service requirements. The CPU usage reaches the upper limit. + +Procedure +--------- + +#. Check whether the alarm threshold or alarm hit number is properly configured. + + a. Log in to MRS Manager and change the alarm threshold and alarm hit number based on CPU usage. + b. Choose **System** > **Threshold Configuration** > **Device** > **Host** > **CPU** > **CPU Usage** > **CPU Usage** and change the alarm threshold based on the actual CPU usage. + c. Choose **System** > **Threshold Configuration** > **Device** > **Host** > **CPU** > **CPU Usage** > **CPU Usage** and change **hit number** based on the actual CPU usage. + + .. note:: + + This option defines the alarm check phase. **Interval** indicates the alarm check period and **hit number** indicates the number of times when the CPU usage exceeds the threshold. An alarm is generated when the CPU usage exceeds the threshold for several times consecutively. + + d. Wait 2 minutes and check whether the alarm is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12016__en-us_topic_0191813922_li23374914104744: + + Expand the system. + + a. Go to the MRS cluster details page. In the alarm list on the alarm management tab page, click the row that contains the alarm. In the alarm details, view the address of the node. + b. Log in to the node for which the alarm is generated. + c. Run **cat /proc/stat \| awk 'NR==1'|awk '{for(i=2;i<=NF;i++)j+=$i;print "" 100 - ($5+$6) \* 100 / j;}'** to check the system CPU usage. + d. If the CPU usage exceeds the threshold, expand the CPU capacity. + e. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_12016__en-us_topic_0191813922_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +**Reference** +------------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12017_insufficient_disk_capacity.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12017_insufficient_disk_capacity.rst new file mode 100644 index 0000000..dffa8db --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12017_insufficient_disk_capacity.rst @@ -0,0 +1,138 @@ +:original_name: alm_12017.html + +.. 
_alm_12017:
+
+ALM-12017 Insufficient Disk Capacity
+====================================
+
+Description
+-----------
+
+The system checks the host disk usage every 30 seconds and compares the actual disk usage with the threshold. The disk usage has a default threshold. This alarm is generated if the disk usage exceeds the threshold.
+
+To change the threshold, choose **System** > **Threshold Configuration**.
+
+This alarm is cleared when the host disk usage is less than or equal to the threshold.
+
+Attribute
+---------
+
+======== ============== ==========
+Alarm ID Alarm Severity Auto Clear
+======== ============== ==========
+12017    Major          Yes
+======== ============== ==========
+
+Parameters
+----------
+
++-------------------+--------------------------------------------------------------------------------------+
+| Parameter         | Description                                                                          |
++===================+======================================================================================+
+| ServiceName       | Specifies the service for which the alarm is generated.                              |
++-------------------+--------------------------------------------------------------------------------------+
+| RoleName          | Specifies the role for which the alarm is generated.                                 |
++-------------------+--------------------------------------------------------------------------------------+
+| HostName          | Specifies the host for which the alarm is generated.                                 |
++-------------------+--------------------------------------------------------------------------------------+
+| PartitionName     | Specifies the disk partition for which the alarm is generated.                       |
++-------------------+--------------------------------------------------------------------------------------+
+| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold.  |
++-------------------+--------------------------------------------------------------------------------------+
+
+Impact on the System
+--------------------
+
+Service processes become unavailable.
+
+Possible Causes
+---------------
+
+The disk configuration cannot meet service requirements. The disk usage reaches the upper limit.
+
+Procedure
+---------
+
+#. Log in to MRS Manager and check whether the threshold is appropriate.
+
+   a. The default threshold is 90%, and it can be changed to meet service requirements. Check whether the current threshold is appropriate.
+
+      - If yes, go to :ref:`2 `.
+      - If no, go to :ref:`1.b `.
+
+   b. .. _alm_12017__en-us_topic_0191813938_li39303074103210:
+
+      Choose **System** > **Threshold Configuration** and change the alarm threshold based on the actual disk usage.
+
+   c. Wait 2 minutes and check whether the alarm is cleared.
+
+      - If yes, no further action is required.
+      - If no, go to :ref:`2 `.
+
+#. .. _alm_12017__en-us_topic_0191813938_li1589085510271:
+
+   Check whether the disk is a system disk.
+
+   a. .. _alm_12017__en-us_topic_0191813938_li4203435103158:
+
+      Go to the MRS cluster details page. In the alarm list on the alarm management tab page, click the row that contains the alarm. In the alarm details, view the host name and disk partition information.
+
+   b. Log in to the node for which the alarm is generated.
+
+   c. Run the **df -h** command to check the system disk partition usage. Check whether the disk is mounted to any of the following directories by using the disk partition name obtained in :ref:`2.a `: **/**, **/boot**, **/home**, **/opt**, **/tmp**, **/var**, **/var/log**, and **/srv/BigData**.
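+
+      For example, the mount point of the partition can be checked as follows. This is a minimal sketch that assumes the partition obtained in :ref:`2.a ` is named **vdb1** (replace it with the actual partition name):
+
+      .. code-block::
+
+         # Show where the partition is mounted; compare the mount point with the system directories listed above.
+         df -h | grep 'vdb1'
+
+      - If yes, the disk is a system disk. Then go to :ref:`3.a `.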
+ - If no, the disk is not a system disk. Then go to :ref:`2.d `. + + d. .. _alm_12017__en-us_topic_0191813938_li22825392103158: + + Run the **df -h** command to check the system disk partition usage. Determine the role of the disk based on the disk partition name obtained in :ref:`2.a `. + + e. Check whether the disk is used by HDFS or Yarn. + + - If yes, expand the disk capacity for the Core node. Then go to :ref:`2.f `. + - If no, go to :ref:`4 `. + + f. .. _alm_12017__en-us_topic_0191813938_li23401589103652: + + Wait 2 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_12017__en-us_topic_0191813938_li1854606410341: + + Check whether large files are written to the disk. + + a. .. _alm_12017__en-us_topic_0191813938_li3904890010377: + + Run the **find / -xdev -size +500M -exec ls -l {} \\;** command to view files larger than 500 MB on the node. Check whether such files are written to the disk. + + - If yes, go to :ref:`3.b `. + - If no, go to :ref:`4 `. + + b. .. _alm_12017__en-us_topic_0191813938_li65656242103715: + + Handle the large files and check whether the alarm is cleared 2 minutes later. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + + c. Expand the disk capacity. + + d. Wait 2 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_12017__en-us_topic_0191813938_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +**Reference** +------------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12018_memory_usage_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12018_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..4be22fe --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12018_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,73 @@ +:original_name: alm_12018.html + +.. _alm_12018: + +ALM-12018 Memory Usage Exceeds the Threshold +============================================ + +Description +----------- + +The system checks the memory usage every 30 seconds and compares the actual memory usage with the threshold. The memory usage has a default threshold. This alarm is generated when the detected memory usage exceeds the threshold. + +This alarm is cleared when the host memory usage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12018 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Processes respond slowly or do not work. + +Possible Causes +--------------- + +Memory configuration cannot meet service requirements. The memory usage reaches the upper threshold. + +Procedure +--------- + +#. Expand the system. + + a. Go to the MRS cluster details page. In the alarm list on the alarm management tab page, click the row that contains the alarm. In the alarm details, view the host address of the alarm. + b. Log in to the node for which the alarm is generated. + c. Run **free -m \| grep Mem\\: \| awk '{printf("%s,", ($3-$6-$7) \* 100 / $2)}'** to check the system memory usage. + d. If the memory usage exceeds the threshold, expand the memory capacity. + e. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12018__en-us_topic_0191813923_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +**Reference** +------------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12027_host_pid_usage_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12027_host_pid_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..366d9e8 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12027_host_pid_usage_exceeds_the_threshold.rst @@ -0,0 +1,106 @@ +:original_name: alm_12027.html + +.. _alm_12027: + +ALM-12027 Host PID Usage Exceeds the Threshold +============================================== + +Description +----------- + +The system checks the PID usage every 30 seconds and compares the actual PID usage with the default threshold. This alarm is generated when the PID usage exceeds the threshold. + +This alarm is cleared when the host PID usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12027 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +No PID is available for new processes and service processes are unavailable. + +Possible Causes +--------------- + +Too many processes are running on the node. You need to increase the value of **pid_max**. The system is abnormal. + +Procedure +--------- + +#. Increase the value of **pid_max**. + + a. On the MRS cluster details page, click the alarm from the real-time alarm list. In the **Alarm Details** area, obtain the IP address of the host for which the alarm is generated. + + b. Log in to the node for which the alarm is generated. + + c. Run the **cat /proc/sys/kernel/pid_max** command to check the value of **pid_max**. + + d. If the PID usage exceeds the threshold, run the following command to double the value of **pid_max**: + + **echo** *New pid_max value* **> /proc/sys/kernel/pid_max** + + Example: + + **echo 65536 > /proc/sys/kernel/pid_max** + + .. note:: + + The maximum value of **pid_max** is as follows: + + - 32-bit OS: **32768** + - 64-bit OS: **4194304** (22nd power of 2) + + e. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12027__en-us_topic_0191813904_li6558431911107: + + Check whether the system environment is abnormal. + + a. Contact the O&M personnel to check whether the operating system is abnormal. + + - If yes, rectify the operating system fault and go to :ref:`2.b `. + - If no, go to :ref:`3 `. + + b. .. _alm_12027__en-us_topic_0191813904_li48344862111230: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_12027__en-us_topic_0191813904_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +**Reference** +------------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12028_number_of_processes_in_the_d_state_on_the_host_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12028_number_of_processes_in_the_d_state_on_the_host_exceeds_the_threshold.rst new file mode 100644 index 0000000..1948ce8 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12028_number_of_processes_in_the_d_state_on_the_host_exceeds_the_threshold.rst @@ -0,0 +1,96 @@ +:original_name: alm_12028.html + +.. 
_alm_12028: + +ALM-12028 Number of Processes in the D State on the Host Exceeds the Threshold +============================================================================== + +Description +----------- + +The system periodically checks the number of D state processes of user **omm** on the host every 30 seconds and compares the number with the threshold. The number of processes in the D state on the host has a default threshold. This alarm is generated when the number of processes in the D state exceeds the threshold. + +This alarm is cleared when the number is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12028 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Excessive system resources are used and the service process responds slowly. + +Possible Causes +--------------- + +The host responds slowly to I/O (disk I/O and network I/O) requests and a process is in the D state. + +**Procedure** +------------- + +#. Check the process that is in the D state. + + a. Go to the MRS cluster details page. In the alarm list on the alarm management tab page, click the row that contains the alarm. In the alarm details, view the address of the host. + + b. Log in to the node for which the alarm is generated. + + c. Run the following commands to switch the user: + + **sudo su - root** + + **su - omm** + + d. Run the following command as user **omm** to view the PID of the process that is in the D state: + + **ps -elf \| grep -v "\\[thread_checkio\\]" \| awk 'NR!=1 {print $2, $3, $4}' \| grep omm \| awk -F' ' '{print $1, $3}' \| grep D \| awk '{print $2}'** + + e. Check whether the command output is empty. + + - If yes, the service process is running properly. Then go to :ref:`1.g `. + - If no, go to :ref:`1.f `. + + f. .. _alm_12028__en-us_topic_0191813960_li3581599211204: + + Switch to user **root** and run the **reboot** command to restart the alarm host. + + Restarting the host brings certain risks. Ensure that the service process runs properly after the restart. + + g. .. _alm_12028__en-us_topic_0191813960_li49836187112011: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12028__en-us_topic_0191813960_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. 
Contact technical support engineers for help. For details, see `technical support `__. + +**Reference** +------------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12031_user_omm_or_password_is_about_to_expire.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12031_user_omm_or_password_is_about_to_expire.rst new file mode 100644 index 0000000..b7fdf5d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12031_user_omm_or_password_is_about_to_expire.rst @@ -0,0 +1,95 @@ +:original_name: alm_12031.html + +.. _alm_12031: + +ALM-12031 User omm or Password Is About to Expire +================================================= + +Description +----------- + +The system starts at 00:00 every day to check whether user **omm** and the password are about to expire every eight hours. This alarm is generated if the user or password is about to expire in 15 days. + +The alarm is cleared when the validity period of user **omm** is changed or the password is reset and the alarm handling is complete. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12031 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The node trust relationship is unavailable and Manager cannot manage the services. + +Possible Causes +--------------- + +User **omm** or the password is about to expire. + +Procedure +--------- + +#. Check whether user **omm** and the password in the system are valid. + + a. Log in to the faulty node. + + b. Run the following command to view the information about user **omm** and the password: + + **chage -l omm** + + c. Check whether the user has expired based on the system message. + + #. View the value of **Password expires** to check whether the password is about to expire. + #. View the value of **Account expires** to check whether the user is about to expire. + + .. note:: + + If the parameter value is **never**, the user and password are valid permanently; if the value is a date, check whether the user and password are about to expire within 15 days. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`2 `. + + d. .. _alm_12031__en-us_topic_0191813902_li2310249112814: + + Run the following command to modify the validity period configuration: + + - Run the following command to set a validity period for user **omm**: + + **chage -E** *'specified date'* **omm** + + - Run the following command to set the number of validity days for user **omm**: + + **chage -M** *'number of days'* **omm** + + e. Check whether the alarm is cleared automatically in the next periodic check. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. 
_alm_12031__en-us_topic_0191813902_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +**Reference** +------------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12032_user_ommdba_or_password_is_about_to_expire.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12032_user_ommdba_or_password_is_about_to_expire.rst new file mode 100644 index 0000000..7dda833 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12032_user_ommdba_or_password_is_about_to_expire.rst @@ -0,0 +1,95 @@ +:original_name: alm_12032.html + +.. _alm_12032: + +ALM-12032 User ommdba or Password Is About to Expire +==================================================== + +Description +----------- + +The system starts at 00:00 every day to check whether user **ommdba** and the password are about to expire every eight hours. This alarm is generated if the user or password is about to expire in 15 days. + +The alarm is cleared when the validity period of user **ommdba** is changed or the password is reset and the alarm handling is complete. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12032 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The OMS database cannot be managed and data cannot be accessed. + +Possible Causes +--------------- + +User **ommdba** or the password is about to expire. + +Procedure +--------- + +#. Check whether user **ommdba** and the password in the system are valid. + + a. Log in to the faulty node. + + b. Run the following command to view the information about user **ommdba** and the password: + + **chage -l ommdba** + + c. Check whether the user has expired based on the system message. + + #. View the value of **Password expires** to check whether the password is about to expire. + #. View the value of **Account expires** to check whether the user is about to expire. + + .. note:: + + If the parameter value is **never**, the user and password are valid permanently; if the value is a date, check whether the user and password are about to expire within 15 days. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`2 `. + + d. .. _alm_12032__en-us_topic_0191813868_li2310249112814: + + Run the following command to modify the validity period configuration: + + - Run the following command to set a validity period for user **ommdba**: + + **chage -E** *'specified date'* **ommdba** + + - Run the following command to set the number of validity days for user **ommdba**: + + **chage -M** *'number of days'* **ommdba** + + e. 
Check whether the alarm is cleared automatically in the next periodic check. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12032__en-us_topic_0191813868_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +**Reference** +------------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12033_slow_disk_fault.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12033_slow_disk_fault.rst new file mode 100644 index 0000000..316cd40 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12033_slow_disk_fault.rst @@ -0,0 +1,57 @@ +:original_name: alm_12033.html + +.. _alm_12033: + +ALM-12033 Slow Disk Fault +========================= + +Description +----------- + +The system runs the **iostat** command every second to monitor the disk I/O indicator. If there are more than 30 times that the **svctm** value is greater than 100 ms in 60 seconds, the disk is faulty and the alarm is generated. + +This alarm is automatically cleared after the disk is replaced. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12033 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +DiskName Specifies the disk for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Service performance deteriorates and service processing capabilities become poor. For example, DBService active/standby synchronization is affected and even the service is unavailable. + +Possible Causes +--------------- + +The disk is aged or has bad sectors. + +Procedure +--------- + +#. Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12034_periodic_backup_failure.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12034_periodic_backup_failure.rst new file mode 100644 index 0000000..95b8402 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12034_periodic_backup_failure.rst @@ -0,0 +1,116 @@ +:original_name: alm_12034.html + +.. 
_alm_12034: + +ALM-12034 Periodic Backup Failure +================================= + +Description +----------- + +This alarm is generated when a periodic backup task fails to be executed. This alarm is cleared when the next backup task is executed successfully. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12034 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +TaskName Specifies the task name. +=========== ======================================================= + +Impact on the System +-------------------- + +No backup package is available for a long time, so the system cannot be restored in case of exceptions. + +Possible Causes +--------------- + +The alarm cause depends on the task details. Handle the alarm according to the logs and alarm details. + +Procedure +--------- + +**Checking whether the disk space is insufficient** + +#. On MRS Manager, choose **Alarms**. + +#. In the alarm list, click |image1| of the alarm and obtain the task name from the **Location** area. + +#. Choose **System** > **Back Up Data**. + +#. Search for the backup task based on the task name and choose **More** > **View History** in the **Operation** column to view detailed information about the backup task. + +#. Choose **Details** > **View** and check whether message "Failed to backup xx due to insufficient disk space, move the data in the /srv/BigData/LocalBackup directory to other directories." exists. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`13 `. + +#. .. _alm_12034__li347718556387: + + Choose **Backup Path** > **View** to obtain the backup path. + +#. Log in to the node as user **root** and view the mounting details of the node. + + **df -h** + +#. Check whether the available space of the node to which the backup path is mounted is less than 20 GB. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`13 `. + +#. .. _alm_12034__li74787554389: + + Check whether the backup package exists in the backup directory and whether the available space of the node to which the backup directory is mounted is less than 20 GB. + + - If yes, go to :ref:`10 `. + - If no, go to :ref:`13 `. + +#. .. _alm_12034__li1847855563815: + + Ensure that the available space of the node to which the backup directory is mounted to be greater than 20 GB by moving backup packages out of the backup directory or deleting the backup packages. + +#. Start the backup task again and check whether the backup task is executed. + + - If yes, go to :ref:`12 `. + - If no, go to :ref:`13 `. + +#. .. _alm_12034__li104790555386: + + After 2 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`13 `. + +**Collecting fault information** + +13. .. _alm_12034__li164761155133811: + + On MRS Manager, choose **System** > **Export Log**. + +14. Contact technical support engineers for help. For details, see `technical support `__. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Reference +--------- + +None + +.. 
|image1| image:: /_static/images/en-us_image_0000001296058020.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12035_unknown_data_status_after_recovery_task_failure.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12035_unknown_data_status_after_recovery_task_failure.rst new file mode 100644 index 0000000..46f558c --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12035_unknown_data_status_after_recovery_task_failure.rst @@ -0,0 +1,90 @@ +:original_name: alm_12035.html + +.. _alm_12035: + +ALM-12035 Unknown Data Status After Recovery Task Failure +========================================================= + +Description +----------- + +If a recovery task fails, the system attempts to automatically roll back. If the rollback fails, data may be lost. If this occurs, an alarm is reported. This alarm is cleared when the recovery task is successfully executed later. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12035 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +TaskName Specifies the task name. +=========== ======================================================= + +Impact on the System +-------------------- + +The data may be lost or the data status may be unknown, which may affect services. + +Possible Causes +--------------- + +The possible cause of this alarm is that the component status does not meet the requirements before the restoration task is executed or an error occurs in a step during the restoration task. The error depends on the task details. You can obtain logs and task details to handle the alarm. + +Procedure +--------- + +**Checking the component status** + +#. Log in to MRS Manager and choose **Services**. On the page that is displayed, check whether the running status of the components meets the requirements. (OMS and DBService must be in the normal status, and other components must be stopped.) + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`2 `. + +#. .. _alm_12035__li124931122116: + + Restore the component status as required and start the recovery task again. + +#. Log in to MRS Manager and choose **Alarms**. In the alarm list, click the row containing the alarm and obtain the task name from the **Location** area. + +#. Choose **System** > **Recovery Management**. Search for the restoration task based on the task name and view the task details. + +#. Start the restoration task and check whether the task is executed. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`7 `. + +#. .. _alm_12035__li106313102516: + + After 2 minutes, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Collecting fault information** + +7. .. _alm_12035__li449212126111: + + On MRS Manager, choose **System** > **Export Log**. + +8. 
Contact technical support engineers for help. For details, see `technical support `__. + +Alarm Clearing +-------------- + +This alarm is automatically cleared after the fault is rectified. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12037_ntp_server_is_abnormal.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12037_ntp_server_is_abnormal.rst new file mode 100644 index 0000000..0b6d3e7 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12037_ntp_server_is_abnormal.rst @@ -0,0 +1,121 @@ +:original_name: alm_12037.html + +.. _alm_12037: + +ALM-12037 NTP Server Is Abnormal +================================ + +Description +----------- + +This alarm is generated when the NTP server is abnormal. + +This alarm is cleared when the NTP server recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12037 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+------------------------------------------------------------------------------+ +| Parameter | Description | ++=============+==============================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+------------------------------------------------------------------------------+ +| HostName | Specifies the IP address of the NTP server for which the alarm is generated. | ++-------------+------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The NTP server configured on the active OMS node is abnormal. In this case, the active OMS node cannot synchronize time with the NTP server and a time offset may be generated in the cluster. + +Possible Causes +--------------- + +- The NTP server network is faulty. +- The NTP server authentication fails. +- The time cannot be obtained from the NTP server. +- The time obtained from the NTP server is not continuously updated. + +Procedure +--------- + +#. Check the NTP server network. + + a. On the MRS cluster details page, click the alarm from the real-time alarm list. + + b. In the **Alarm Details** area, view the additional information to check whether the NTP server fails to be pinged. + + - If yes, go to :ref:`1.c `. + - If no, go to :ref:`2 `. + + c. .. _alm_12037__en-us_topic_0191813878_li1632254917016: + + Contact the O&M personnel to check the network configuration and ensure that the network between the NTP server and the active OMS node is in normal state. Then, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12037__en-us_topic_0191813878_li39341571165349: + + Check whether the NTP server authentication fails. + + a. Log in to the active management node. + b. Run **ntpq -np** to check whether the NTP server authentication fails. If **refid** of the NTP server is **.AUTH.**, the authentication fails. + + - If yes, go to :ref:`5 `. 
+ - If no, go to :ref:`3 `. + +#. .. _alm_12037__en-us_topic_0191813878_li1771406117437: + + Check whether the time can be obtained from the NTP server. + + a. View the alarm additional information to check whether the time cannot be obtained from the NTP server. + + - If yes, go to :ref:`3.b `. + - If no, go to :ref:`4 `. + + b. .. _alm_12037__en-us_topic_0191813878_li3545109317619: + + Contact the O&M personnel to rectify the NTP server fault. After the NTP server is in normal state, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_12037__en-us_topic_0191813878_li2737952217524: + + Check whether the time obtained from the NTP server fails to be updated. + + a. View the alarm additional information to check whether the time obtained from the NTP server fails to be updated. + + - If yes, go to :ref:`4.b `. + - If no, go to :ref:`5 `. + + b. .. _alm_12037__en-us_topic_0191813878_li6014697617721: + + Contact the provider of the NTP server to rectify the NTP server fault. After the NTP server is in normal state, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. .. _alm_12037__en-us_topic_0191813878_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12038_monitoring_indicator_dump_failure.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12038_monitoring_indicator_dump_failure.rst new file mode 100644 index 0000000..2154d3b --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12038_monitoring_indicator_dump_failure.rst @@ -0,0 +1,128 @@ +:original_name: alm_12038.html + +.. _alm_12038: + +ALM-12038 Monitoring Indicator Dump Failure +=========================================== + +Description +----------- + +This alarm is generated when dumping fails after monitoring indicator dumping is configured on MRS Manager. + +This alarm is cleared when dumping is successful. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12038 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The upper-layer management system fails to obtain monitoring indicators from the MRS Manager system. + +Possible Causes +--------------- + +- The server cannot be connected. +- The save path on the server cannot be accessed. +- The monitoring indicator file fails to be uploaded. + +Procedure +--------- + +#. 
Contact the O&M personnel to check whether the network connection between the MRS Manager system and the server is normal. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`2 `. + +#. .. _alm_12038__en-us_topic_0191813891_li51875580143018: + + Contact the O&M personnel to restore the network and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_12038__en-us_topic_0191813891_li17073310143018: + + Choose **System** > **Monitor Dumping Configuration** and check whether the FTP username, password, port, dump mode, and public key configured on the monitoring indicator dumping configuration page are consistent with those on the server. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`4 `. + +#. .. _alm_12038__en-us_topic_0191813891_li52527297143018: + + Enter the correct configuration, click **OK**, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. .. _alm_12038__en-us_topic_0191813891_li33278826143018: + + Choose **System** > **Monitor Dumping Configuration** and check the configuration items, including the FTP username, save path, and dumping mode. + + - If the FTP mode is used, go to :ref:`6 `. + - If the SFTP mode is used, go to :ref:`7 `. + +#. .. _alm_12038__en-us_topic_0191813891_li8535940143050: + + Log in to the server. In the default path, check whether the save path (relative path) has the read and write permission on the FTP username. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`8 `. + +#. .. _alm_12038__en-us_topic_0191813891_li35514800143050: + + Log in to the server. In the default path, check whether the save path (absolute path) has the read and write permission on the FTP username. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`8 `. + +#. .. _alm_12038__en-us_topic_0191813891_li28538792143050: + + Add the read and write permission and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +#. .. _alm_12038__en-us_topic_0191813891_li49122127143428: + + Log in to the server and check whether the save path has sufficient disk space. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`10 `. + +#. .. _alm_12038__en-us_topic_0191813891_li18335278143435: + + Delete unnecessary files or go to the monitoring indicator dumping configuration page to change the save path. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + +#. .. _alm_12038__en-us_topic_0191813891_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12039_gaussdb_data_is_not_synchronized.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12039_gaussdb_data_is_not_synchronized.rst new file mode 100644 index 0000000..546af65 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12039_gaussdb_data_is_not_synchronized.rst @@ -0,0 +1,140 @@ +:original_name: alm_12039.html + +.. 
_alm_12039: + +ALM-12039 GaussDB Data Is Not Synchronized +========================================== + +Description +----------- + +The system checks the data synchronization status between the active and standby GaussDB nodes every 10 seconds. This alarm is generated when the synchronization status cannot be queried for six consecutive times or when the synchronization status is abnormal. + +This alarm is cleared when the data synchronization status is normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12039 Critical Yes +======== ============== ========== + +Parameter +--------- + +=================== ========================================= +Parameter Description +=================== ========================================= +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +Local GaussDB HA IP HA IP address of the local GaussDB. +Peer GaussDB HA IP HA IP address of the peer GaussDB. +SYNC_PERSENT Synchronization percentage. +=================== ========================================= + +Impact on the System +-------------------- + +When data is not synchronized between the active and standby GaussDBs, the data may be lost or abnormal if the active instance becomes abnormal. + +Possible Causes +--------------- + +- The network between the active and standby nodes is unstable. +- The standby GaussDB is abnormal. +- The disk space of the standby node is full. + +Procedure +--------- + +#. Go to the MRS cluster details page. In the alarm list on the alarm management tab page, click the row that contains the alarm. In the alarm details, view the IP address of the standby GaussDB node. + +#. Log in to the active management node. + +#. Run the following command to check whether the standby GaussDB is reachable: + + **ping** *heartbeat IP address of the standby GaussDB* + + If yes, go to :ref:`6 `. + + If no, go to :ref:`4 `. + +#. .. _alm_12039__en-us_topic_0191813916_li1095691144355: + + Contact the O&M personnel to check whether the network is faulty. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _alm_12039__en-us_topic_0191813916_li8186264144355: + + Rectify the network fault and check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +#. .. _alm_12039__en-us_topic_0191813916_li39909275144355: + + Log in to the standby GaussDB node. + +#. Run the following commands to switch the user: + + **sudo su - root** + + **su - omm** + +#. Go to the **${BIGDATA_HOME}/om-0.0.1/sbin/** directory. + + Run the following command to check whether the resource status of the standby GaussDB is normal: + + **sh status-oms.sh** + + In the command output, check whether the following information is displayed in the row where **ResName** is **gaussDB**: + + .. code-block:: + + 10_10_10_231 gaussDB Standby_normal Normal Active_standby + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`15 `. + +9. .. _alm_12039__en-us_topic_0191813916_li58535127144355: + + Log in to the standby GaussDB node. + +10. Run the following commands to switch the user: + + **sudo su - root** + + **su - omm** + +11. Run the **echo ${BIGDATA_DATA_HOME}/dbdata_om** command to obtain the GaussDB data directory. + +12. Run the **df -h** command to check the system disk partition usage. + +13. 
Check whether the disk where the GaussDB data directory is mounted is full. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`15 `. + +14. .. _alm_12039__en-us_topic_0191813916_li31581498144355: + + Contact the O&M personnel to expand the disk capacity. After capacity expansion, wait 2 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`15 `. + +15. .. _alm_12039__en-us_topic_0191813916_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12040_insufficient_system_entropy.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12040_insufficient_system_entropy.rst new file mode 100644 index 0000000..8c53f2b --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12040_insufficient_system_entropy.rst @@ -0,0 +1,89 @@ +:original_name: alm_12040.html + +.. _alm_12040: + +ALM-12040 Insufficient System Entropy +===================================== + +Description +----------- + +The system checks the entropy at 00:00:00 every day and performs five consecutive checks each time. First, the system checks whether the rng-tools tool is enabled and correctly configured. If not, the system checks the current entropy. This alarm is generated if the entropy is less than 500 in the five checks. + +This alarm is cleared if the true random number mode is configured, random numbers are configured in pseudo-random number mode, or neither the true random number mode nor the pseudo-random number mode is configured but the entropy is greater than or equal to 500 in at least one check among the five checks. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12040 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Decryption failures occur and functions related to decryption are affected, for example, DBService installation. + +Possible Causes +--------------- + +The rngd service is abnormal. + +Procedure +--------- + +#. Go to the cluster details page and choose **Alarms**. + +#. View the alarm details to obtain the value of the **HostName** field in **Location**. + +#. Log in to the node for which the alarm is generated and run the **sudo su - root** command to switch to user **root**. + +#. Run the **/bin/rpm -qa \| grep -w "rng-tools"** command. If the command is executed successfully, run the **ps -ef \| grep -v "grep" \| grep rngd \| tr -d " " \| grep "\\-o/dev/random" \| grep "\\-r/dev/urandom"** command and view the command output. 
+ + - If the command is executed successfully, the rngd service is installed, correctly configured, and is running properly. Go to :ref:`8 `. + - If the command is not executed successfully, the rngd service is not running properly. Then go to :ref:`5 `. + +#. .. _alm_12040__en-us_topic_0191813866_li73201122101210: + + Run the following command to start the rngd service: + + **echo 'EXTRAOPTIONS="-r /dev/urandom -o /dev/random"' >> /etc/sysconfig/rngd** + + **service rngd start** + +#. Run the **service rngd status** command to check whether the rngd service is in the running state. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`8 `. + +#. .. _alm_12040__en-us_topic_0191813866_li11437156145229: + + Wait until 00:00:00 when the system checks the entropy again. Check whether the alarm is cleared automatically. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +#. .. _alm_12040__en-us_topic_0191813866_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12041_permission_of_key_files_is_abnormal.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12041_permission_of_key_files_is_abnormal.rst new file mode 100644 index 0000000..acc0617 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12041_permission_of_key_files_is_abnormal.rst @@ -0,0 +1,80 @@ +:original_name: alm_12041.html + +.. _alm_12041: + +ALM-12041 Permission of Key Files Is Abnormal +============================================= + +Description +----------- + +The system checks the permission, users, and user groups of key directories or files every hour. This alarm is generated if any of these is abnormal. + +This alarm is cleared after the problem that causes abnormal permission, users, or user groups is solved. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12041 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +PathName Specifies the file path or file name. +=========== ======================================================= + +Impact on the System +-------------------- + +System functions are unavailable. + +Possible Causes +--------------- + +The user has manually modified the file permission, user information, or user groups, or the system has experienced an unexpected power-off. + +Procedure +--------- + +#. Check the file permission. + + a. Go to the MRS cluster details page and choose **Alarms**. + + b. In the details of the alarm, query the **HostName** (name of the alarmed host) and **PathName** (path or name of the involved file). + + c. Log in to the alarmed node. + + d. 
Run the **ll** *PathName* command to query the current user, permission, and user group of the file or path. + + e. .. _alm_12041__en-us_topic_0191813951_li5250207310354: + + Go to the **${BIGDATA_HOME}/nodeagent/etc/agent/autocheck** directory and run the **vi keyfile** command. Search for the name of the involved file and query the correct permission of the file. + + f. Compare the actual permission of the file with the permission obtained in :ref:`1.e `. If they are different, change the actual permission, user information, and user group to the correct values. + + g. Wait until the next system check is complete and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_12041__en-us_topic_0191813951_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12042_key_file_configurations_are_abnormal.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12042_key_file_configurations_are_abnormal.rst new file mode 100644 index 0000000..71dcc12 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12042_key_file_configurations_are_abnormal.rst @@ -0,0 +1,84 @@ +:original_name: alm_12042.html + +.. _alm_12042: + +ALM-12042 Key File Configurations Are Abnormal +============================================== + +Description +----------- + +The system checks key file configurations every hour. This alarm is generated if any key configuration is abnormal. + +This alarm is cleared after the configuration becomes normal. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12042 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +PathName Specifies the file path or file name. +=========== ======================================================= + +Impact on the System +-------------------- + +Functions related to the file are abnormal. + +Possible Causes +--------------- + +The user has manually modified the file configurations or the system has experienced an unexpected power-off. + +Procedure +--------- + +#. Check the file configurations. + + a. Go to the MRS cluster details page and choose **Alarms**. + b. In the details of the alarm, query the **HostName** (name of the alarmed host) and **PathName** (path or name of the involved file). + c. Log in to the alarmed node. + d. Manually check and modify the file configurations according to the criteria in :ref:`Related Information `. + e. Wait until the next system check is complete and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. 
+ +#. .. _alm_12042__en-us_topic_0191813944_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +.. _alm_12042__en-us_topic_0191813944_section24734811164818: + +Related Information +------------------- + +- **Checking** **/etc/fstab** + + Check whether partitions configured in **/etc/fstab** exist in **/proc/mounts** and whether swap partitions configured in **/etc/fstab** match those in **/proc/swaps**. + +- **Checking** **/etc/hosts** + + Run the **cat /etc/hosts** command. If any of the following situations exists, the file configurations are abnormal. + + - The **/etc/hosts** file does not exist. + - The host name is not configured in the file. + - The IP address of the host is duplicate. + - The IP address of the host does not exist in the **ipconfig** list. + - An IP address in the file is used by multiple hosts. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12043_dns_parsing_duration_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12043_dns_parsing_duration_exceeds_the_threshold.rst new file mode 100644 index 0000000..9112319 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12043_dns_parsing_duration_exceeds_the_threshold.rst @@ -0,0 +1,133 @@ +:original_name: alm_12043.html + +.. _alm_12043: + +ALM-12043 DNS Parsing Duration Exceeds the Threshold +==================================================== + +Description +----------- + +The system checks the DNS parsing duration every 30 seconds. This alarm is generated when the DNS parsing duration exceeds the threshold (the default threshold is 20,000 ms) for multiple times (the default value is **2**). + +You can change the threshold by choosing **System** > **Threshold Configuration** > **Device** > **Host** > **Network Status** > **DNS Resolution Duration** > **DNS Resolution Duration**. + +This alarm is cleared when **hit number** is **1** and the DNS resolution duration is less than or equal to the threshold. This alarm is cleared when **hit number** is not **1** and the DNS resolution duration is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12043 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +- Kerberos-based secondary authentication is slow. +- The ZooKeeper service is abnormal. +- The node is faulty. + +Possible Causes +--------------- + +- The node is configured with the DNS client. +- The node is equipped with the DNS server and the DNS server is started. 
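+
+.. note::
+
+   Before starting the procedure below, you can roughly gauge the DNS resolution duration on the affected node. This is an illustrative check only: it assumes the **time** and **nslookup** tools are available on the node, and the host name and timing shown here are placeholders rather than values from a real cluster.
+
+   .. code-block::
+
+      # time nslookup node-master1.example.com
+      ...
+      real    0m21.254s
+      user    0m0.004s
+      sys     0m0.006s
+
+   A **real** time larger than the threshold (20,000 ms by default) confirms that DNS resolution on the node is slow.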
+ +Procedure +--------- + +**Check whether the node is configured with the DNS client.** + +#. Go to the MRS cluster details page and choose **Alarms**. + +#. View the alarm details to obtain the value of the **HostName** field in **Location**. + +#. Use PuTTY to log in to the node for which the alarm is generated as user **root**. + +#. Run the **cat /etc/resolv.conf** command to check whether the DNS client is installed. + + If information similar to the following is displayed, the DNS client is installed and started: + + .. code-block:: + + namesever 10.2.3.4 + namesever 10.2.3.4 + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`7 `. + +#. .. _alm_12043__en-us_topic_0191813958_en-us_topic_0087039385_li29935381112614: + + Run the **vi /etc/resolv.conf** command to comment out the following content using the number signs (#) and save the file: + + .. code-block:: + + # namesever 10.2.3.4 + # namesever 10.2.3.4 + +#. Check whether this alarm is cleared after 5 minutes. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Check whether the node is equipped with the DNS server and the DNS server is started.** + +7. .. _alm_12043__en-us_topic_0191813958_en-us_topic_0087039385_li39250560112614: + + Run the **service named status** command to check whether the DNS service is installed on the node. + + If information similar to the following is displayed, the DNS server is installed and started: + + .. code-block:: + + Checking for nameserver BIND + version: 9.6-ESV-R7-P4 + CPUs found: 8 + worker threads: 8 + number of zones: 17 + debug level: 0 + xfers running: 0 + xfers deferred: 0 + soa queries in progress: 0 + query logging is ON + recursive clients: 4/0/1000 + tcp clients: 0/100 + server is up and running + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`10 `. + +8. .. _alm_12043__en-us_topic_0191813958_en-us_topic_0087039385_li25178791112614: + + Run the **service named stop** command to stop the DNS server. + +9. Check whether this alarm is cleared after 5 minutes. + + - If yes, no further action is required. + - If no, go to :ref:`10 `. + +10. .. _alm_12043__en-us_topic_0191813958_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12045_read_packet_dropped_rate_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12045_read_packet_dropped_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..800ee27 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12045_read_packet_dropped_rate_exceeds_the_threshold.rst @@ -0,0 +1,274 @@ +:original_name: alm_12045.html + +.. _alm_12045: + +ALM-12045 Read Packet Dropped Rate Exceeds the Threshold +======================================================== + +Description +----------- + +The system checks the read packet dropped rate every 30 seconds. This alarm is generated when the read packet dropped rate exceeds the threshold (the default threshold is 0.5%) for multiple times (the default value is **5**). 
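+
+.. note::
+
+   The rate is calculated from the receive-side packet and drop counters of the network port. The following is an illustrative example only; the interface name **eth0** and the figures are placeholders, not values from a real cluster:
+
+   .. code-block::
+
+      # cat /proc/net/dev | grep eth0
+        eth0: 1253495632 8763211    0 43816    0     0          0         0  987654321 7654321    0    0    0     0       0          0
+
+   The fourth receive-side field after the interface name is the number of dropped read packets; 43816 drops out of 8763211 received packets corresponds to a read packet dropped rate of about 0.5%.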
+ +You can change the threshold by choosing **System** > **Threshold Configuration** > **Device** > **Host** > **Network Reading** > **Network Read Packet Rate Information** > **Read Packet Dropped Rate**. + +This alarm is cleared when **hit number** is 1 and the read packet dropped rate is less than or equal to the threshold. This alarm is cleared when **hit number** is greater than 1 and the read packet dropped rate is less than or equal to 90% of the threshold. + +The alarm detection is disabled by default. If you want to enable this function, check whether this function can be enabled based on Checking System Environments. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12045 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+--------------------------------------------------------------+ +| Parameter | Description | ++===================+==============================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| NetworkCardName | Specifies the network port for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+--------------------------------------------------------------+ + +Impact on the System +-------------------- + +The service performance deteriorates or some services time out. + +Risk warning: In SUSE kernel 3.0 or later or Red Hat 7.2, the system kernel modifies the mechanism for counting the number of dropped read packets. In this case, this alarm may be generated even if the network is running properly, but services are not affected. You are advised to check the system environment first. + +Possible Causes +--------------- + +- An OS exception occurs. +- The NICs are bonded in active/standby mode. +- The alarm threshold is improperly configured. +- The network environment is abnormal. + +Procedure +--------- + +**View the network packet dropped rate.** + +#. Use PuTTY to log in to any non-alarm node in the cluster as user **omm** and run the **ping** *IP address of the node for which the alarm is generated* **-c 100** command to check whether packet drop occurs on the network. + + .. code-block:: + + # ping 10.10.10.12 -c 5 + PING 10.10.10.12 (10.10.10.12) 56(84) bytes of data. + 64 bytes from 10.10.10.11: icmp_seq=1 ttl=64 time=0.033 ms + 64 bytes from 10.10.10.11: icmp_seq=2 ttl=64 time=0.034 ms + 64 bytes from 10.10.10.11: icmp_seq=3 ttl=64 time=0.021 ms + 64 bytes from 10.10.10.11: icmp_seq=4 ttl=64 time=0.033 ms + 64 bytes from 10.10.10.11: icmp_seq=5 ttl=64 time=0.030 ms + --- 10.10.10.12 ping statistics --- + 5 packets transmitted, 5 received, 0% packet loss, time 4001ms rtt min/avg/max/mdev = 0.021/0.030/0.034/0.006 ms + + .. 
note:: + + - *IP address of the node for which the alarm is generated*: Query the IP address of the node for which the alarm is generated on the node management page of the MRS cluster details page based on the value of **HostName** in the alarm location information. Check both the IP addresses of the management plane and service plane. + - **-c**: number of check times. The default value is **100**. + + - If yes, go to :ref:`11 `. + - If no, go to :ref:`2 `. + +**Check the system environment.** + +2. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li6542838717657: + + Use PuTTY to log in to the active OMS node or the node for which the alarm is generated as user **omm**. + +3. Run the **cat /etc/*-release** command to check the OS type. + + - If the OS is EulerOS, go to :ref:`4 `. + + .. code-block:: + + # cat /etc/*-release + EulerOS release 2.0 (SP2) + EulerOS release 2.0 (SP2) + + - If the OS is SUSE, go to :ref:`5 `. + + .. code-block:: + + # cat /etc/*-release + SUSE Linux Enterprise Server 11 (x86_64) + VERSION = 11 + PATCHLEVEL = 3 + + - Otherwise, go to :ref:`11 `. + +4. .. _alm_12045__en-us_topic_0191813872_li55780683112557: + + Run the **cat /etc/euleros-release** command to check whether the OS version is EulerOS 2.2. + + .. code-block:: + + # cat /etc/euleros-release + EulerOS release 2.0 (SP2) + + - If yes, the alarm sending function cannot be enabled. Go to :ref:`6 `. + - If no, go to :ref:`11 `. + +5. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li42309040172040: + + Run the **cat /proc/version** command to check whether the SUSE kernel version is 3.0 or later. + + .. code-block:: + + # cat /proc/version + Linux version 3.0.101-63-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Tue Jun 23 16:02:31 UTC 2015 (4b89d0c) + + - If yes, the alarm sending function cannot be enabled. Go to :ref:`6 `. + - If no, go to :ref:`11 `. + +6. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li43950618195120: + + Log in to MRS Manager and choose **System** > **Configuration** > **Threshold Configuration**. + +7. In the navigation pane of the **Threshold Configuration** page, choose **Network Reading** > **Network Read Packet Rate Information** > **Read Packet Dropped Rate**. In the right pane, check whether **Send Alarm** is selected. + + - If yes, the alarm sending function is enabled. Go to :ref:`8 `. + - If no, the alarm sending function is disabled. Go to :ref:`10 `. + +8. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li38517503111027: + + In the right pane, deselect **Send Alarm** to shield alarm "Network Read Packet Dropped Rate Exceeds the Threshold." + +9. Go to the MRS cluster details page and choose **Alarms**. + +10. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li16613085112024: + + Search for alarm 12045 and manually clear the alarms that are not automatically cleared. No further action is required. + + .. note:: + + The ID of alarm Network Read Packet Dropped Rate Exceeds the Threshold is 12045. + +**Check whether the NICs are bonded in active/standby mode.** + +11. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li4196511811134: + + Use PuTTY to log in to the node for which the alarm is generated as user **omm** and run the **ls -l /proc/net/bonding** command to check whether the **/proc/net/bonding** directory exists on the node. + + - If yes, as shown in the following output, the bond mode is configured for the node. Go to :ref:`12 `. + + .. 
code-block:: + + # ls -l /proc/net/bonding/ + total 0 + -r--r--r-- 1 root root 0 Oct 11 17:35 bond0 + + - If no, the bond mode is not configured for the node. Go to :ref:`14 `. + + .. code-block:: + + # ls -l /proc/net/bonding/ + ls: cannot access /proc/net/bonding/: No such file or directory + +12. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li56651960171744: + + Run the **cat /proc/net/bonding/bond0** command to check whether the value of **Bonding Mode** in the configuration file is **fault-tolerance**. + + .. note:: + + In the preceding command, **bond0** is the name of the bond configuration file. Use the file name obtained in :ref:`11 `. + + .. code-block:: + + # cat /proc/net/bonding/bond0 + Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011) + + Bonding Mode: fault-tolerance (active-backup) + Primary Slave: eth1 (primary_reselect always) + Currently Active Slave: eth1 + MII Status: up + MII Polling Interval (ms): 100 + Up Delay (ms): 0 + Down Delay (ms): 0 + + Slave Interface: eth0 + MII Status: up + Speed: 1000 Mbps + Duplex: full + Link Failure Count: 1 + Slave queue ID: 0 + + Slave Interface: eth1 + MII Status: up + Speed: 1000 Mbps + Duplex: full + Link Failure Count: 1 + Slave queue ID: 0 + + - If yes, the NICs are bonded in active/standby mode. Go to :ref:`13 `. + - If no, go to :ref:`14 `. + +13. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li44376005172456: + + Check whether the NIC specified by **NetworkCardName** in the alarm details is the standby NIC. + + - If yes, the alarm of the standby NIC cannot be automatically cleared. Manually clear the alarm on the alarm management page. No further action is required. + - If no, go to :ref:`14 `. + + .. note:: + + To determine the standby NIC, check the **/proc/net/bonding/bond0** configuration file. If the NIC name corresponding to **NetworkCardName** is **Slave Interface** but not **Currently Active Slave** (the current active NIC), the NIC is the standby one. + +**Check whether the threshold is set properly.** + +14. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li61276131112834: + + Log in to MRS Manager and check whether the threshold (configurable, 0.5% by default) is appropriate. + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`15 `. + +15. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li47653126112834: + + Choose **System** > **Threshold Configuration** > **Device** > **Host** > **Network Reading** > **Network Read Packet Rate Information** > **Read Packet Dropped Rate** and change the alarm threshold based on the actual service usage. + +16. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`17 `. + +**Check whether the network is normal.** + +17. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li56023883112834: + + Contact the system administrator to check whether the network is normal. + + - If yes, rectify the network fault and go to :ref:`18 `. + - If no, go to :ref:`19 `. + +18. .. _alm_12045__en-us_topic_0191813872_en-us_topic_0087039254_li4503547112834: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`19 `. + +19. .. _alm_12045__en-us_topic_0191813872_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
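+
+In addition to the checks above, per-port drop counters can help confirm whether packets are actually being dropped before you contact support. This is only a sketch: it assumes the **ethtool** tool is available, and the interface name **eth0** and the counter names (which vary with the NIC driver) are examples.
+
+.. code-block::
+
+   # ethtool -S eth0 | grep -i drop
+        rx_dropped: 0
+        tx_dropped: 0
+        rx_queue_0_drops: 0
+
+If these counters stay at zero while the alarm persists, the drops are more likely an artifact of the kernel counting mechanism described in the risk warning (or of a standby bonded NIC) than a real network fault.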
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12046_write_packet_dropped_rate_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12046_write_packet_dropped_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..ddbb030 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12046_write_packet_dropped_rate_exceeds_the_threshold.rst @@ -0,0 +1,99 @@ +:original_name: alm_12046.html + +.. _alm_12046: + +ALM-12046 Write Packet Dropped Rate Exceeds the Threshold +========================================================= + +Description +----------- + +The system checks the write packet dropped rate every 30 seconds. This alarm is generated when the write packet dropped rate exceeds the threshold (the default threshold is 0.5%) for multiple times (the default value is **5**). + +You can change the threshold by choosing **System** > **Threshold Configuration** > **Device** > **Host** > **Network Writing** > **Network Write Packet Rate Information** > **Write Packet Dropped Rate**. + +When the **hit number** is **1**, this alarm is cleared when the network write packet dropped rate is less than or equal to the threshold. When the **hit number** is greater than **1**, this alarm is cleared when the network write packet dropped rate is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12046 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+--------------------------------------------------------------+ +| Parameter | Description | ++===================+==============================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| NetworkCardName | Specifies the network port for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+--------------------------------------------------------------+ + +Impact on the System +-------------------- + +The service performance deteriorates or some services time out. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The network environment is abnormal. + +Procedure +--------- + +**Check whether the threshold is set properly.** + +#. Log in to MRS Manager and check whether the threshold (configurable, 0.5% by default) is appropriate. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 `. + +#. .. 
_alm_12046__en-us_topic_0191813915_en-us_topic_0087039332_li5699560811450: + + Choose **System** > **Threshold Configuration** > **Device** > **Host** > **Network Write Information** > **Network Write Packet Rate** > **Write Packet Dropped Rate** and change the alarm threshold based on the actual service usage. + +#. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the network is normal.** + +4. .. _alm_12046__en-us_topic_0191813915_en-us_topic_0087039332_li4369794811450: + + Contact the system administrator to check whether the network is normal. + + - If yes, rectify the network fault and go to :ref:`5 `. + - If no, go to :ref:`6 `. + +5. .. _alm_12046__en-us_topic_0191813915_en-us_topic_0087039332_li6056359711450: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +6. .. _alm_12046__en-us_topic_0191813915_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12047_read_packet_error_rate_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12047_read_packet_error_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..55f8d18 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12047_read_packet_error_rate_exceeds_the_threshold.rst @@ -0,0 +1,99 @@ +:original_name: alm_12047.html + +.. _alm_12047: + +ALM-12047 Read Packet Error Rate Exceeds the Threshold +====================================================== + +Description +----------- + +The system checks the read packet error rate every 30 seconds. This alarm is generated when the read packet error rate exceeds the threshold (the default threshold is **0.5%**) for multiple times (the default value is **5**). + +You can change the threshold by choosing **System** > **Threshold Configuration** > **Device** > **Host** > **Network Reading** > **Network Read Packet Rate Information** > **Read Packet Error Rate**. + +If the **hit number** is **1**, this alarm is cleared when the read packet error rate is less than or equal to the threshold. If the **hit number** is greater than **1**, this alarm is cleared when the read packet error rate is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12047 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+--------------------------------------------------------------+ +| Parameter | Description | ++===================+==============================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------+--------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| NetworkCardName | Specifies the network port for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+--------------------------------------------------------------+ + +Impact on the System +-------------------- + +The communication is intermittently interrupted, and services time out. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The network environment is abnormal. + +Procedure +--------- + +**Check whether the threshold is set properly.** + +#. Log in to MRS Manager and check whether the threshold (configurable, 0.5% by default) is appropriate. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 `. + +#. .. _alm_12047__en-us_topic_0191813926_en-us_topic_0087039343_li18938060144325: + + Choose **System** > **Threshold Configuration** > **Device** > **Host** > **Network Reading** > **Network Read Packet Rate Information** > **Read Packet Error Rate** and change the alarm threshold based on the actual service usage. + +#. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the network is normal.** + +4. .. _alm_12047__en-us_topic_0191813926_en-us_topic_0087039343_li47122569144325: + + Contact the system administrator to check whether the network is normal. + + - If yes, rectify the network fault and go to :ref:`5 `. + - If no, go to :ref:`6 `. + +5. .. _alm_12047__en-us_topic_0191813926_en-us_topic_0087039343_li52164171144325: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +6. .. _alm_12047__en-us_topic_0191813926_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12048_write_packet_error_rate_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12048_write_packet_error_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..ff463ba --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12048_write_packet_error_rate_exceeds_the_threshold.rst @@ -0,0 +1,99 @@ +:original_name: alm_12048.html + +.. _alm_12048: + +ALM-12048 Write Packet Error Rate Exceeds the Threshold +======================================================= + +Description +----------- + +The system checks the write packet error rate every 30 seconds. This alarm is generated when the write packet error rate exceeds the threshold (the default threshold is **0.5%**) for multiple times (the default value is **5**). 
+ +You can change the threshold by choosing **System** > **Threshold Configuration** > **Device** > **Host** > **Network Writing** > **Network Write Packet Rate Information** > **Write Packet Error Rate**. + +If **hit number** is **1**, this alarm is cleared when the write packet error rate is less than or equal to the threshold. If **hit number** is greater than **1**, this alarm is cleared when the write packet error rate is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12048 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+--------------------------------------------------------------+ +| Parameter | Description | ++===================+==============================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| NetworkCardName | Specifies the network port for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+--------------------------------------------------------------+ + +Impact on the System +-------------------- + +The communication is intermittently interrupted, and services time out. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The network environment is abnormal. + +Procedure +--------- + +**Check whether the threshold is set properly.** + +#. Log in to MRS Manager and check whether the threshold (configurable, 0.5% by default) is appropriate. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 `. + +#. .. _alm_12048__en-us_topic_0191813897_en-us_topic_0087039297_li15963175145357: + + Choose **System** > **Threshold Configuration** > **Device** > **Host** > **Network Writing** > **Network Write Packet Rate Information** > **Write Packet Error Rate** and change the alarm threshold based on the actual service usage. + +#. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the network is normal.** + +4. .. _alm_12048__en-us_topic_0191813897_en-us_topic_0087039297_li12888339145357: + + Contact the system administrator to check whether the network is normal. + + - If yes, rectify the network fault and go to :ref:`5 `. + - If no, go to :ref:`6 `. + +5. .. _alm_12048__en-us_topic_0191813897_en-us_topic_0087039297_li60279330145357: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +6. .. _alm_12048__en-us_topic_0191813897_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
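+
+If the alarm persists after the threshold and network checks, per-port error counters can help narrow down the fault before you contact support. This is only a sketch: it assumes the **ethtool** tool is available, and the interface name **eth0** and the counter names (which vary with the NIC driver) are examples.
+
+.. code-block::
+
+   # ethtool -S eth0 | grep -i err
+        tx_errors: 0
+        rx_errors: 0
+        rx_crc_errors: 0
+
+Continuously increasing transmit error counters usually point to a physical problem, such as a faulty cable, optical module, or switch port, rather than to an operating system issue.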
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12049_read_throughput_rate_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12049_read_throughput_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..1be5b43 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12049_read_throughput_rate_exceeds_the_threshold.rst @@ -0,0 +1,104 @@ +:original_name: alm_12049.html + +.. _alm_12049: + +ALM-12049 Read Throughput Rate Exceeds the Threshold +==================================================== + +Description +----------- + +The system checks the read throughput rate every 30 seconds. This alarm is generated when the read throughput rate exceeds the threshold (the default threshold is **80%**) for multiple times (the default value is **5**). + +You can change the threshold by choosing **System** > **Threshold Configuration** > **Device** > **Host** > **Network Reading** > **Network Read Throughput Rate** > **Read Throughput Rate**. + +If the **hit number** is **1**, this alarm is cleared when the read throughput rate is less than or equal to the threshold. If the **hit number** is greater than **1**, this alarm is cleared when the read throughput rate is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12049 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+--------------------------------------------------------------+ +| Parameter | Description | ++===================+==============================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| NetworkCardName | Specifies the network port for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+--------------------------------------------------------------+ + +Impact on the System +-------------------- + +The service system runs abnormally or is unavailable. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The network port rate does not meet service requirements. + +Procedure +--------- + +**Check whether the threshold is set properly.** + +#. Log in to MRS Manager and check whether the threshold (configurable, 80% by default) is appropriate. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`4 `. + +#. .. 
_alm_12049__en-us_topic_0191813925_en-us_topic_0087039310_li5311586145835: + + Choose **System** > **Threshold Configuration** > **Device** > **Host** > **Network Reading** > **Network Read Throughput Rate** > **Read Throughput Rate** to change the alarm threshold based on the actual service usage. + +#. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the network port rate meets the requirements.** + +4. .. _alm_12049__en-us_topic_0191813925_en-us_topic_0087039310_li17726490145835: + + In the real-time alarm list, click the alarm. In the **Alarm Details** area, obtain the IP address and network port name of the host for which the alarm is generated. + +5. Use PuTTY to log in to the host for which the alarm is generated as user **root**. + +6. Run the **ethtool** *network port name* command to check the maximum network port rate **Speed**. + + .. note:: + + In a VM environment, you may fail to obtain the network port rate by running commands. You are advised to contact the system administrator to check whether the network port rate meets the requirements. + +7. If the read throughput rate exceeds the threshold, contact the system administrator to increase the network port rate. + +8. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +9. .. _alm_12049__en-us_topic_0191813925_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12050_write_throughput_rate_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12050_write_throughput_rate_exceeds_the_threshold.rst new file mode 100644 index 0000000..51bbf49 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12050_write_throughput_rate_exceeds_the_threshold.rst @@ -0,0 +1,104 @@ +:original_name: alm_12050.html + +.. _alm_12050: + +ALM-12050 Write Throughput Rate Exceeds the Threshold +===================================================== + +Description +----------- + +The system checks the write throughput rate every 30 seconds. This alarm is generated when the write throughput rate exceeds the threshold (the default threshold is **80%**) for multiple times (the default value is **5**). + +You can change the threshold by choosing **System** > **Threshold Configuration** > **Device** > **Host** > **Network Writing** > **Network Write Throughput Rate** > **Write Throughput Rate**. + +If the **hit number** is **1**, this alarm is cleared when the write throughput rate is less than or equal to the threshold. If the **hit number** is greater than **1**, this alarm is cleared when the write throughput rate is less than or equal to 90% of the threshold. 
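+
+.. note::
+
+   The write throughput rate compares the transmit rate of the network port with the maximum speed of the port. As an illustrative example only (it assumes the **sar** tool from the sysstat package is installed, and the interface name and figures are placeholders), the current transmit rate can be observed as follows:
+
+   .. code-block::
+
+      # sar -n DEV 1 3 | grep eth0
+      05:10:01 PM   eth0   2345.00   9876.00   1890.45   98543.21   0.00   0.00   0.00
+
+   The **txkB/s** column (98543.21 KB/s here, roughly 790 Mbit/s) can then be compared with the port speed reported by **ethtool** to estimate the write throughput rate.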
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12050 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+--------------------------------------------------------------+ +| Parameter | Description | ++===================+==============================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| NetworkCardName | Specifies the network port for which the alarm is generated. | ++-------------------+--------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+--------------------------------------------------------------+ + +Impact on the System +-------------------- + +The service system runs abnormally or is unavailable. + +Possible Causes +--------------- + +- The alarm threshold is improperly configured. +- The network port rate does not meet service requirements. + +Procedure +--------- + +**Check whether the threshold is set properly.** + +#. Log in to MRS Manager and check whether the threshold (configurable, 80% by default) is appropriate. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`2 `. + +#. .. _alm_12050__en-us_topic_0191813900_en-us_topic_0087039440_li4243330315441: + + Choose **System** > **Threshold Configuration** > **Device** > **Host** > **Network Writing** > **Network Write Throughput Rate** > **Write Throughput Rate** to change the alarm threshold based on the actual service usage. + +#. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +**Check whether the network port rate meets the requirements.** + +4. .. _alm_12050__en-us_topic_0191813900_en-us_topic_0087039440_li4094243815441: + + In the real-time alarm list, click the alarm. In the **Alarm Details** area, obtain the IP address and network port of the host for which the alarm is generated. + +5. Use PuTTY to log in to the host for which the alarm is generated as user **root**. + +6. Run the **ethtool** *network port name* command to check the maximum network port rate **Speed**. + + .. note:: + + In a VM environment, you may fail to obtain the network port rate by running commands. You are advised to contact the system administrator to check whether the network port rate meets the requirements. + +7. If the write throughput rate exceeds the threshold, contact the system administrator to increase the network port rate. + +8. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +9. .. _alm_12050__en-us_topic_0191813900_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
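+
+The following output is only an illustration of the **ethtool** check described in the procedure; the interface name **eth0** is an example, and in a VM environment the speed may not be reported at all.
+
+.. code-block::
+
+   # ethtool eth0
+   Settings for eth0:
+           Speed: 1000Mb/s
+           Duplex: Full
+           Link detected: yes
+
+If the sustained write throughput approaches the reported speed, work with the system administrator to plan a NIC or network upgrade.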
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12051_disk_inode_usage_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12051_disk_inode_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..f7f67e8 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12051_disk_inode_usage_exceeds_the_threshold.rst @@ -0,0 +1,105 @@ +:original_name: alm_12051.html + +.. _alm_12051: + +ALM-12051 Disk Inode Usage Exceeds the Threshold +================================================ + +Description +----------- + +The system checks the disk inode usage every 30 seconds. This alarm is generated when the disk inode usage exceeds the threshold (the default threshold is 80%) for multiple times (the default value is 5). + +You can change the threshold by choosing **System** > **Threshold Configuration** > **Device** > **Host** > **Disk** > **Disk Inode Usage** > **Disk Inode Usage**. + +If the **hit number** is **1**, this alarm is cleared when the disk inode usage is less than or equal to the threshold. If the **hit number** is greater than **1**, this alarm is cleared when the disk inode usage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12051 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+----------------------------------------------------------------+ +| Parameter | Description | ++===================+================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+----------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+----------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+----------------------------------------------------------------+ +| PartitionName | Specifies the disk partition for which the alarm is generated. | ++-------------------+----------------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+----------------------------------------------------------------+ + +Impact on the System +-------------------- + +Data cannot be written to the file system. + +Possible Causes +--------------- + +- There are too many small files on the disk. +- The system is abnormal. + +Procedure +--------- + +**There are too many small files on the disk.** + +#. Go to the MRS cluster details page and choose **Alarms**. + +#. In the real-time alarm list, click the alarm. In the **Alarm Details** area, obtain the IP address and disk partitions of the host for which the alarm is generated. + +#. Use PuTTY to log in to the host for which the alarm is generated as user **root**. + +#. Run the **df -i** *partition name* command to check the current inode usage of the disk. + +#. 
If the inode usage exceeds the threshold, manually check whether the small files in the partition can be deleted. + + - If yes, delete the files and go to :ref:`6 `. + - If no, adjust the capacity. Then go to :ref:`7 `. + +#. .. _alm_12051__en-us_topic_0191813886_en-us_topic_0087039294_li4609093115844: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Check whether the system environment is normal.** + +7. .. _alm_12051__en-us_topic_0191813886_en-us_topic_0087039294_li946980415844: + + Contact the operating system maintenance personnel to check whether the system environment is abnormal. + + - If yes, rectify the operating system fault and go to :ref:`8 `. + - If no, go to :ref:`9 `. + +8. .. _alm_12051__en-us_topic_0191813886_en-us_topic_0087039294_li1457809415844: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +9. .. _alm_12051__en-us_topic_0191813886_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12052_usage_of_temporary_tcp_ports_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12052_usage_of_temporary_tcp_ports_exceeds_the_threshold.rst new file mode 100644 index 0000000..3cd2ce7 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12052_usage_of_temporary_tcp_ports_exceeds_the_threshold.rst @@ -0,0 +1,125 @@ +:original_name: alm_12052.html + +.. _alm_12052: + +ALM-12052 Usage of Temporary TCP Ports Exceeds the Threshold +============================================================ + +Description +----------- + +The system checks the usage of temporary TCP ports every 30 seconds. This alarm is generated when the usage of temporary TCP ports exceeds the threshold (the default threshold is **80%**) for multiple times (the default value is **5**). + +You can change the threshold by choosing **System** > **Threshold Configuration** > **Host** > **Network Status** > **TCP Ephemeral Port Usage** > **TCP Ephemeral Port Usage**. + +If the **hit number** is **1**, this alarm is cleared when the usage of temporary TCP ports is less than or equal to the threshold. If the **hit number** is greater than **1**, this alarm is cleared when the usage of temporary TCP ports is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12052 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Services on the host fail to establish connections with the external and services are interrupted. + +Possible Causes +--------------- + +- The temporary ports do not meet service requirements. +- The system is abnormal. + +Procedure +--------- + +**Expand the range of temporary ports.** + +#. Go to the MRS cluster details page and choose **Alarms**. + +#. In the real-time alarm list, click the alarm. In the **Alarm Details** area, obtain the IP address of the host for which the alarm is generated. + +#. Use PuTTY to log in to the host for which the alarm is generated as user **omm**. + +#. Run the **cat /proc/sys/net/ipv4/ip_local_port_range \|cut -f 1** command to obtain the start port number. Run the **cat /proc/sys/net/ipv4/ip_local_port_range \|cut -f 2** command to obtain the end port number. Subtract the start port number from the end port number to obtain the total number of temporary ports. If the total number of temporary ports is less than 28,232, the random port range of the OS is too small. In this case, contact the system administrator to expand the port range. + +#. Run the **ss -ant 2>/dev/null \| grep -v LISTEN \| awk 'NR > 2 {print $4}'|cut -d ':' -f 2 \| awk '$1 >"**\ *start port number*\ **" {print $1}' \| sort -u \| wc -l** command to calculate the number of used temporary ports. + +#. Calculate the usage of temporary ports using the following formula: Usage of temporary ports = (Number of used temporary ports/Total number of temporary ports) x 100. Check whether the usage exceeds the threshold. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`7 `. + +#. .. _alm_12052__en-us_topic_0191813905_en-us_topic_0087039394_li5574623151245: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`8 `. + +**Check whether the system environment is normal.** + +8. .. _alm_12052__en-us_topic_0191813905_en-us_topic_0087039394_li62811777151245: + + Run the following command to import the temporary file and view the frequently used ports in the **port_result.txt** file: + + **netstat -tnp > $BIGDATA_HOME/tmp/port_result.txt** + + .. code-block:: + + netstat -tnp + + Active Internet connections (w/o servers) + + Proto Recv Send LocalAddress ForeignAddress State PID/ProgramName tcp 0 0 10-120-85-154:45433 10-120-8:25009 CLOSE_WAIT 94237/java + tcp 0 0 10-120-85-154:45434 10-120-8:25009 CLOSE_WAIT 94237/java + tcp 0 0 10-120-85-154:45435 10-120-8:25009 CLOSE_WAIT 94237/java + ... + +9. Run the following command to check the processes that occupy a large number of ports: + + **ps -ef \|grep** *PID* + + .. note:: + + - *PID* indicates the process ID of the port queried in :ref:`8 `. + + - Run the following command to collect information about all processes in the system and check the processes that occupy a large number of ports: + + **ps -ef > $BIGDATA_HOME/tmp/ps_result.txt** + +10. Contact the system administrator to clear the processes that occupy a large number of ports. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. 
+ - If no, go to :ref:`11 `. + +11. .. _alm_12052__en-us_topic_0191813905_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12053_file_handle_usage_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12053_file_handle_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..b96ad2a --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12053_file_handle_usage_exceeds_the_threshold.rst @@ -0,0 +1,93 @@ +:original_name: alm_12053.html + +.. _alm_12053: + +ALM-12053 File Handle Usage Exceeds the Threshold +================================================= + +Description +----------- + +The system checks the file handle usage every 30 seconds. This alarm is generated when the handle usage exceeds the threshold (the default threshold is **80%**) for multiple times (the default value is **5**). + +You can change the threshold by choosing **System** > **Threshold Configuration** > **Device** > **Host** > **Host Status** > **Host File Handle Usage** > **Host File Handle Usage**. + +If the **hit number** is **1**, this alarm is cleared when the file handle usage is less than or equal to the threshold. If the **hit number** is greater than **1**, this alarm is cleared when the file handle usage is less than or equal to 90% of the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12053 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +System applications fail to open files, access networks, or perform other I/O operations, and therefore cannot run properly. + +Possible Causes +--------------- + +- The number of file handles does not meet service requirements. +- The system is abnormal. + +Procedure +--------- + +**Increase the number of file handles.** + +#. Go to the MRS cluster details page and choose **Alarms**. +#. In the real-time alarm list, click the alarm. In the **Alarm Details** area, obtain the IP address of the host for which the alarm is generated. +#. Use PuTTY to log in to the host for which the alarm is generated as user **root**. +#. 
Run the **ulimit -n** command to check the maximum number of handles set in the system. +#. If the file handle usage exceeds the threshold, contact the system administrator to increase the number of system file handles. +#. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +**Check whether the system environment is normal.** + +7. .. _alm_12053__en-us_topic_0191813957_en-us_topic_0087039427_li6831599151742: + + Contact the system administrator to check whether the OS is abnormal. + + - If yes, rectify the operating system fault and go to :ref:`8 `. + - If no, go to :ref:`9 `. + +8. .. _alm_12053__en-us_topic_0191813957_en-us_topic_0087039427_li2777630151742: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +9. .. _alm_12053__en-us_topic_0191813957_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12054_the_certificate_file_is_invalid.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12054_the_certificate_file_is_invalid.rst new file mode 100644 index 0000000..a29501d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12054_the_certificate_file_is_invalid.rst @@ -0,0 +1,135 @@ +:original_name: alm_12054.html + +.. _alm_12054: + +ALM-12054 The Certificate File Is Invalid +========================================= + +Description +----------- + +The system checks whether the certificate file is invalid (has expired or is not yet valid) on 23:00 every day. This alarm is generated when the certificate file is invalid. + +This alarm is cleared when the status of the newly imported certificate is valid. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12054 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The system reminds users that the certificate file is invalid. If the certificate file is invalid, some functions are restricted and cannot be used properly. + +Possible Causes +--------------- + +No certificate (HA root certificate or HA user certificate) is imported to the system, the certificate fails to be imported, or the certificate file is invalid. + +Procedure +--------- + +**Check the alarm cause.** + +#. Go to the MRS cluster details page and choose **Alarms**. + +#. In the real-time alarm list, click the row that contains the alarm. + + In the **Alarm Details** area, view the additional information about the alarm. 
+ + - If **CA Certificate** is displayed in the additional alarm information, use PuTTY to log in to the active OMS management node as user **omm** and go to :ref:`3 `. + - If **HA root Certificate** is displayed in the additional information, check **Location** to obtain the name of the host involved in this alarm. Then use PuTTY to log in to the host as user **omm** and go to :ref:`4 `. + - If **HA server Certificate** is displayed in the additional information, check **Location** to obtain the name of the host involved in this alarm. Then use PuTTY to log in to the host as user **omm** and go to :ref:`5 `. + +**Check the validity period of the certificate files in the system.** + +3. .. _alm_12054__en-us_topic_0191813937_en-us_topic_0087039414_li2768003415237: + + Check whether the current system time is in the validity period of the CA certificate. + + Run the **openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/cert/root/ca.crt** command to check the effective time and due time of the root certificate. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`6 `. + +4. .. _alm_12054__en-us_topic_0191813937_en-us_topic_0087039414_li6628516015237: + + Check whether the current system time is in the validity period of the HA root certificate. + + Run the **openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/root-ca.crt** command to check the effective time and due time of the HA root certificate. + + - If yes, go to :ref:`8 `. + - If no, go to :ref:`7 `. + +5. .. _alm_12054__en-us_topic_0191813937_en-us_topic_0087039414_li3401162015237: + + Check whether the current system time is in the validity period of the HA user certificate. + + Run the **openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/server.crt** command to check the effective time and due time of the HA user certificate. + + - If yes, go to :ref:`8 `. + + - If no, go to :ref:`7 `. + + The following is an example of the effective time and expiration time of a CA or HA certificate: + + .. code-block:: + + Certificate: + Data: + Version: 3 (0x2) + Serial Number: + 97:d5:0e:84:af:ec:34:d8 + Signature Algorithm: sha256WithRSAEncryption + Issuer: C=CountryName, ST=State, L=Locality, O=Organization, OU=IT, CN=HADOOP.COM + Validity + Not Before: Dec 13 06:38:26 2016 GMT // Effective time + Not After : Dec 11 06:38:26 2026 GMT // Expiration time + +**Import certificate files.** + +6. .. _alm_12054__en-us_topic_0191813937_en-us_topic_0087039414_li99782015237: + + Import a new CA certificate file. + + Contact O&M personnel to apply for or generate a new CA certificate file and import it. Manually clear the alarm and check whether this alarm is generated again during periodic check. + + - If yes, go to :ref:`8 `. + - If no, no further action is required. + +7. .. _alm_12054__en-us_topic_0191813937_en-us_topic_0087039414_li3092985115237: + + Import a new HA certificate file. + + Apply for or generate a new HA certificate file and import it by referring to :ref:`Replacing the HA Certificate `. Manually clear the alarm and check whether this alarm is generated again during periodic check. + + - If yes, go to :ref:`8 `. + - If no, no further action is required. + +8. .. _alm_12054__en-us_topic_0191813937_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
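As a supplement to the validity checks in steps 3 to 5, the expiry part of the check can also be scripted with the **-checkend** option of **openssl x509**. This sketch reuses the certificate paths from those steps; note that it only detects certificates that have already expired, so a certificate that is not yet valid still has to be identified from the **Not Before** field of the **-text** output.

.. code-block::

   # Returns a non-zero exit code for any certificate that has already expired.
   for CERT in "${CONTROLLER_HOME}/security/cert/root/ca.crt" \
               "${CONTROLLER_HOME}/security/certHA/root-ca.crt" \
               "${CONTROLLER_HOME}/security/certHA/server.crt"; do
       if openssl x509 -noout -checkend 0 -in "${CERT}"; then
           echo "${CERT}: still within its validity period"
       else
           echo "${CERT}: expired or unreadable"
       fi
   done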
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12055_the_certificate_file_is_about_to_expire.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12055_the_certificate_file_is_about_to_expire.rst new file mode 100644 index 0000000..3cb3e14 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12055_the_certificate_file_is_about_to_expire.rst @@ -0,0 +1,135 @@ +:original_name: alm_12055.html + +.. _alm_12055: + +ALM-12055 The Certificate File Is About to Expire +================================================= + +Description +----------- + +The system checks the certificate file at 23:00 every day. This alarm is generated if the certificate file is about to expire, that is, if its remaining validity period is less than the number of days set in the alarm threshold. + +This alarm is cleared when the status of the newly imported certificate is valid. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12055 Minor Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The system reminds users that the certificate file is about to expire. If the certificate file expires, some functions are restricted and cannot be used properly. + +Possible Causes +--------------- + +The remaining validity period of the CA certificate, HA root certificate, or HA user certificate is smaller than the alarm threshold. + +Procedure +--------- + +**Check the alarm cause.** + +#. Go to the MRS cluster details page and choose **Alarms**. + +#. In the real-time alarm list, click the row that contains the alarm. + + In the **Alarm Details** area, view the additional information about the alarm. + + - If **CA Certificate** is displayed in the additional alarm information, use PuTTY to log in to the active OMS management node as user **omm** and go to :ref:`3 `. + - If **HA root Certificate** is displayed in the additional information, check **Location** to obtain the name of the host involved in this alarm. Then use PuTTY to log in to the host as user **omm** and go to :ref:`4 `. + - If **HA server Certificate** is displayed in the additional information, check **Location** to obtain the name of the host involved in this alarm. Then use PuTTY to log in to the host as user **omm** and go to :ref:`5 `. + +**Check the validity period of the certificate files in the system.** + +3. .. _alm_12055__en-us_topic_0191813941_en-us_topic_0087039447_li31866665152950: + + Check whether the remaining validity period of the CA certificate is smaller than the alarm threshold. + + Run the **openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/cert/root/ca.crt** command to check the effective time and due time of the root certificate. + + - If yes, go to :ref:`6 `. 
+ - If no, go to :ref:`8 `. + +4. .. _alm_12055__en-us_topic_0191813941_en-us_topic_0087039447_li35214520152950: + + Check whether the remaining validity period of the HA root certificate is smaller than the alarm threshold. + + Run the **openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/root-ca.crt** command to check the effective time and due time of the HA root certificate. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`8 `. + +5. .. _alm_12055__en-us_topic_0191813941_en-us_topic_0087039447_li289449152950: + + Check whether the remaining validity period of the HA user certificate is smaller than the alarm threshold. + + Run the **openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/server.crt** command to check the effective time and due time of the HA user certificate. + + - If yes, go to :ref:`7 `. + + - If no, go to :ref:`8 `. + + The following is an example of the effective time and expiration time of a CA or HA certificate: + + .. code-block:: + + Certificate: + Data: + Version: 3 (0x2) + Serial Number: + 97:d5:0e:84:af:ec:34:d8 + Signature Algorithm: sha256WithRSAEncryption + Issuer: C=CountryName, ST=State, L=Locality, O=Organization, OU=IT, CN=HADOOP.COM + Validity + Not Before: Dec 13 06:38:26 2016 GMT // Effective time + Not After : Dec 11 06:38:26 2026 GMT // Expiration time + +**Import certificate files.** + +6. .. _alm_12055__en-us_topic_0191813941_en-us_topic_0087039447_li12048984152950: + + Import a new CA certificate file. + + Contact O&M personnel to apply for or generate a new CA certificate file and import it. Manually clear the alarm and check whether this alarm is generated again during periodic check. + + - If yes, go to :ref:`8 `. + - If no, no further action is required. + +7. .. _alm_12055__en-us_topic_0191813941_en-us_topic_0087039447_li50119675152950: + + Import a new HA certificate file. + + Apply for or generate a new HA certificate file and import it by referring to :ref:`Replacing the HA Certificate `. Manually clear the alarm and check whether this alarm is generated again during periodic check. + + - If yes, go to :ref:`8 `. + - If no, no further action is required. + +8. .. _alm_12055__en-us_topic_0191813941_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12357_failed_to_export_audit_logs_to_obs.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12357_failed_to_export_audit_logs_to_obs.rst new file mode 100644 index 0000000..8d38efa --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-12357_failed_to_export_audit_logs_to_obs.rst @@ -0,0 +1,93 @@ +:original_name: alm_12357.html + +.. _alm_12357: + +ALM-12357 Failed to Export Audit Logs to OBS +============================================ + +Description +----------- + +If the user has configured audit log export to the OBS on MRS Manager, the system regularly exports audit logs to the OBS. This alarm is reported if the system fails to access the OBS. + +This alarm is cleared after the system exports audit logs to the OBS successfully. 
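Before working through the procedure, a quick reachability test from the active management node can help separate a network or OBS-side fault from an AK/SK or path misconfiguration; the endpoint below is only a placeholder for the OBS endpoint configured under **System** > **Export Audit Log**.

.. code-block::

   # Placeholder endpoint; use the OBS endpoint configured on MRS Manager.
   OBS_ENDPOINT=obs.example-region.example.com
   # A failed TCP/TLS handshake here points to the first possible cause
   # (connection to the OBS server fails) rather than to credential settings.
   curl -sv --connect-timeout 10 "https://${OBS_ENDPOINT}" -o /dev/null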
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +12357 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The local system saves a maximum of seven compressed service audit log files. If this alarm persists, local service audit logs may be lost. + +The local system saves a maximum of 50 management audit log files (each file contains 100,000 records). If this alarm persists, local management audit logs may be lost. + +Possible Causes +--------------- + +- Connection to the OBS server fails. +- The specified OBS file system does not exist. +- The user AK/SK information is invalid. +- The local OBS configuration cannot be obtained. + +Procedure +--------- + +#. Log in to the OBS server and check whether the OBS server can be properly accessed. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`2 `. + +#. .. _alm_12357__en-us_topic_0191813876_li51875580143018: + + Contact the maintenance personnel to repair OBS. Then check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_12357__en-us_topic_0191813876_li17073310143018: + + On MRS Manager, choose **System** > **Export Audit Log**. Check whether the AK/SK information, file system name, and path are correct. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`4 `. + +#. .. _alm_12357__en-us_topic_0191813876_li52527297143018: + + Correct the information. Then check whether the alarm is cleared when the export task is executed again. + + .. note:: + + To check alarm clearance quickly, you can set the start time of audit log collection to 10 or 30 minutes later than the current time. After checking the result, restore the original start time. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. .. _alm_12357__en-us_topic_0191813876_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-13000_zookeeper_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-13000_zookeeper_service_unavailable.rst new file mode 100644 index 0000000..7b311d9 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-13000_zookeeper_service_unavailable.rst @@ -0,0 +1,170 @@ +:original_name: alm_13000.html + +.. _alm_13000: + +ALM-13000 ZooKeeper Service Unavailable +======================================= + +Description +----------- + +The system checks the ZooKeeper service status every 30 seconds. 
This alarm is generated when the ZooKeeper service is unavailable. + +This alarm is cleared when the ZooKeeper service recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +13000 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +ZooKeeper fails to provide coordination services for upper-layer components and the components depending on ZooKeeper may not run properly. + +Possible Causes +--------------- + +- The ZooKeeper instance is abnormal. +- The disk capacity is insufficient. +- The network is faulty. +- The DNS is installed on the ZooKeeper node. + +Procedure +--------- + +**Check the ZooKeeper service instance status.** + +#. On the MRS cluster details page, choose **Components** > **ZooKeeper** > **quorumpeer**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and choose **Services** > **ZooKeeper** > **quorumpeer**. + +#. Check whether the ZooKeeper instances are normal. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`3 `. + +#. .. _alm_13000__en-us_topic_0191813962_li43049911145525: + + Select instances whose status is not good and choose **More** > **Restart Instance**. + +#. Check whether the instance status is good after restart. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`19 `. + +#. .. _alm_13000__en-us_topic_0191813962_li64143807145525: + + On the **Alarms** tab page, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + + **Check disk status.** + +#. .. _alm_13000__en-us_topic_0191813962_li40423354145525: + + On the MRS cluster details page, choose **Components** > **ZooKeeper** > **quorumpeer**, and check the host information of each node housing the ZooKeeper instance. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and choose **Services** > **ZooKeeper** > **quorumpeer** to view the host information of each node housing the ZooKeeper instance. + +#. On the MRS cluster details page, click the **Nodes** tab and expand a node group. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Hosts**. + +#. In the **Disk Usage** column, check whether the disk space of each node housing ZooKeeper instances is insufficient (disk usage exceeds 80%). + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`11 `. + +#. .. _alm_13000__en-us_topic_0191813962_li66786352145525: + + Expand the disk capacity. For details, see :ref:`ALM-12017 Insufficient Disk Capacity `. + +#. On the **Alarms** tab page, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`11 `. + + **Check network communication status.** + +#. .. _alm_13000__en-us_topic_0191813962_li835031145525: + + On the Linux node housing the ZooKeeper instance, run the **ping** command to check whether the host names of other nodes housing the ZooKeeper instances can be pinged successfully. + + - If yes, go to :ref:`15 `. + - If no, go to :ref:`12 `. + +#. .. 
_alm_13000__en-us_topic_0191813962_li7515284145525: + + Modify the IP addresses in **/etc/hosts** and add the mapping between host names and IP addresses. + +#. Run the **ping** command again to check whether the host names of other nodes housing the ZooKeeper instances can be pinged successfully. + + - If yes, go to :ref:`14 `. + - If no, go to :ref:`19 `. + +#. .. _alm_13000__en-us_topic_0191813962_li15395686145525: + + On the **Alarms** tab page, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`15 `. + + **Check the DNS.** + +#. .. _alm_13000__en-us_topic_0191813962_li53340623145525: + + Check whether the DNS is installed on the node housing the ZooKeeper instance. On the Linux node housing the ZooKeeper instance, run the **cat /etc/resolv.conf** command to check whether the file is empty. + + - If yes, go to :ref:`16 `. + - If no, go to :ref:`19 `. + +#. .. _alm_13000__en-us_topic_0191813962_li54403854145525: + + Run the **service named status** command to check whether the DNS is started. + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`19 `. + +#. .. _alm_13000__en-us_topic_0191813962_li44636076145525: + + Run the **service named stop** command to stop the DNS service. If "Shutting down name server BIND waiting for named to shut down (28s)" is displayed, the DNS service is stopped successfully. Comment out the content (if any) in **/etc/resolv.conf**. + +#. On the **Alarms** tab page, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`19 `. + +#. .. _alm_13000__en-us_topic_0191813962_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-13001_available_zookeeper_connections_are_insufficient.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-13001_available_zookeeper_connections_are_insufficient.rst new file mode 100644 index 0000000..1e24fab --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-13001_available_zookeeper_connections_are_insufficient.rst @@ -0,0 +1,127 @@ +:original_name: alm_13001.html + +.. _alm_13001: + +ALM-13001 Available ZooKeeper Connections Are Insufficient +========================================================== + +Description +----------- + +The system checks ZooKeeper connections every 30 seconds. This alarm is generated when the system detects that the number of used ZooKeeper instance connections exceeds the threshold (80% of the maximum connections). + +This alarm is cleared when the number of used ZooKeeper instance connections is less than the threshold. 
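Broadly, the usage compared against this threshold is the number of client connections currently held by the quorumpeer instance divided by the configured maximum (**maxCnxns**). The following is a minimal sketch of counting the established connections on the alarmed node; the client port **2181** is the ZooKeeper default and is only an assumption, so use the port actually configured for the instance.

.. code-block::

   # Count established connections to the ZooKeeper client port (2181 is assumed).
   ZK_PORT=2181
   ss -ant | awk -v port="${ZK_PORT}" '$1 == "ESTAB" && $4 ~ (":" port "$") { n++ } END { print n + 0 }'
   # Compare the result with 80% of the configured maxCnxns (see the procedure below).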
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +13001 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Available ZooKeeper connections are insufficient. When the connection usage reaches 100%, external connections cannot be handled. + +Possible Causes +--------------- + +The number of connections to the ZooKeeper node exceeds the threshold. Connection leakage occurs on some connection processes, or the maximum number of connections does not meet the requirement of the actual scenario. + +Procedure +--------- + +#. Check the connection status. + + a. On the MRS cluster details page, choose **Alarms** > **ALM-13001 Available ZooKeeper Connections Are Insufficient** > **Location**. Check the IP address of the node for which the alarm is generated. + + b. Obtain the PID of the ZooKeeper process. Log in to the node for which this alarm is generated and run the **pgrep -f proc_zookeeper** command. + + c. Check whether the PID can be successfully obtained. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`2 `. + + d. .. _alm_13001__en-us_topic_0191813882_cn_58_42_000001_2_mmccppss_stepb2: + + Obtain all the IP addresses connected to the ZooKeeper instance and the number of connections and check 10 IP addresses with top connections. Run the **lsof -i|grep $pid \| awk '{print $9}' \| cut -d : -f 2 \| cut -d \\> -f 2 \| awk '{a[$1]++} END {for(i in a){print i,a[i] \| "sort -r -g -k 2"}}' \| head -10** command based on the obtained PID value. (**$pid** is the PID obtained in the preceding step.) + + e. Check whether the node IP addresses and the number of connections are successfully obtained. + + - If yes, go to :ref:`1.f `. + - If no, go to :ref:`2 `. + + f. .. _alm_13001__en-us_topic_0191813882_cn_58_42_000001_2_mmccppss_stepb4: + + Obtain the ID of the port connected to the process. Run the **lsof -i|grep $pid \| awk '{print $9}'|cut -d \\> -f 2 \|grep $IP\| cut -d : -f 2** command based on the obtained PID and IP address. (**$pid** and **$IP** are the PID and IP address obtained in the preceding step.) + + g. Check whether the port ID is successfully obtained. + + - If yes, go to :ref:`1.h `. + - If no, go to :ref:`2 `. + + h. .. _alm_13001__en-us_topic_0191813882_cn_58_42_000001_2_mmccppss_stepb5: + + Obtain the ID of the connected process. 
Log in to each IP address and run the following command based on the obtained port ID: **lsof -i|grep $port**. (**$port** is the port ID obtained in the preceding step.) + + i. Check whether the process ID is successfully obtained. + + - If yes, go to :ref:`1.j `. + - If no, go to :ref:`2 `. + + j. .. _alm_13001__en-us_topic_0191813882_stepb6: + + Check whether connection leakage occurs on the process based on the obtained process ID. + + - If yes, go to :ref:`1.k `. + - If no, go to :ref:`1.l `. + + k. .. _alm_13001__en-us_topic_0191813882_stepb7: + + Close the process where connection leakage occurs and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`1.l `. + + l. .. _alm_13001__en-us_topic_0191813882_stepb8: + + On the MRS cluster details page, choose **Components** > **ZooKeeper** > **Service Configuration**. Set **Type** to **All**, choose **quorumpeer** > **Performance**, and change the value of **maxCnxns** to **20000** or more. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager, choose **Services** > **ZooKeeper** > **Service Configuration**. Set **Type** to **All**, choose **quorumpeer** > **Performance**, and change the value of **maxCnxns** to **20000** or more. + + m. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_13001__en-us_topic_0191813882_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-13002_zookeeper_memory_usage_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-13002_zookeeper_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..6b1fc97 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-13002_zookeeper_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,114 @@ +:original_name: alm_13002.html + +.. _alm_13002: + +ALM-13002 ZooKeeper Memory Usage Exceeds the Threshold +====================================================== + +Description +----------- + +The system checks the ZooKeeper service status every 30 seconds. The alarm is generated when the memory usage of a ZooKeeper instance exceeds the threshold (80% of the maximum memory). + +The alarm is cleared when the memory usage is less than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +13002 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the available ZooKeeper memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The memory usage of the ZooKeeper instance is overused or the memory is inappropriately allocated. + +Procedure +--------- + +#. Check the memory usage. + + a. On the MRS cluster details page, choose **Alarms** > **ALM-13002 ZooKeeper Memory Usage Exceeds the Threshold** > **Location**. Check the IP address of the instance for which the alarm is generated. + + b. On the MRS cluster details page, choose **Components** > **ZooKeeper** > **Instances** > **quorumpeer** (IP address of the instance for which the alarm is generated) > **Customize** > **ZooKeeper Heap And Direct Buffer Resource**. Check the heap memory usage. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and choose **Services** > **ZooKeeper** > **Instance** > **quorumpeer** (IP address of the instance for which the alarm is generated) > **Customize** > **ZooKeeper Heap And Direct Buffer Resource**. Check the heap memory usage. + + c. Check whether the used heap memory of ZooKeeper reaches 80% of the maximum heap memory specified for ZooKeeper. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`1.f `. + + d. .. _alm_13002__en-us_topic_0191813895_cn_58_42_000001_3_mmccppss_stepb2: + + On MRS Manager, choose **Services** > **ZooKeeper** > **Configuration** > **All** > **quorumpeer** > **System**. Increase the value of **-Xmx** in **GC_OPTS** as required. + + e. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`1.f `. + + f. .. _alm_13002__en-us_topic_0191813895_cn_58_42_000001_3_mmccppss_stepb4: + + On the MRS cluster details page, choose **Components** > **ZooKeeper** > **Instances** > **quorumpeer** (IP address of the instance for which the alarm is generated) > **Customize** > **ZooKeeper Heap And Direct Buffer Resource**. Check the direct buffer memory usage. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and choose **Services** > **ZooKeeper** > **Instance** > **quorumpeer** (IP address of the instance for which the alarm is generated) > **Customize** > **ZooKeeper Heap And Direct Buffer Resource**. Check the direct buffer memory usage. + + g. Check whether the used direct buffer memory of ZooKeeper reaches 80% of the maximum direct buffer memory specified for ZooKeeper. + + - If yes, go to :ref:`1.h `. + - If no, go to :ref:`2 `. + + h. .. _alm_13002__en-us_topic_0191813895_li49457583153150: + + On the MRS cluster details page, choose **Components** > **ZooKeeper** > **Service Configuration**. Set **Type** to **All** and choose **quorumpeer** > **System**. + + Increase the value of **-XX:MaxDirectMemorySize** in **GC_OPTS** as required. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager, choose **Services** > **ZooKeeper** > **Service Configuration**. Set **Type** to **All** and choose **quorumpeer** > **System**. + + i. 
Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_13002__en-us_topic_0191813895_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14000_hdfs_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14000_hdfs_service_unavailable.rst new file mode 100644 index 0000000..694919d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14000_hdfs_service_unavailable.rst @@ -0,0 +1,107 @@ +:original_name: alm_14000.html + +.. _alm_14000: + +ALM-14000 HDFS Service Unavailable +================================== + +Description +----------- + +The system checks the service status of NameService every 30 seconds. This alarm is generated when the system considers that the HDFS service is unavailable because all the NameService services are abnormal. + +This alarm is cleared when at least one NameService service is normal and the system considers that the HDFS service recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14000 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +HDFS fails to provide services for HDFS service-based upper-layer components, such as HBase and MapReduce. As a result, users cannot read or write files. + +Possible Causes +--------------- + +- ZooKeeper is abnormal. +- All NameService services are abnormal. + +Procedure +--------- + +#. Check the ZooKeeper status. + + a. Go to the MRS cluster details page. On the **Components** tab page, check whether the health status of the ZooKeeper service is **Good**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click the **Services** tab. + + - If yes, go to :ref:`1.b `. + - If no, go to :ref:`2.a `. + + b. .. _alm_14000__en-us_topic_0191813870_cn_58_42_000001_4_mmccppss_ss2: + + Rectify the health status of the ZooKeeper service. For details, see :ref:`ALM-13000 ZooKeeper Service Unavailable `. Then check whether the health status of the ZooKeeper service is **Good**. + + - If yes, go to :ref:`1.c `. + - If no, go to :ref:`3 `. + + c. .. _alm_14000__en-us_topic_0191813870_cn_58_42_000001_4_mmccppss_ss3: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Handle the NameService service exception alarm. + + a. .. _alm_14000__en-us_topic_0191813870_cn_58_42_000001_4_mmccppss_ss4: + + Go to the MRS cluster details page. 
On the **Alarms** page, check whether all NameService services have abnormal alarms. + + - If yes, go to :ref:`2.b `. + - If no, go to :ref:`3 `. + + b. .. _alm_14000__en-us_topic_0191813870_cn_58_42_000001_4_mmccppss_ss5: + + Handle the abnormal NameService services following the instructions in :ref:`ALM-14010 NameService Service Is Abnormal ` and check whether each NameService service exception alarm is cleared. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`3 `. + + c. .. _alm_14000__en-us_topic_0191813870_cn_58_42_000001_4_mmccppss_checkbk_5: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_14000__en-us_topic_0191813870_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14001_hdfs_disk_usage_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14001_hdfs_disk_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..882c4ba --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14001_hdfs_disk_usage_exceeds_the_threshold.rst @@ -0,0 +1,104 @@ +:original_name: alm_14001.html + +.. _alm_14001: + +ALM-14001 HDFS Disk Usage Exceeds the Threshold +=============================================== + +Description +----------- + +The system checks the disk usage of the HDFS cluster every 30 seconds and compares the actual disk usage with the threshold. The HDFS cluster disk usage indicator has a default threshold. This alarm is generated when the HDFS disk usage exceeds the threshold. + +This alarm is cleared when the disk usage of the HDFS cluster is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14001 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| NSName | Specifies the NameService service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. 
| ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The performance of writing data to HDFS is affected. + +Possible Causes +--------------- + +The disk space configured for the HDFS cluster is insufficient. + +Procedure +--------- + +#. Check the disk capacity and delete unnecessary files. + + a. On the MRS cluster details page, choose **Components** > **HDFS**. The **Service Status** page is displayed. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and choose **Services** > **HDFS**. + + b. In the **Charts** area, view the value of the monitoring indicator **Percentage of HDFS Capacity** to check whether the HDFS disk usage exceeds the threshold (80% by default). + + - If yes, go to :ref:`1.c `. + - If no, go to :ref:`3 `. + + c. .. _alm_14001__en-us_topic_0191813969_cn_58_42_000001_5_mmccppss_step5: + + Use the client on the cluster node and run the **hdfs dfsadmin -report** command to check whether the value of **DFS Used%** is less than 100% minus the threshold. + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`3 `. + + d. Use the client on the cluster node and run the **hdfs dfs -rm -r** *file or directory path* command to delete unnecessary files. + + e. .. _alm_14001__en-us_topic_0191813969_li39567352: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Expand the system. + + a. .. _alm_14001__en-us_topic_0191813969_cn_58_42_000001_5_mmccppss_step13: + + Expand the disk capacity. + + b. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_14001__en-us_topic_0191813969_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14002_datanode_disk_usage_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14002_datanode_disk_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..5ce9bd3 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14002_datanode_disk_usage_exceeds_the_threshold.rst @@ -0,0 +1,102 @@ +:original_name: alm_14002.html + +.. _alm_14002: + +ALM-14002 DataNode Disk Usage Exceeds the Threshold +=================================================== + +Description +----------- + +The system checks the DataNode disk usage every 30 seconds and compares the actual disk usage with the threshold. The **Percentage of DataNode Capacity** indicator has a default threshold. This alarm is generated when the value of the **Percentage of DataNode Capacity** indicator exceeds the threshold. + +This alarm is cleared when the value of the **Percentage of DataNode Capacity** indicator is less than or equal to the threshold. 
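For reference, the per-DataNode usage tracked by this indicator can be listed with the HDFS client (as the procedure below does), which makes it easy to compare the alarmed DataNode with the others; run it as a user that has permission to execute **hdfs dfsadmin**.

.. code-block::

   # List every DataNode together with its DFS Used%; a gap of more than about
   # 10 percentage points between nodes suggests data skew (see step 2 below).
   hdfs dfsadmin -report | grep -E "^Name:|DFS Used%"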
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14002 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Insufficient disk space will impact read/write to HDFS. + +Possible Causes +--------------- + +- The disk space configured for the HDFS cluster is insufficient. +- Data skew occurs among DataNodes. + +Procedure +--------- + +#. Check the cluster disk capacity. + + a. Go to the MRS cluster details page. On the **Alarms** page, check whether the ALM-14001 HDFS Disk Usage Exceeds the Threshold alarm exists. + + - If yes, go to :ref:`1.b `. + - If no, go to :ref:`2.a `. + + b. .. _alm_14002__en-us_topic_0191813920_yt2: + + Handle the alarm by following the instructions in ALM-14001 HDFS Disk Usage Exceeds the Threshold and check whether the alarm is cleared. + + - If yes, go to :ref:`1.c `. + - If no, go to :ref:`3 `. + + c. .. _alm_14002__en-us_topic_0191813920_yt3: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Check the balance status of DataNodes. + + a. .. _alm_14002__en-us_topic_0191813920_li64268160: + + Use the client on the cluster node, run the **hdfs dfsadmin -report** command to view the value of **DFS Used%** on the DataNode for which the alarm is generated, and compare the value with those on other DataNodes. Check whether the difference between the values is larger than 10. + + - If yes, go to :ref:`2.b `. + - If no, go to :ref:`3 `. + + b. .. _alm_14002__en-us_topic_0191813920_step17: + + If data skew occurs, use the client on the cluster node and run the **hdfs balancer -threshold 10** command. + + c. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_14002__en-us_topic_0191813920_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14003_number_of_lost_hdfs_blocks_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14003_number_of_lost_hdfs_blocks_exceeds_the_threshold.rst new file mode 100644 index 0000000..10e194a --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14003_number_of_lost_hdfs_blocks_exceeds_the_threshold.rst @@ -0,0 +1,98 @@ +:original_name: alm_14003.html + +.. _alm_14003: + +ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold +========================================================== + +Description +----------- + +The system checks the number of lost blocks every 30 seconds and compares the number of lost blocks with the threshold. The lost blocks indicator has a default threshold. This alarm is generated when the number of lost blocks exceeds the threshold. + +This alarm is cleared when the number of lost blocks is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14003 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| NSName | Specifies the NameService service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Data stored in HDFS is lost. HDFS may enter the safe mode and cannot provide write services. Lost block data cannot be restored. + +Possible Causes +--------------- + +- The DataNode instance is abnormal. +- Data is deleted. + +Procedure +--------- + +#. Check the DataNode instance. + + a. On the MRS cluster details page, choose **Components** > **HDFS** > **Instances**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and choose **Services** > **HDFS** > **Instances**. + + b. Check whether the status of all DataNode instances is **Good**. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`1.c `. + + c. .. _alm_14003__en-us_topic_0191813959_li2677020115402: + + Restart the DataNode instance and check whether the restart is successful. + + - If yes, go to :ref:`2.b `. 
+ - If no, go to :ref:`2.a `. + +#. Delete the damaged file. + + a. .. _alm_14003__en-us_topic_0191813959_li435173115402: + + Use the client on the cluster node. Run the **hdfs fsck / -delete** command to delete the lost file. Then rewrite the file and recover the data. + + b. .. _alm_14003__en-us_topic_0191813959_li4975462315402: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_14003__en-us_topic_0191813959_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14004_number_of_damaged_hdfs_blocks_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14004_number_of_damaged_hdfs_blocks_exceeds_the_threshold.rst new file mode 100644 index 0000000..de4169e --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14004_number_of_damaged_hdfs_blocks_exceeds_the_threshold.rst @@ -0,0 +1,63 @@ +:original_name: alm_14004.html + +.. _alm_14004: + +ALM-14004 Number of Damaged HDFS Blocks Exceeds the Threshold +============================================================= + +Description +----------- + +The system checks the number of damaged blocks every 30 seconds and compares the number of damaged blocks with the threshold. The damaged blocks indicator has a default threshold. This alarm is generated when the number of damaged blocks exceeds the threshold. + +This alarm is cleared when the number of damaged blocks is less than or equal to the threshold. You are advised to run the **hdfs fsck /** command to check whether any file is completely damaged. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14004 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| NSName | Specifies the NameService service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. 
| ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Data is damaged and HDFS fails to read files. + +Possible Causes +--------------- + +- The DataNode instance is abnormal. +- Data verification information is damaged. + +Procedure +--------- + +#. Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14006_number_of_hdfs_files_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14006_number_of_hdfs_files_exceeds_the_threshold.rst new file mode 100644 index 0000000..d51483f --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14006_number_of_hdfs_files_exceeds_the_threshold.rst @@ -0,0 +1,95 @@ +:original_name: alm_14006.html + +.. _alm_14006: + +ALM-14006 Number of HDFS Files Exceeds the Threshold +==================================================== + +Description +----------- + +The system periodically checks the number of HDFS files every 30 seconds and compares the number of HDFS files with the threshold. This alarm is generated when the system detects that the number of HDFS files exceeds the threshold. + +This alarm is cleared when the number of HDFS files is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14006 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| NSName | Specifies the NameService service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Disk storage space is insufficient, which may result in data import failure. The performance of the HDFS system is affected. + +Possible Causes +--------------- + +The number of HDFS files exceeds the threshold. + +Procedure +--------- + +#. Check whether unnecessary files exist in the system. + + a. 
Use the client on the cluster node and run the **hdfs dfs -ls** *file or directory path* command to check whether the file or directory can be deleted. + + - If yes, go to :ref:`1.b `. + - If no, go to :ref:`2.a `. + + b. .. _alm_14006__en-us_topic_0191813952_alm-14006_mmccppss_step4: + + Run the **hdfs dfs -rm -r** *file or directory path* command. Delete unnecessary files, wait 5 minutes, and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Check the number of files in the system. + + a. .. _alm_14006__en-us_topic_0191813952_yt16: + + On MRS Manager, choose **System** > **Threshold Configuration**. + + b. In the navigation tree on the left, choose **Services** > **HDFS** > **HDFS File** > **Total Number of Files**. + + c. In the right pane, modify the threshold in the rule based on the number of current HDFS files. + + To check the number of HDFS files, choose **Services** > **HDFS**, click **Customize** in the **Real-Time Statistics** area on the right, and select the **HDFS File** monitoring item. + + d. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_14006__en-us_topic_0191813952_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14007_hdfs_namenode_memory_usage_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14007_hdfs_namenode_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..262ead1 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14007_hdfs_namenode_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,70 @@ +:original_name: alm_14007.html + +.. _alm_14007: + +ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold +========================================================== + +Description +----------- + +The system checks the HDFS NameNode memory usage every 30 seconds and compares the actual memory usage with the threshold. The HDFS NameNode memory usage has a default threshold. This alarm is generated when the HDFS NameNode memory usage exceeds the threshold. + +This alarm is cleared when the HDFS NameNode memory usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14007 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the memory usage of the HDFS NameNode is too high, data read/write performance of HDFS will be affected. + +Possible Causes +--------------- + +The HDFS NameNode memory is insufficient. + +Procedure +--------- + +#. Delete unnecessary files. + + a. Use the client on the cluster node and run the **hdfs dfs -rm -r** *file or directory path* command to delete unnecessary files. + b. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_14007__en-us_topic_0191813906_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14008_hdfs_datanode_memory_usage_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14008_hdfs_datanode_memory_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..09f1171 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14008_hdfs_datanode_memory_usage_exceeds_the_threshold.rst @@ -0,0 +1,70 @@ +:original_name: alm_14008.html + +.. _alm_14008: + +ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold +========================================================== + +Description +----------- + +The system checks the HDFS DataNode memory usage every 30 seconds and compares the actual memory usage with the threshold. The HDFS DataNode memory usage has a default threshold. This alarm is generated when the HDFS DataNode memory usage exceeds the threshold. + +This alarm is cleared when the HDFS DataNode memory usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14007 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The HDFS DataNode memory usage is too high, which affects the data read/write performance of the HDFS. + +Possible Causes +--------------- + +The HDFS DataNode memory is insufficient. + +Procedure +--------- + +#. Delete unnecessary files. + + a. Use the client on the cluster node and run the **hdfs dfs -rm -r** *file or directory path* command to delete unnecessary files. + b. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_14008__en-us_topic_0191813907_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14009_number_of_faulty_datanodes_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14009_number_of_faulty_datanodes_exceeds_the_threshold.rst new file mode 100644 index 0000000..853fac4 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14009_number_of_faulty_datanodes_exceeds_the_threshold.rst @@ -0,0 +1,151 @@ +:original_name: alm_14009.html + +.. _alm_14009: + +ALM-14009 Number of Faulty DataNodes Exceeds the Threshold +========================================================== + +Description +----------- + +The system periodically checks the number of faulty DataNodes in the HDFS cluster every 30 seconds, and compares the number with the threshold. The number of faulty DataNodes has a default threshold. This alarm is generated when the number of faulty DataNodes in the HDFS cluster exceeds the threshold. + +This alarm is cleared when the number of faulty DataNodes in the HDFS cluster is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14009 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Faulty DataNodes cannot provide HDFS services. + +Possible Causes +--------------- + +- DataNodes are faulty or overloaded. +- The network between the NameNode and the DataNode is disconnected or busy. +- NameNodes are overloaded. + +Procedure +--------- + +#. Check whether DataNodes are faulty. + + a. Use the client on the cluster node and run the **hdfs dfsadmin -report** command to check whether DataNodes are faulty. + + - If yes, go to :ref:`1.b `. + - If no, go to :ref:`2.a `. + + b. .. _alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step4: + + On the MRS cluster details page, choose **Components** > **HDFS** > **Instances** to check whether the DataNode is stopped. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and choose **Services** > **HDFS** > **Instances**. + + - If yes, go to :ref:`1.c `. + - If no, go to :ref:`2.a `. + + c. .. _alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step5: + + Select the DataNode instance, and choose **More** > **Restart Instance** to restart it. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Check the status of the network between the NameNode and the DataNode. + + a. .. _alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step6: + + Log in to the service IP address of the node where the faulty DataNode is located, and run the **ping** *IP address of the NameNode* command to check whether the network between the DataNode and the NameNode is abnormal. + + - If yes, go to :ref:`2.b `. + - If no, go to :ref:`3.a `. + + b. .. _alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step7: + + Rectify the network fault. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3.a `. + +#. Check whether the DataNode is overloaded. + + a. .. _alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step8: + + On the MRS cluster details page, click **Alarms** and check whether the alarm ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold exists. + + - If yes, go to :ref:`3.b `. + - If no, go to :ref:`4.a `. + + b. .. _alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step13: + + Follow procedures in :ref:`ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold ` to handle the alarm and check whether the alarm is cleared. + + - If yes, go to :ref:`3.c `. + - If no, go to :ref:`4.a `. + + c. .. _alm_14009__en-us_topic_0191813881_ss10: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4.a `. + +#. Check whether the NameNode is overloaded. + + a. .. _alm_14009__en-us_topic_0191813881_step9: + + On the MRS cluster details page, click **Alarms** and check whether the alarm ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold exists. + + - If yes, go to :ref:`4.b `. + - If no, go to :ref:`5 `. + + b. .. _alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step14: + + Follow procedures in :ref:`ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold ` to handle the alarm and check whether the alarm is cleared. 
+ + - If yes, go to :ref:`4.c `. + - If no, go to :ref:`5 `. + + c. .. _alm_14009__en-us_topic_0191813881_ss13: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. .. _alm_14009__en-us_topic_0191813881_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14010_nameservice_service_is_abnormal.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14010_nameservice_service_is_abnormal.rst new file mode 100644 index 0000000..b663062 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14010_nameservice_service_is_abnormal.rst @@ -0,0 +1,171 @@ +:original_name: alm_14010.html + +.. _alm_14010: + +ALM-14010 NameService Service Is Abnormal +========================================= + +Description +----------- + +The system checks the NameService service status every 180 seconds. This alarm is generated when the NameService service is unavailable. + +This alarm is cleared when the NameService service recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14010 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+---------------------------------------------------------------------+ +| Parameter | Description | ++=============+=====================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ +| NSName | Specifies the NameService service for which the alarm is generated. | ++-------------+---------------------------------------------------------------------+ + +Impact on the System +-------------------- + +HDFS fails to provide services for upper-layer components based on the NameService service, such as HBase and MapReduce. As a result, users cannot read or write files. + +Possible Causes +--------------- + +- The JournalNode is faulty. +- The DataNode is faulty. +- The disk capacity is insufficient. +- The NameNode enters safe mode. + +Procedure +--------- + +#. Check the status of the JournalNode instance. + + a. On the MRS Manager home page, click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and choose **Services**. + + b. Click **HDFS**. + + c. Click **Instance**. + + d. Check whether the **Health Status** of the JournalNode is **Good**. + + - If yes, go to :ref:`2.a `. + - If no, go to :ref:`1.e `. + + e. .. _alm_14010__en-us_topic_0191813899_alm14010_mmccppss_step12: + + Select the faulty JournalNode, and choose **More** > **Restart Instance**. 
Check whether the JournalNode successfully restarts. + + - If yes, go to :ref:`1.f `. + - If no, go to :ref:`5 `. + + f. .. _alm_14010__en-us_topic_0191813899_alm14010_mmccppss_step10: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Check the status of the DataNode instance. + + a. .. _alm_14010__en-us_topic_0191813899_alm14010_mmccppss_step11: + + On the MRS cluster details page, click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and choose **Services**. + + b. Click **HDFS**. + + c. In **Operation and Health Summary**, check whether the **Health Status** of all DataNodes is **Good**. + + - If yes, go to :ref:`3.a `. + - If no, go to :ref:`2.d `. + + d. .. _alm_14010__en-us_topic_0191813899_alm14010_mmccppss_step14: + + Click **Instances**. On the DataNode management page, select the faulty DataNode, and choose **More** > **Restart Instance**. Check whether the DataNode successfully restarts. + + - If yes, go to :ref:`2.e `. + - If no, go to :ref:`3.a `. + + e. .. _alm_14010__en-us_topic_0191813899_alm14010_mmccppss_step15: + + Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4.a `. + +#. Check the disk status. + + a. .. _alm_14010__en-us_topic_0191813899_alm14010_mmccppss_step24: + + On the MRS cluster details page, click the **Nodes** tab and expand a node group. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Hosts**. + + b. In the **Disk Usage** column, check whether disk space is insufficient. + + - If yes, go to :ref:`3.c `. + - If no, go to :ref:`4.a `. + + c. .. _alm_14010__en-us_topic_0191813899_alm14010_mmccppss_step26: + + Expand the disk capacity. + + d. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4.a `. + +#. Check whether the NameNode is in safe mode. + + a. .. _alm_14010__en-us_topic_0191813899_step28: + + Use the client on the cluster node and run the **hdfs dfsadmin -safemode get** command to check whether **Safe mode is ON** is displayed (see the command sketch after this procedure). + + The information following **Safe mode is ON** is alarm information and is displayed based on actual conditions. + + - If yes, go to :ref:`4.b `. + - If no, go to :ref:`5 `. + + b. .. _alm_14010__en-us_topic_0191813899_li66373591: + + Use the client on the cluster node and run the **hdfs dfsadmin -safemode leave** command. + + c. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. .. _alm_14010__en-us_topic_0191813899_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
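The safe mode check referenced in step 4 can be sketched as follows. This is a minimal example run from a cluster node on which the HDFS client has already been configured, and it only illustrates the two commands named in steps 4.a and 4.b.

.. code-block:: bash

   # Query the safe mode state of the NameService. "Safe mode is ON"
   # means the NameNode is not yet accepting write operations.
   hdfs dfsadmin -safemode get

   # Force the NameNode to leave safe mode only if it does not exit on
   # its own after the underlying fault has been rectified (step 4.b).
   hdfs dfsadmin -safemode leave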
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14011_hdfs_datanode_data_directory_is_not_configured_properly.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14011_hdfs_datanode_data_directory_is_not_configured_properly.rst new file mode 100644 index 0000000..dd7f4be --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14011_hdfs_datanode_data_directory_is_not_configured_properly.rst @@ -0,0 +1,186 @@ +:original_name: alm_14011.html + +.. _alm_14011: + +ALM-14011 HDFS DataNode Data Directory Is Not Configured Properly +================================================================= + +Description +----------- + +The DataNode parameter **dfs.datanode.data.dir** specifies the DataNode data directory. This alarm is generated in any of the following scenarios: + +- A configured data directory cannot be created. +- A data directory uses the same disk as other critical directories in the system. +- Multiple directories use the same disk. + +This alarm is cleared when the DataNode data directory is configured properly and this DataNode is restarted. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14011 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the DataNode data directory is mounted on critical directories such as the root directory, the disk space of the root directory will be used up after running for a long time. This causes a system fault. + +If the DataNode data directory is not configured properly, HDFS performance will deteriorate. + +Possible Causes +--------------- + +- The DataNode data directory fails to be created. +- The DataNode data directory uses the same disk as critical directories, such as **/** or **/boot**. +- Multiple directories in the DataNode data directory use the same disk. + +Procedure +--------- + +#. Check the alarm cause and information about the DataNode for which the alarm is generated. + + a. On the MRS cluster details page, click **Alarms**. In the alarm list, click the alarm. + b. In the **Alarm Details** area, view **Alarm Cause** to obtain the cause of the alarm. In **HostName** of **Location**, obtain the host name of the DataNode for which the alarm is generated. + +#. Delete directories that do not comply with the disk plan from the DataNode data directory. + + a. Choose **Components** > **HDFS** > **Instances**. In the instance list, click the DataNode instance on the node for which the alarm is generated. + + b. Click **Instance Configuration** and view the value of the DataNode parameter **dfs.datanode.data.dir**. + + c. Check whether all DataNode data directories are consistent with the disk plan. 
+ + - If yes, go to :ref:`2.d `. + - If no, go to :ref:`2.g `. + + d. .. _alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_alm14011_mmccppss_s6: + + Modify the DataNode parameter **dfs.datanode.data.dir** and delete the incorrect directories. + + e. Choose **Components** > **HDFS** > **Instances** to restart the DataNode instance. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.g `. + + g. .. _alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_s9: + + Log in to the DataNode for which the alarm is generated. + + - If the alarm cause is "The DataNode data directory fails to be created", go to :ref:`3.a `. + - If the alarm cause is "The DataNode data directory uses the same disk as critical directories, such **/** or **/boot**", go to :ref:`4.a `. + - If the alarm cause is "Multiple directories in the DataNode data directory use the same disk", go to :ref:`5.a `. + +#. Check whether the DataNode data directory fails to be created. + + a. .. _alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_alm14011_mmccppss_s10: + + Run the following commands to switch the user: + + **sudo su - root** + + **su - omm** + + b. Run the **ls** command to check whether the directories exist in the DataNode data directory. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`3.c `. + + c. .. _alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_alm14011_mmccppss_s12: + + Run the **mkdir** *data directory* command to create a directory and check whether the directory is successfully created. + + - If yes, go to :ref:`6.a `. + - If no, go to :ref:`3.d `. + + d. .. _alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_s1233: + + Click **Alarms** to check whether alarm ALM-12017 Insufficient Disk Capacity exists. + + - If yes, go to :ref:`3.e `. + - If no, go to :ref:`3.f `. + + e. .. _alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_s154: + + Adjust the disk capacity and check whether alarm ALM-12017 Insufficient Disk Capacity is cleared. For details, see :ref:`ALM-12017 Insufficient Disk Capacity `. + + - If yes, go to :ref:`ALM-12017 Insufficient Disk Capacity `. + - If no, go to :ref:`7 `. + + f. .. _alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_alm14011_mmccppss_s13: + + Check whether user **omm** has the **rwx** or **x** permission of all the upper-layer directories of the directory. (For example, for **/tmp/abc/**, user **omm** has the **x** permission for directory **tmp** and the **rwx** permission for directory **abc**.) + + - If yes, go to :ref:`6.a `. + - If no, go to :ref:`3.g `. + + g. .. _alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_s14: + + Run the **chmod u+rwx** *path* or **chmod u+x** *path* command as the **root** user to add the **rwx** or **x** permission to the paths. Then, go to :ref:`3.c `. + +#. Check whether the DataNode data directory uses the same disk as other critical directories in the system. + + a. .. _alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_s16: + + Run the **df** command to obtain the disk mounting information of each directory in the DataNode data directory. + + b. Check whether the directories mounted to the disk are critical directories, such as **/** or **/boot**. + + - If yes, go to :ref:`4.c `. + - If no, go to :ref:`6.a `. + + c. .. 
_alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_s18: + + Change the value of the DataNode parameter **dfs.datanode.data.dir** and delete the directories that use the same disk as critical directories. + + d. Go to :ref:`6.a `. + +#. Check whether multiple directories in the DataNode data directory use the same disk. + + a. .. _alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_s20: + + Run the **df** command to obtain the disk mounting information of each directory in the DataNode data directory. Record the mounted directory in the command output. + + b. Modify the DataNode node parameter **dfs.datanode.data.dir** to reserve one of the directories mounted on the same disk directory. + + c. Go to :ref:`6.a `. + +#. Restart the DataNode and check whether the alarm is cleared. + + a. .. _alm_14011__en-us_topic_0191813967_en-us_topic_0035998730_s23: + + Choose **Components** > **HDFS** > **Instances** to restart the DataNode instance. + + b. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +#. .. _alm_14011__en-us_topic_0191813967_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14012_hdfs_journalnode_data_is_not_synchronized.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14012_hdfs_journalnode_data_is_not_synchronized.rst new file mode 100644 index 0000000..f271030 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-14012_hdfs_journalnode_data_is_not_synchronized.rst @@ -0,0 +1,122 @@ +:original_name: alm_14012.html + +.. _alm_14012: + +ALM-14012 HDFS JournalNode Data Is Not Synchronized +=================================================== + +Description +----------- + +On the active NameNode, the system checks data synchronization on all JournalNodes in the cluster every 5 minutes. This alarm is generated when data on a JournalNode is not synchronized with that on other JournalNodes. + +This alarm is cleared in 5 minutes after data on JournalNodes is synchronized. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +14012 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------+------------------------------------------------------------------------------------------------+ +| Parameter | Description | ++=============+================================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------+------------------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------+------------------------------------------------------------------------------------------------+ +| IP | Specifies the service IP address of the JournalNode instance for which the alarm is generated. 
| ++-------------+------------------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +When a JournalNode is working incorrectly, data on the node is not synchronized with that on other JournalNodes. If data on more than half of JournalNodes is not synchronized, the NameNode cannot work correctly, making the HDFS service unavailable. + +Possible Causes +--------------- + +- The JournalNode instance has not been started or has been stopped. +- The JournalNode instance is working incorrectly. +- The network of the JournalNode is unreachable. + +Procedure +--------- + +#. Check whether the JournalNode instance has been started. + + a. On the MRS cluster details page, click **Alarms**. In the alarm list, click the alarm. + + b. In the **Alarm Details** area, check **Location** and obtain the IP address of the JournalNode for which the alarm is generated. + + c. Choose **Components** > **HDFS** > **Instances**. In the instance list, click the JournalNode for which the alarm is generated and check whether **Operating Status** of the node is **Started**. + + - If yes, go to :ref:`2.a `. + - If no, go to :ref:`1.d `. + + d. .. _alm_14012__en-us_topic_0191813911_alm14012_mmccppss_s4: + + Select the JournalNode instance and choose **More** > **Start Instance** to start it. + + e. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. Check whether the JournalNode instance is working correctly. + + a. .. _alm_14012__en-us_topic_0191813911_alm14012_mmccppss_s6: + + Check whether **Health Status** of the JournalNode instance is **Good**. + + - If yes, go to :ref:`3.a `. + - If no, go to :ref:`2.b `. + + b. .. _alm_14012__en-us_topic_0191813911_s7: + + Select the JournalNode instance and choose **More** > **Restart Instance** to restart it. + + c. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. Check whether the network of the JournalNode is reachable. + + a. .. _alm_14012__en-us_topic_0191813911_alm14012_mmccppss_s10: + + On the MRS cluster details page, choose **Components** > **HDFS** > **Instances** to check the service IP address of the active NameNode. + + b. Log in to the active NameNode. + + c. Run the **ping** command to check whether a timeout occurs or the network between the active NameNode and the JournalNode is unreachable. + + **ping** *service IP address of the JournalNode* + + - If yes, go to :ref:`3.d `. + - If no, go to :ref:`4 `. + + d. .. _alm_14012__en-us_topic_0191813911_alm14012_mmccppss_s13: + + Contact O&M personnel to rectify the network fault. Wait 5 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_14012__en-us_topic_0191813911_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16000_percentage_of_sessions_connected_to_the_hiveserver_to_the_maximum_number_allowed_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16000_percentage_of_sessions_connected_to_the_hiveserver_to_the_maximum_number_allowed_exceeds_the_threshold.rst new file mode 100644 index 0000000..d37da8b --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16000_percentage_of_sessions_connected_to_the_hiveserver_to_the_maximum_number_allowed_exceeds_the_threshold.rst @@ -0,0 +1,77 @@ +:original_name: alm_16000.html + +.. _alm_16000: + +ALM-16000 Percentage of Sessions Connected to the HiveServer to the Maximum Number Allowed Exceeds the Threshold +================================================================================================================ + +Description +----------- + +The system checks the percentage of sessions connected to the HiveServer to the maximum number allowed every 30 seconds. This indicator can be viewed on the Hive service monitoring page. This alarm is generated when the the percentage of sessions connected to the HiveServer to the maximum number allowed exceeds the specified threshold (90% by default). + +This alarm can be automatically cleared when the percentage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +16000 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +If a connection alarm is generated, too many sessions are connected to the HiveServer and new connections cannot be created. + +Possible Causes +--------------- + +Too many clients are connected to the HiveServer. + +Procedure +--------- + +#. Increase the maximum number of connections to Hive. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Choose **Hive** > **Service Configuration** and switch **Basic** to **All**. + c. 
Increase the value of the **hive.server.session.control.maxconnections** configuration item. Suppose the value of the configuration item is A, the threshold is B, and sessions connected to the HiveServer is C. Adjust the value of the configuration item according to A x B > C. Sessions connected to the HiveServer can be viewed on the Hive monitoring page. + d. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_16000__en-us_topic_0191813892_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16001_hive_warehouse_space_usage_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16001_hive_warehouse_space_usage_exceeds_the_threshold.rst new file mode 100644 index 0000000..864c069 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16001_hive_warehouse_space_usage_exceeds_the_threshold.rst @@ -0,0 +1,110 @@ +:original_name: alm_16001.html + +.. _alm_16001: + +ALM-16001 Hive Warehouse Space Usage Exceeds the Threshold +========================================================== + +Description +----------- + +The system checks the Hive warehouse space usage every 30 seconds. The indicator **Percentage of HDFS Space Used by Hive to the Available Space** can be viewed on the Hive service monitoring page. This alarm is generated when the Hive warehouse space usage exceeds the specified threshold (85% by default). + +This alarm is cleared when the Hive warehouse space usage is less than or equal to the threshold. You can reduce the warehouse space usage by expanding the warehouse capacity or releasing the used space. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +16001 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +The system fails to write data, which causes data loss. 
+ +Possible Causes +--------------- + +- The upper limit of the HDFS capacity available for Hive is too small. +- The system disk space is insufficient. +- Some data nodes break down. + +Procedure +--------- + +#. Expand the system configuration. + + a. Analyze the cluster HDFS capacity usage and increase the upper limit of the HDFS capacity available for Hive. + + Go to the MRS cluster details page, choose **Components** > **Hive** > **Service Configuration**, set **Type** to **All**, search for **hive.metastore.warehouse.size.percent**, and increase the value of this parameter. Suppose that the value of the configuration item is A, total HDFS storage space is B, the threshold is C, and HDFS space used by Hive is D. Adjust the value of the configuration item according to A x B x C > D. The total HDFS storage space can be viewed on the HDFS monitoring page, and HDFS space used by Hive can be viewed on the Hive monitoring page. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and choose **Components** > **Hive** > **Service Configuration**. + + b. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Expand the system. + + a. .. _alm_16001__en-us_topic_0191813901_s332: + + Add nodes. + + b. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3.a `. + +#. Check whether the data node is normal. + + a. .. _alm_16001__en-us_topic_0191813901_li51692872: + + Go to the cluster details page and choose **Alarms**. + + b. Check whether ALM-12006 Node Fault, ALM-12007 Process Fault, or ALM-14002 DataNode Disk Usage Exceeds the Threshold exists. + + - If yes, go to :ref:`3.c `. + - If no, go to :ref:`4 `. + + c. .. _alm_16001__en-us_topic_0191813901_aalm-16001_mmccppss_step5: + + Clear the alarm by following the steps provided in ALM-12006 Node Fault, ALM-12007 Process Fault, or ALM-14002 DataNode Disk Usage Exceeds the Threshold. + + d. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_16001__en-us_topic_0191813901_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16002_hive_sql_execution_success_rate_is_lower_than_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16002_hive_sql_execution_success_rate_is_lower_than_the_threshold.rst new file mode 100644 index 0000000..215eb64 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16002_hive_sql_execution_success_rate_is_lower_than_the_threshold.rst @@ -0,0 +1,156 @@ +:original_name: alm_16002.html + +.. _alm_16002: + +ALM-16002 Hive SQL Execution Success Rate Is Lower Than the Threshold +===================================================================== + +Description +----------- + +The system checks the percentage of the HiveQL statements that are executed successfully every 30 seconds. 
Percentage of HiveQL statements that are executed successfully = Number of HiveQL statements that are executed successfully by Hive in a specified period/Total number of HiveQL statements that are executed by Hive. This indicator can be viewed on the Hive service monitoring page. This alarm is generated when the percentage of the HiveQL statements that are executed successfully is lower than the specified threshold (90% by default). The name of the host for which the alarm is generated can be obtained from the location information of the alarm. The host IP address is the IP address of the HiveServer node. + +This alarm is cleared when the percentage of the HiveQL statements that are executed successfully in a test period is greater than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +16002 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +The system configuration and performance cannot meet service processing requirements. + +Possible Causes +--------------- + +- A syntax error occurs in HiveQL commands. +- The HBase service is abnormal when a Hive on HBase task is being performed. +- Basic services that Hive depends on, such as HDFS, Yarn, and ZooKeeper, are abnormal. + +Procedure +--------- + +#. Check whether the HiveQL commands comply with syntax. + + a. Use the Hive client to log in to the HiveServer node for which the alarm is generated. Query the HiveQL syntax standard provided by Apache, and check whether the HiveQL commands are correct. For details, see https://cwiki.apache.org/confluence/display/hive/languagemanual. + + - If yes, go to :ref:`2.a `. + - If no, go to :ref:`1.b `. + + .. note:: + + To view the user who runs an incorrect statement, download HiveServerAudit logs of the HiveServer node for which this alarm is generated. Set **Start time** and **End time** to 10 minutes before and after the alarm generation time respectively. Open the log file and search for the **Result=FAIL** keyword to filter the log information about the incorrect statement, and then view the user who runs the incorrect statement according to **UserName** in the log information. + + b. .. _alm_16002__en-us_topic_0191813927_aalm-16002_mmccppss_step2: + + Enter correct HiveQL statements, and check whether the command can be properly executed. + + - If yes, go to :ref:`4.e `. + - If no, go to :ref:`2.a `. + +#. Check whether the HBase service is abnormal. + + a. .. _alm_16002__en-us_topic_0191813927_step11: + + Check whether a Hive on HBase task is performed. + + - If yes, go to :ref:`2.b `. + - If no, go to :ref:`3.a `. + + b. ..
_alm_16002__en-us_topic_0191813927_aalm-16002_mmccppss_step12: + + Check whether the HBase service is normal in the service list. + + - If yes, go to :ref:`3.a `. + - If no, go to :ref:`2.c `. + + c. .. _alm_16002__en-us_topic_0191813927_aalm-16002_mmccppss_step_15: + + Check the alarms displayed on the alarm page and clear them according to **Alarm Help**. + + d. Enter correct HiveQL statements, and check whether the command can be properly executed. + + - If yes, go to :ref:`4.e `. + - If no, go to :ref:`3.a `. + +#. Check whether the Spark service is abnormal. + + a. .. _alm_16002__en-us_topic_0191813927_step22: + + Check whether the Spark service is normal in the service list. + + - If yes, go to :ref:`4.a `. + - If no, go to :ref:`3.b `. + + b. .. _alm_16002__en-us_topic_0191813927_step_25: + + Check the alarms displayed on the alarm page and clear them according to **Alarm Help**. + + c. Enter correct HiveQL statements, and check whether the command can be properly executed. + + - If yes, go to :ref:`4.e `. + - If no, go to :ref:`4.a `. + +#. Check whether HDFS, Yarn, and ZooKeeper are normal. + + a. .. _alm_16002__en-us_topic_0191813927_li51692872: + + Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. In the service list, check whether the services, such as HDFS, Yarn, and ZooKeeper are normal. + + - If yes, go to :ref:`4.e `. + - If no, go to :ref:`4.c `. + + c. .. _alm_16002__en-us_topic_0191813927_aalm-16002_mmccppss_step_5: + + Check the alarms displayed on the alarm page and clear them according to **Alarm Help**. + + d. Enter correct HiveQL statements, and check whether the command can be properly executed. + + - If yes, go to :ref:`4.e `. + - If no, go to :ref:`5 `. + + e. .. _alm_16002__en-us_topic_0191813927_step_6: + + Wait one minute and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. .. _alm_16002__en-us_topic_0191813927_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16004_hive_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16004_hive_service_unavailable.rst new file mode 100644 index 0000000..1fce38a --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16004_hive_service_unavailable.rst @@ -0,0 +1,213 @@ +:original_name: alm_16004.html + +.. _alm_16004: + +ALM-16004 Hive Service Unavailable +================================== + +Description +----------- + +The system checks the Hive service status every 30 seconds. This alarm is generated when the Hive service is unavailable. + +This alarm is cleared when the Hive service recovers. 
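+
+As an optional quick check (a sketch only, not part of the procedure below), you can verify from a client node whether HiveServer still accepts connections. The JDBC URL assumes the default HiveServer2 port 10000 and a cluster without Kerberos; adjust it to your environment.
+
+.. code-block:: bash
+
+   # Replace <hiveserver_ip> with the HiveServer instance IP shown on MRS Manager
+   beeline -u "jdbc:hive2://<hiveserver_ip>:10000/" -e "show databases;"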
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +16004 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The system cannot provide data loading, query, and extraction services. + +Possible Causes +--------------- + +- Basic services, such as ZooKeeper, HDFS, Yarn, and DBService, work incorrectly, or the Hive process is faulty. + + - ZooKeeper is abnormal. + - HDFS is abnormal. + - Yarn is abnormal. + - DBService is abnormal. + - The Hive service process is faulty. If the alarm is caused by a Hive process fault, the alarm report has a delay of about 5 minutes. + +- The network communication between the Hive service and basic services is interrupted. + +Procedure +--------- + +#. Check the HiveServer/MetaStore process status. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Choose **Hive** > **Instances**. In the Hive instance list, check whether the status of all HiveServer/MetaStore instances is **Unknown**. + + - If yes, go to :ref:`1.c `. + - If no, go to :ref:`2 `. + + c. .. _alm_16004__en-us_topic_0191813910_li15736882153452: + + Above the Hive instance list, choose **More** > **Restart Instance** to restart the HiveServer/MetaStore process. + + d. In the alarm list, check whether ALM-16004 Hive Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_16004__en-us_topic_0191813910_li63276134153458: + + Check the ZooKeeper status. + + a. Go to the cluster details page and choose **Alarms**. + + b. On MRS Manager, check whether the ALM-12007 Process Fault alarm is reported. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`3 `. + + c. .. _alm_16004__en-us_topic_0191813910_li17867059153452: + + In the **Alarm Details** area of ALM-12007 Process Fault, check whether **ServiceName** is **ZooKeeper**. + + - If yes, go to :ref:`2.d `. + - If no, go to :ref:`3 `. + + d. .. _alm_16004__en-us_topic_0191813910_li26585804153452: + + Rectify the fault by following the steps provided in ALM-12007 Process Fault. + + e. In the alarm list, check whether ALM-16004 Hive Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_16004__en-us_topic_0191813910_li315441715352: + + Check the HDFS status. + + a. Go to the cluster details page and choose **Alarms**. + + b. In the alarm list, check whether the alarm ALM-14000 HDFS Service Unavailable exists. + + - If yes, go to :ref:`3.c `. + - If no, go to :ref:`4 `. + + c. .. _alm_16004__en-us_topic_0191813910_li2196200153452: + + Rectify the fault by following the steps provided in ALM-14000 HDFS Service Unavailable. + + d. In the alarm list, check whether ALM-16004 Hive Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. ..
_alm_16004__en-us_topic_0191813910_li3789476315357: + + Check the Yarn status. + + a. Go to the cluster details page and choose **Alarms**. + + b. In the alarm list on MRS Manager, check whether the alarm ALM-18000 Yarn Service Unavailable is generated. + + - If yes, go to :ref:`4.c `. + - If no, go to :ref:`5 `. + + c. .. _alm_16004__en-us_topic_0191813910_li64260695153452: + + Rectify the fault by following the steps provided in ALM-18000 Yarn Service Unavailable. + + d. In the alarm list, check whether ALM-16004 Hive Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. Check the DBService status. + + a. Go to the cluster details page and choose **Alarms**. + + b. In the alarm list on MRS Manager, check whether ALM-27001 DBService Unavailable is generated. + + - If yes, go to :ref:`5.c `. + - If no, go to :ref:`6 `. + + c. .. _alm_16004__en-us_topic_0191813910_li19704975153452: + + Rectify the fault by following the handling procedure in :ref:`ALM-27001 DBService Is Unavailable `. + + d. In the alarm list, check whether ALM-16004 Hive Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6 `. + +#. .. _alm_16004__en-us_topic_0191813910_li23165657153517: + + Check the network connection between Hive and ZooKeeper, HDFS, Yarn, and DBService. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Click **Hive**. + + c. Click **Instances**. + + The HiveServer instance list is displayed. + + d. Click **Host Name** in the row of **HiveServer**. + + The HiveServer host status page is displayed. + + e. .. _alm_16004__en-us_topic_0191813910_li39788839153452: + + Record the IP address under **Summary**. + + f. Use the IP address obtained in :ref:`6.e ` to log in to the host where HiveServer is located. + + g. Run the **ping** command to check whether the network connection between the host that runs HiveServer and the hosts that run the ZooKeeper, HDFS, Yarn, and DBService services is normal, as shown in the connectivity check sketch after this procedure. The method of obtaining the IP addresses of the hosts that run the ZooKeeper, HDFS, Yarn, and DBService services is the same as that of obtaining the HiveServer IP address. + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`6.h `. + + h. .. _alm_16004__en-us_topic_0191813910_li44761520153452: + + Contact the O&M personnel to restore the network. + + i. In the alarm list, check whether ALM-16004 Hive Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +#. .. _alm_16004__en-us_topic_0191813910_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__.
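+
+For the connectivity check in step 6.g, a small loop such as the following can be run on the HiveServer host to test all dependent services in one pass. This is only a sketch; the IP addresses are placeholders and must be replaced with the ZooKeeper, HDFS, Yarn, and DBService host addresses recorded on MRS Manager.
+
+.. code-block:: bash
+
+   # Placeholder host IP addresses; replace them with the real ones
+   for host in 192.168.0.11 192.168.0.12 192.168.0.13 192.168.0.14; do
+       ping -c 3 "$host" > /dev/null && echo "$host reachable" || echo "$host unreachable"
+   done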
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16005_number_of_failed_hive_sql_executions_in_the_last_period_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16005_number_of_failed_hive_sql_executions_in_the_last_period_exceeds_the_threshold.rst new file mode 100644 index 0000000..cd4945c --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-16005_number_of_failed_hive_sql_executions_in_the_last_period_exceeds_the_threshold.rst @@ -0,0 +1,51 @@ +:original_name: alm_16005.html + +.. _alm_16005: + +ALM-16005 Number of Failed Hive SQL Executions in the Last Period Exceeds the Threshold +======================================================================================= + +Description +----------- + +The system checks whether the number of Hive SQL statements that fail to be executed has exceeded the threshold in the last 10-minute period. This alarm is generated when the number of failed Hive SQL statement executions in the last 10 minutes is greater than the threshold. In the next 10 minutes, if the number of failed Hive SQL statement executions is less than the threshold, the alarm is automatically cleared. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +16005 Major Yes +======== ============== ========== + +Parameter +--------- + +=========== ========================================= +Parameter Description +=========== ========================================= +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +=========== ========================================= + +Impact on the System +-------------------- + +None + +Possible Causes +--------------- + +The Hive SQL syntax is incorrect. As a result, the Hive SQL statements fail to be executed. + +Procedure +--------- + +Check the Hive SQL statements that fail to be executed, correct the syntax, and execute the SQL statements again. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18000_yarn_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18000_yarn_service_unavailable.rst new file mode 100644 index 0000000..64bddbd --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18000_yarn_service_unavailable.rst @@ -0,0 +1,136 @@ +:original_name: alm_18000.html + +.. _alm_18000: + +ALM-18000 Yarn Service Unavailable +================================== + +Description +----------- + +The alarm module checks the Yarn service status every 30 seconds. This alarm is generated when the Yarn service is unavailable. + +This alarm is cleared when the Yarn service recovers. 
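+
+Before starting the procedure, the ResourceManager HA state can be checked from a cluster client as a quick sanity check. This is a sketch; the ResourceManager IDs **rm1** and **rm2** are common defaults and may differ in your configuration.
+
+.. code-block:: bash
+
+   # Expected output for a healthy HA pair: one "active" and one "standby"
+   yarn rmadmin -getServiceState rm1
+   yarn rmadmin -getServiceState rm2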
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18000 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The cluster cannot provide the Yarn service. Users cannot run new applications. Submitted applications cannot be run. + +Possible Causes +--------------- + +- ZooKeeper is abnormal. +- HDFS is abnormal. +- There is no active ResourceManager node in the Yarn cluster. +- All NodeManager nodes in the Yarn cluster are abnormal. + +Procedure +--------- + +#. Check the ZooKeeper status. + + a. Go to the cluster details page and choose **Alarms**. + + b. In the alarm list, check whether the alarm ALM-13000 ZooKeeper Service Unavailable exists. + + - If yes, go to :ref:`1.c `. + - If no, go to :ref:`2.b `. + + c. .. _alm_18000__en-us_topic_0191813947_aalm-18000_mmccppss_ss2: + + Rectify the fault by following the handling procedure in :ref:`ALM-13000 ZooKeeper Service Unavailable `. Then, check whether this alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.b `. + +#. Check the HDFS status. + + a. Go to the cluster details page and choose **Alarms**. + + b. .. _alm_18000__en-us_topic_0191813947_aalm-18000_mmccppss_ss3: + + In the alarm list, check whether an HDFS alarm is generated. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`3.b `. + + c. .. _alm_18000__en-us_topic_0191813947_aalm-18000_mmccppss_ss4: + + Click **Alarms**, and handle HDFS alarms according to **Alarm Help**. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3.b `. + +#. Check the ResourceManager status in the Yarn cluster. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. .. _alm_18000__en-us_topic_0191813947_aalm-18000_mmccppss_ss5: + + Click **Yarn**. + + c. In **Yarn Summary**, check whether there is an active ResourceManager node in the Yarn cluster. + + - If yes, go to :ref:`4.b `. + - If no, go to :ref:`5 `. + +#. Check the NodeManager node status in the Yarn cluster. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. .. _alm_18000__en-us_topic_0191813947_step_5: + + Choose **Yarn** > **Instances**. + + c. Check **Health Status** of NodeManager, and check whether there are unhealthy nodes. + + - If yes, go to :ref:`4.d `. + - If no, go to :ref:`5 `. + + d. .. _alm_18000__en-us_topic_0191813947_aalm-18000_mmccppss_step_7: + + Rectify the fault by following the procedure provided in :ref:`ALM-18002 NodeManager Heartbeat Lost ` or :ref:`ALM-18003 NodeManager Unhealthy `. Then, check whether this alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. .. _alm_18000__en-us_topic_0191813947_li572522141314: + + Collect fault information. + + a.
On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18002_nodemanager_heartbeat_lost.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18002_nodemanager_heartbeat_lost.rst new file mode 100644 index 0000000..901860c --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18002_nodemanager_heartbeat_lost.rst @@ -0,0 +1,64 @@ +:original_name: alm_18002.html + +.. _alm_18002: + +ALM-18002 NodeManager Heartbeat Lost +==================================== + +Description +----------- + +The system checks the number of lost NodeManager nodes every 30 seconds, and compares the number of lost nodes with the threshold. The **Lost Nodes** indicator has a default threshold. This alarm is generated when the value of the **Lost Nodes** indicator exceeds the threshold. + +This alarm is cleared when the value of **Lost Nodes** is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18002 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- The lost NodeManager node cannot provide the Yarn service. +- The number of containers decreases, so the cluster performance deteriorates. + +Possible Causes +--------------- + +- NodeManager is forcibly deleted without decommission. +- All NodeManager instances are stopped or the NodeManager process is faulty. +- The host where the NodeManager node resides is faulty. +- The network between the NodeManager and ResourceManager is disconnected or busy. + +Procedure +--------- + +#. Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
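+
+Before exporting logs, it can help to confirm which NodeManagers Yarn currently reports as lost. The following sketch uses the standard Yarn CLI from a cluster client node.
+
+.. code-block:: bash
+
+   # List all NodeManagers and their states
+   yarn node -list -all
+
+   # Show only the NodeManagers reported as LOST
+   yarn node -list -states LOST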
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18003_nodemanager_unhealthy.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18003_nodemanager_unhealthy.rst new file mode 100644 index 0000000..3326e42 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18003_nodemanager_unhealthy.rst @@ -0,0 +1,62 @@ +:original_name: alm_18003.html + +.. _alm_18003: + +ALM-18003 NodeManager Unhealthy +=============================== + +Description +----------- + +The system checks the number of abnormal NodeManager nodes every 30 seconds, and compares the number of abnormal nodes with the threshold. The **Unhealthy Nodes** indicator has a default threshold. This alarm is generated when the value of the **Unhealthy Nodes** indicator exceeds the threshold. + +This alarm is cleared when the value of **Unhealthy Nodes** is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18003 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- The faulty NodeManager node cannot provide the Yarn service. +- The number of containers decreases, so the cluster performance deteriorates. + +Possible Causes +--------------- + +- The disk space of the host where the NodeManager node resides is insufficient. +- User **omm** does not have the permission to access a local directory on the NodeManager node. + +Procedure +--------- + +#. Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
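+
+The possible causes above can be narrowed down directly on the affected NodeManager host before contacting support. This is a sketch; the directory shown is only an illustrative example, because the actual NodeManager local and log directories are defined by the cluster configuration.
+
+.. code-block:: bash
+
+   su - omm
+
+   # Example path only; use the local directories configured for NodeManager in your cluster
+   ls -ld /srv/BigData/hadoop/data1/nm/localdir
+   touch /srv/BigData/hadoop/data1/nm/localdir/.write_test && rm -f /srv/BigData/hadoop/data1/nm/localdir/.write_test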
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18004_nodemanager_disk_usability_ratio_is_lower_than_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18004_nodemanager_disk_usability_ratio_is_lower_than_the_threshold.rst new file mode 100644 index 0000000..8dc9979 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18004_nodemanager_disk_usability_ratio_is_lower_than_the_threshold.rst @@ -0,0 +1,62 @@ +:original_name: alm_18004.html + +.. _alm_18004: + +ALM-18004 NodeManager Disk Usability Ratio Is Lower Than the Threshold +====================================================================== + +Description +----------- + +The system checks the available disk space of each NodeManager node every 30 seconds and compares the disk availability rate with the threshold. A default threshold range is provided for the **NodeManager Disk Usability Ratio**. This alarm is generated when the system detects that the actual **NodeManager Disk Usability Ratio** is lower than the threshold. + +This alarm is automatically cleared when the value of **NodeManager Disk Usability Ratio** is greater than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18004 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +- The NodeManager node whose disk availability rate is lower than the threshold may fail to provide the Yarn service. +- The number of containers decreases, so the cluster performance may deteriorate. + +Possible Causes +--------------- + +- The disk space of the host where the NodeManager node resides is insufficient. +- User **omm** does not have the permission to access a local directory on the NodeManager node. + +Procedure +--------- + +#. Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
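+
+Disk availability on the affected NodeManager host can be confirmed with standard Linux tools before collecting logs. The snippet below is a sketch, and the 90% cut-off is only an example.
+
+.. code-block:: bash
+
+   # Show file systems that are 90% full or more (header line included)
+   df -h | awk 'NR==1 || $5+0 >= 90'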
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18006_mapreduce_job_execution_timeout.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18006_mapreduce_job_execution_timeout.rst new file mode 100644 index 0000000..9d55e2f --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18006_mapreduce_job_execution_timeout.rst @@ -0,0 +1,104 @@ +:original_name: alm_18006.html + +.. _alm_18006: + +ALM-18006 MapReduce Job Execution Timeout +========================================= + +Description +----------- + +The alarm module checks the MapReduce job execution every 30 seconds. This alarm is generated when the execution of a submitted MapReduce job times out. + +This alarm must be manually cleared. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18006 Major No +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Execution of the submitted MapReduce job times out, so no execution result can be obtained. Execute the job again after rectifying the fault. + +Possible Causes +--------------- + +It takes a long time to execute a MapReduce job. However, the specified time is less than the required execution time. + +Procedure +--------- + +#. Check whether time is improperly set. + + Set **-Dapplication.timeout.interval** to a larger value, or do not set the parameter. Check whether the MapReduce job can be executed. + + - If yes, go to :ref:`2.e `. + - If no, go to :ref:`2.b `. + +#. Check the Yarn status. + + a. Go to the cluster details page and choose **Alarms**. + + b. .. _alm_18006__en-us_topic_0191813946_substep_03d21a89: + + In the alarm list on MRS Manager, check whether the alarm ALM-18000 Yarn Service Unavailable is generated. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`3 `. + + c. .. _alm_18006__en-us_topic_0191813946_substep_03d82569: + + Rectify the fault by following the handling procedure in :ref:`ALM-18000 Yarn Service Unavailable `. + + d. Run the MapReduce job command again to check whether the MapReduce job can be executed. + + - If yes, go to :ref:`2.e `. + - If no, go to :ref:`4 `. + + e. .. 
_alm_18006__en-us_topic_0191813946_clean: + + In the alarm list, click |image1| in the **Operation** column of the alarm to manually clear the alarm. No further action is required. + +#. .. _alm_18006__en-us_topic_0191813946_li12092809151957: + + Adjust the timeout threshold. + + On MRS Manager, choose **System** > **Threshold Configuration** > **Services** > **Yarn** > **Timed out Applications**, and increase the maximum number of timeout tasks allowed by the current threshold rule. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_18006__en-us_topic_0191813946_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001349257373.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18008_heap_memory_usage_of_yarn_resourcemanager_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18008_heap_memory_usage_of_yarn_resourcemanager_exceeds_the_threshold.rst new file mode 100644 index 0000000..5dadbf6 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18008_heap_memory_usage_of_yarn_resourcemanager_exceeds_the_threshold.rst @@ -0,0 +1,84 @@ +:original_name: alm_18008.html + +.. _alm_18008: + +ALM-18008 Heap Memory Usage of Yarn ResourceManager Exceeds the Threshold +========================================================================= + +Description +----------- + +The system checks the heap memory usage of Yarn ResourceManager every 30 seconds and compares the actual usage with the threshold. The alarm is generated when the heap memory usage of Yarn ResourceManager exceeds the threshold (80% of the maximum memory by default). + +To change the threshold, choose **System** > **Threshold Configuration** > **Service** > **Yarn**. The alarm is cleared when the heap memory usage is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18008 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. 
| ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +When the heap memory usage of Yarn ResourceManager is overhigh, the performance of Yarn task submission and operation is affected. What is more, a memory overflow occurs so that the Yarn service is unavailable. + +Possible Causes +--------------- + +The heap memory of the Yarn ResourceManager instance on the node is overused or the heap memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +#. Check the heap memory usage. + + a. Go to the MRS cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **18008** and view the IP address and role name of the instance in **Location**. + + c. Choose **Components** > **Yarn** > **Instances** > **ResourceManager** (IP address of the instance for which the alarm is generated) > **Customize** > **Percentage of Used Heap Memory of the ResourceManager**. Check the heap memory usage. + + d. Check whether the heap memory usage of ResourceManager has reached the threshold (80% of the maximum memory). + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_18008__en-us_topic_0191813919_li1011493181634: + + Choose **Components** > **Yarn** > **Service Configuration**. Set **Type** to **All** and choose **ResourceManager** > **System**. Change the values of **-Xmx** and **-Xms** in the **GC_OPTS** parameter based on the site requirements to ensure that the value of **-Xms** is less than that of **-Xmx**. Click **Save Configuration** and select **Restart Role Instance**. Click **OK**. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_18008__en-us_topic_0191813919_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18009_heap_memory_usage_of_mapreduce_jobhistoryserver_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18009_heap_memory_usage_of_mapreduce_jobhistoryserver_exceeds_the_threshold.rst new file mode 100644 index 0000000..b2d4718 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18009_heap_memory_usage_of_mapreduce_jobhistoryserver_exceeds_the_threshold.rst @@ -0,0 +1,84 @@ +:original_name: alm_18009.html + +.. _alm_18009: + +ALM-18009 Heap Memory Usage of MapReduce JobHistoryServer Exceeds the Threshold +=============================================================================== + +Description +----------- + +The system checks the heap memory usage of MapReduce JobHistoryServer every 30 seconds and compares the actual usage with the threshold. The alarm is generated when the heap memory usage of MapReduce JobHistoryServer exceeds the threshold (80% of the maximum memory by default). + +To change the threshold, choose **System** > **Threshold Configuration** > **Service** > **MapReduce**. The alarm is cleared when the heap memory usage is less than or equal to the threshold. 
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18009 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +When the heap memory usage of MapReduce JobHistoryServer is overhigh, the performance of MapReduce log archiving is affected. What is more, a memory overflow occurs so that the Yarn service is unavailable. + +Possible Causes +--------------- + +The heap memory of the MapReduce JobHistoryServer instance on the node is overused or the heap memory is inappropriately allocated. As a result, the usage exceeds the threshold. + +Procedure +--------- + +#. Check the heap memory usage. + + a. Go to the cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **18009** and view the IP address and role name of the instance in **Location**. + + c. Choose **Components** > **MapReduce** > **Instance** > **JobHistoryServer** (IP address of the instance for which the alarm is generated) > **Customize** > **JobHistoryServer Heap Memory Usage Statistics**. Check the heap memory usage. + + d. Check whether the heap memory usage of JobHistoryServer has reached the threshold (80% of the maximum heap memory). + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_18009__en-us_topic_0191813867_li1011493181634: + + Choose **Components** > **MapReduce** > **Service Configuration**. Set **Type** to **All** and choose **JobHistoryServer** > **System**. Increase the value of **-Xmx** in the **GC_OPTS** parameter as required, click **Save Configuration**, and select **Restart the affected services or instances.** Click **OK**. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_18009__en-us_topic_0191813867_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
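+
+Step 1.e above increases the **-Xmx** value in the **GC_OPTS** parameter. The value is entered on the service configuration page rather than on the command line; it is shown below in shell form only for illustration, and the sizes are examples that must be chosen based on the heap usage actually observed for JobHistoryServer.
+
+.. code-block:: bash
+
+   # Illustrative GC_OPTS value only; do not copy the sizes as-is
+   GC_OPTS="-Xms2G -Xmx4G"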
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18010_number_of_pending_yarn_tasks_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18010_number_of_pending_yarn_tasks_exceeds_the_threshold.rst new file mode 100644 index 0000000..8398b8d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18010_number_of_pending_yarn_tasks_exceeds_the_threshold.rst @@ -0,0 +1,91 @@ +:original_name: alm_18010.html + +.. _alm_18010: + +ALM-18010 Number of Pending Yarn Tasks Exceeds the Threshold +============================================================ + +Description +----------- + +The system checks the number of pending Yarn tasks every 30 seconds and compares the number of tasks with the threshold. This alarm is generated when the number of pending tasks exceeds the threshold. + +You can change the threshold by choosing **System** > **Configure Alarm Threshold** > **Service** > **Yarn** > **Queue Root Pending Applications** > **Queue Root Pending Applications** on MRS Manager. + +This alarm is cleared when the number of pending tasks is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18010 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Tasks may be stacked and cannot be processed in a timely manner. + +Possible Causes +--------------- + +The computing capability of the cluster is lower than the task submission rate. As a result, the task cannot be processed in a timely manner after being submitted. + +Procedure +--------- + +#. Check the usage of memory and vCores on the Yarn page. + + Check whether the values of **Memory Used|Memory Total** and **VCores Used|VCores Total** on the native Yarn page reach or approach the maximum values. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`5 `. + +#. .. _alm_18010__en-us_topic_0227101889_li181801656143013: + + Check the number of submitted tasks. + + Check whether the running tasks are submitted at a normal frequency. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm_18010__en-us_topic_0227101889_li10509161210322: + + Scale out the cluster. + + The scale-out is based on the site requirements. 
For details, see :ref:`Manually Scaling Out a Cluster `. + +#. After the scale-out is completed, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. .. _alm_18010__en-us_topic_0227101889_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18011_memory_of_pending_yarn_tasks_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18011_memory_of_pending_yarn_tasks_exceeds_the_threshold.rst new file mode 100644 index 0000000..b260412 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18011_memory_of_pending_yarn_tasks_exceeds_the_threshold.rst @@ -0,0 +1,91 @@ +:original_name: alm_18011.html + +.. _alm_18011: + +ALM-18011 Memory of Pending Yarn Tasks Exceeds the Threshold +============================================================ + +Description +----------- + +The system checks the memory of pending Yarn tasks every 30 seconds and compares the memory with the threshold. This alarm is generated when the memory of pending tasks exceeds the threshold. + +You can change the threshold by choosing **System** > **Configure Alarm Threshold** > **Service** > **Yarn** > **Queue Root Pending Memory** > **Queue Root Pending Memory** on MRS Manager. + +This alarm is cleared when the memory of pending tasks is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +18011 Major Yes +======== ============== ===================== + +Parameters +---------- + ++-------------------+---------------------------------------------------------+ +| Parameter | Description | ++===================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+---------------------------------------------------------+ +| Trigger Condition | Specifies the threshold for triggering the alarm. | ++-------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +Tasks may be stacked and cannot be processed in a timely manner. + +Possible Causes +--------------- + +The computing capability of the cluster is lower than the task submission rate. As a result, the task cannot be processed in a timely manner after being submitted. + +Procedure +--------- + +#. Check the usage of memory and vCores on the Yarn page. + + Check whether the values of **Memory Used|Memory Total** and **VCores Used|VCores Total** on the native Yarn page reach or approach the maximum values. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`5 `. 
+ +#. .. _alm_18011__en-us_topic_0227101910_li181801656143013: + + Check the number of submitted tasks. + + Check whether the running tasks are submitted at a normal frequency. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`5 `. + +#. .. _alm_18011__en-us_topic_0227101910_li10509161210322: + + Scale out the cluster. + + The scale-out is based on the site requirements. For details, see :ref:`Manually Scaling Out a Cluster `. + +#. After the scale-out is completed, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +#. .. _alm_18011__en-us_topic_0227101910_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18012_number_of_terminated_yarn_tasks_in_the_last_period_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18012_number_of_terminated_yarn_tasks_in_the_last_period_exceeds_the_threshold.rst new file mode 100644 index 0000000..de00392 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18012_number_of_terminated_yarn_tasks_in_the_last_period_exceeds_the_threshold.rst @@ -0,0 +1,51 @@ +:original_name: alm_18012.html + +.. _alm_18012: + +ALM-18012 Number of Terminated Yarn Tasks in the Last Period Exceeds the Threshold +================================================================================== + +Description +----------- + +The system checks the number of terminated Yarn tasks every 10 minutes. This alarm is generated when the number of terminated Yarn tasks in the last 10 minutes is greater than the threshold. This alarm is automatically cleared when the number of terminated Yarn tasks is less than the threshold in the next 10 minutes. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18012 Major Yes +======== ============== ========== + +Parameter +--------- + +=========== ========================================= +Parameter Description +=========== ========================================= +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +=========== ========================================= + +Impact on the System +-------------------- + +None + +Possible Causes +--------------- + +A user manually stops a running Yarn task. + +Procedure +--------- + +Check the task termination operator in the Yarn logs and audit logs, and determine the cause of the task termination. 
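+
+To identify which applications were terminated and when, the Yarn CLI can be used alongside the ResourceManager audit logs. The commands below are a sketch; the application ID is a placeholder.
+
+.. code-block:: bash
+
+   # Recently killed applications
+   yarn application -list -appStates KILLED
+
+   # Placeholder application ID; the Diagnostics field usually records the kill operation
+   yarn application -status application_1600000000000_0001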
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18013_number_of_failed_yarn_tasks_in_the_last_period_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18013_number_of_failed_yarn_tasks_in_the_last_period_exceeds_the_threshold.rst new file mode 100644 index 0000000..0a6187e --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-18013_number_of_failed_yarn_tasks_in_the_last_period_exceeds_the_threshold.rst @@ -0,0 +1,51 @@ +:original_name: alm_18013.html + +.. _alm_18013: + +ALM-18013 Number of Failed Yarn Tasks in the Last Period Exceeds the Threshold +============================================================================== + +Description +----------- + +The system checks the number of failed Yarn tasks every 10 minutes. This alarm is generated when the number of failed Yarn tasks in the last 10 minutes is greater than the threshold. This alarm is automatically cleared when the number of failed Yarn tasks is less than the threshold in the next 10 minutes. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +18013 Major Yes +======== ============== ========== + +Parameter +--------- + +=========== ========================================= +Parameter Description +=========== ========================================= +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +=========== ========================================= + +Impact on the System +-------------------- + +None + +Possible Causes +--------------- + +The submitted Yarn job program is incorrect. For example, the parameter for Spark to submit a job is incorrect. + +Procedure +--------- + +Check the log of the failed job, locate the failure cause, modify the job, and submit the job again. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-19000_hbase_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-19000_hbase_service_unavailable.rst new file mode 100644 index 0000000..60e51db --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-19000_hbase_service_unavailable.rst @@ -0,0 +1,105 @@ +:original_name: alm_19000.html + +.. _alm_19000: + +ALM-19000 HBase Service Unavailable +=================================== + +Description +----------- + +The alarm module checks the HBase service status every 30 seconds. This alarm is generated when the HBase service is unavailable. + +This alarm is cleared when the HBase service recovers. 
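+
+As an optional quick check that supplements the procedure below, the overall HBase status can be queried from a cluster client node with the HBase shell **status** command (a sketch):
+
+.. code-block:: bash
+
+   su - omm
+
+   # Prints the number of active/backup masters and live/dead RegionServers
+   echo "status 'summary'" | hbase shell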
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +19000 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Operations cannot be performed, such as reading or writing data and creating tables. + +Possible Causes +--------------- + +- ZooKeeper is abnormal. +- HDFS is abnormal. +- HBase is abnormal. +- The network is abnormal. + +Procedure +--------- + +#. Check the ZooKeeper status. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. In the service list, check whether the health status of ZooKeeper is **Good**. + + - If yes, go to :ref:`2.a `. + - If no, go to :ref:`1.c `. + + c. .. _alm_19000__en-us_topic_0191813964_aalm-19000_mmccppss_alm-53004: + + In the alarm list, check whether the alarm ALM-13000 ZooKeeper Service Unavailable exists. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`2.a `. + + d. .. _alm_19000__en-us_topic_0191813964_aalm-19000_mmccppss_process: + + Rectify the fault by following the steps provided in ALM-13000 ZooKeeper Service Unavailable. + + e. Wait several minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Check the HDFS status. + + a. .. _alm_19000__en-us_topic_0191813964_aalm-19000_mmccppss_hdfs: + + On MRS Manager, check whether the ALM-14000 HDFS Service Unavailable alarm is reported. + + - If yes, go to :ref:`2.b `. + - If no, go to :ref:`3 `. + + b. .. _alm_19000__en-us_topic_0191813964_alm: + + Rectify the fault by following the steps provided in ALM-14000 HDFS Service Unavailable. + + c. Wait several minutes and check whether the alarm is cleared. + +#. .. _alm_19000__en-us_topic_0191813964_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-19006_hbase_replication_sync_failed.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-19006_hbase_replication_sync_failed.rst new file mode 100644 index 0000000..e00839a --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-19006_hbase_replication_sync_failed.rst @@ -0,0 +1,160 @@ +:original_name: alm_19006.html + +.. _alm_19006: + +ALM-19006 HBase Replication Sync Failed +======================================= + +Description +----------- + +This alarm is generated when disaster recovery (DR) data fails to be synchronized to a standby cluster. + +This alarm is cleared when DR data synchronization succeeds. 
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +19006 Major Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +HBase data in a cluster fails to be synchronized to the standby cluster, causing data inconsistency between active and standby clusters. + +Possible Causes +--------------- + +- The HBase service on the standby cluster is abnormal. +- The network is abnormal. + +Procedure +--------- + +#. Observe whether the system automatically clears the alarm. + + a. Go to the cluster details page and choose **Alarms**. + + b. In the alarm list, click the alarm to obtain alarm generation time from **Generated Time** in **Alarm Details**. Check whether the alarm has existed for over 5 minutes. + + - If yes, go to :ref:`2.a `. + - If no, go to :ref:`1.c `. + + c. .. _alm_19006__en-us_topic_0191813928_step3: + + Wait 5 minutes and check whether the alarm is automatically cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Check the HBase service status of the standby cluster. + + a. .. _alm_19006__en-us_topic_0191813928_li1255962015108: + + Go to the cluster details page and choose **Alarms**. + + b. In the alarm list, click the alarm and obtain **HostName** from **Location** in **Alarm Details**. + + c. Log in to the node where the HBase client of the active cluster is located. Run the following commands to switch the user: + + **sudo su - root** + + **su - omm** + + d. Run the **status 'replication', 'source'** command to check the synchronization status of the faulty node. + + The synchronization status of a node is as follows. + + .. code-block:: + + 10-10-10-153: + SOURCE: PeerID=abc, SizeOfLogQueue=0, ShippedBatches=2, ShippedOps=2, ShippedBytes=320, LogReadInBytes=1636, LogEditsRead=5, LogEditsFiltered=3, SizeOfLogToReplicate=0, TimeForLogToReplicate=0, ShippedHFiles=0, SizeOfHFileRefsQueue=0, AgeOfLastShippedOp=0, TimeStampsOfLastShippedOp=Mon Jul 18 09:53:28 CST 2016, Replication Lag=0, FailedReplicationAttempts=0 + SOURCE: PeerID=abc1, SizeOfLogQueue=0, ShippedBatches=1, ShippedOps=1, ShippedBytes=160, LogReadInBytes=1636, LogEditsRead=5, LogEditsFiltered=3, SizeOfLogToReplicate=0, TimeForLogToReplicate=0, ShippedHFiles=0, SizeOfHFileRefsQueue=0, AgeOfLastShippedOp=16788, TimeStampsOfLastShippedOp=Sat Jul 16 13:19:00 CST 2016, Replication Lag=16788, FailedReplicationAttempts=5 + + e. Obtain **PeerID** corresponding to a record whose **FailedReplicationAttempts** value is greater than 0. + + In the preceding step, data on the faulty node **10-10-10-153** fails to be synchronized to a standby cluster whose **PeerID** is **abc1**. + + f. .. _alm_19006__en-us_topic_0191813928_peerid: + + Run the **list_peers** command to find the cluster and the HBase instance corresponding to **PeerID**. + + .. 
code-block:: + + PEER_ID CLUSTER_KEY STATE TABLE_CFS + abc1 10.10.10.110,10.10.10.119,10.10.10.133:24002:/hbase2 ENABLED + abc 10.10.10.110,10.10.10.119,10.10.10.133:24002:/hbase ENABLED + + In the preceding information, **/hbase2** indicates that data is synchronized to the HBase2 instance of the standby cluster. + + g. In the service list of the standby cluster, check whether the health status of the HBase instance obtained in :ref:`2.f ` is **Good**. + + - If yes, go to :ref:`3.a `. + - If no, go to :ref:`2.h `. + + h. .. _alm_19006__en-us_topic_0191813928_alm-19000: + + In the alarm list, check whether the alarm ALM-19000 HBase Service Unavailable exists. + + - If yes, go to :ref:`2.i `. + - If no, go to :ref:`3.a `. + + i. .. _alm_19006__en-us_topic_0191813928_aalm-19006_mmccppss_process: + + Rectify the fault by following the steps provided in ALM-19000 HBase Service Unavailable. + + j. Wait several minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3.a `. + +#. Check the network connection between RegionServers on active and standby clusters. + + a. .. _alm_19006__en-us_topic_0191813928_li594194191119: + + Go to the cluster details page and choose **Alarms**. + + b. In the alarm list, click the alarm and obtain **HostName** from **Location** in **Alarm Details**. + + c. Log in to the faulty RegionServer node. + + d. Run the **ping** command to check whether the network connection between the faulty RegionServer node and the host where RegionServer of the standby cluster resides is normal. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`3.e `. + + e. .. _alm_19006__en-us_topic_0191813928_s1: + + Contact the O&M personnel to restore the network. + + f. After the network recovers, check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_19006__en-us_topic_0191813928_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-20002_hue_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-20002_hue_service_unavailable.rst new file mode 100644 index 0000000..672fbc5 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-20002_hue_service_unavailable.rst @@ -0,0 +1,156 @@ +:original_name: alm_20002.html + +.. _alm_20002: + +ALM-20002 Hue Service Unavailable +================================= + +Description +----------- + +The system checks the Hue service status every 60 seconds. This alarm is generated if the Hue service is unavailable. + +This alarm is cleared when the Hue service is normal. 
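+
+The procedure below checks the network connection between the active Hue node and the DBService node with the **ping** command. A standalone sketch of that probe is shown here; the IP address is a placeholder for the DBService node IP address obtained as described in the procedure.
+
+.. code-block::
+
+   # Run on the active Hue node; 192.168.0.12 is a placeholder for the DBService node IP address.
+   ping -c 3 192.168.0.12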
+ +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +20002 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The system cannot provide data loading, query, and extraction services. + +Possible Causes +--------------- + +- The KrbServer service on which Hue depends is abnormal. +- The DBService service on which Hue depends is abnormal. +- The network connection to DBService is abnormal. + +Procedure +--------- + +**Check whether the KrbServer service is normal.** + +#. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + +#. In the service list, check whether **Health Status** of **KrbServer** is **Good**. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`3 `. + +#. .. _alm_20002__en-us_topic_0191813890_en-us_topic_0087039274_li3201870494312: + + Click **Restart** in the **Operation** column of the KrbServer service to restart the service. + +#. Wait for several minutes. Check whether ALM-20002 Hue Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. + +**Check whether DBService is normal.** + +5. .. _alm_20002__en-us_topic_0191813890_li1965161312249: + + Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + +6. In the service list, check whether **Health Status** of **DBService** is **Good**. + + - If yes, go to :ref:`9 `. + - If no, go to :ref:`7 `. + +7. .. _alm_20002__en-us_topic_0191813890_en-us_topic_0087039274_li6300946494312: + + Click **Restart** in the **Operation** column of the DBService service to restart the service. + + .. note:: + + To restart the service, you need to enter the password of the MRS Manager administrator and select **Start or restart related services** . + +8. Wait for several minutes. Check whether ALM-20002 Hue Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`9 `. + +**Check whether the network connected to DBService is normal.** + +9. .. _alm_20002__en-us_topic_0191813890_en-us_topic_0087039274_li3066850394312: + + Choose **Components** > **Hue** > **Instance** and record the IP address of the active Hue node. + +10. Use PuTTY to log in to the active Hue. + +11. Run the **ping** command to check whether the network connection between the host where the active Hue is located and the host where DBService is located is normal. (The method of obtaining the DBService service IP address is the same as that of obtaining the active Hue IP address.) + + - If yes, go to :ref:`17 `. + - If no, go to :ref:`12 `. + +12. .. _alm_20002__en-us_topic_0191813890_en-us_topic_0087039274_li4180632994312: + + Contact the network administrator to repair the network. + +13. Wait for several minutes. Check whether ALM-20002 Hue Service Unavailable is cleared. 
+ + - If yes, no further action is required. + - If no, go to :ref:`17 `. + + **Collect fault information.** + +14. On MRS Manager, choose **System** > **Export Log**. + +15. Select the following nodes from the **Services** drop-down list and click **OK**. + + - Hue + - Controller + +16. Set **Start Time** and **End Time** for log collection to 10 minutes before and after the alarm is generated, select an export type, and click **OK** to collect the corresponding fault log information. + +**Restart Hue.** + +17. .. _alm_20002__en-us_topic_0191813890_li8901153153924: + + Choose **Components** > **Hue**. + +18. Choose **More** > **Restart Service** and click **OK**. + +19. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`20 `. + +20. .. _alm_20002__en-us_topic_0191813890_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-23001_loader_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-23001_loader_service_unavailable.rst new file mode 100644 index 0000000..2f9a629 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-23001_loader_service_unavailable.rst @@ -0,0 +1,243 @@ +:original_name: alm_23001.html + +.. _alm_23001: + +ALM-23001 Loader Service Unavailable +==================================== + +Description +----------- + +The system checks the Loader service availability every 60 seconds. This alarm is generated if the Loader service is unavailable and is cleared after the Loader service recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +23001 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Data loading, import, and conversion are unavailable. + +Possible Causes +--------------- + +- The services that Loader depends on are abnormal. + + - ZooKeeper is abnormal. + - HDFS is abnormal. + - DBService is abnormal. + - Yarn is abnormal. + - MapReduce is abnormal. + +- The network is faulty. Loader cannot communicate with its dependent services. +- Loader is running improperly. + +Procedure +--------- + +#. Check the ZooKeeper status. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Choose **ZooKeeper** and check whether the health status of ZooKeeper is normal. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`1.c `. + + c. .. 
_alm_23001__en-us_topic_0191813961_li4731152065314: + + Choose **More > Restart Service** to restart ZooKeeper. After ZooKeeper starts, check whether the "ALM-23001 Loader Service Unavailable" alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`1.d `. + + d. .. _alm_23001__en-us_topic_0191813961_li173182016530: + + On MRS Manager, check whether the ALM-12007 Process Fault alarm is reported. + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2.a `. + + e. .. _alm_23001__en-us_topic_0191813961_li11731152014534: + + In **Alarm Details** of the "ALM-12007 Process Fault" alarm, check whether **ServiceName** is **ZooKeeper**. + + - If yes, go to :ref:`1.f `. + - If no, go to :ref:`2.a `. + + f. .. _alm_23001__en-us_topic_0191813961_li167320209539: + + Clear the alarm according to the handling suggestions of "ALM-12007 Process Fault". + + g. Check whether the "ALM-23001 Loader Service Unavailable" alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Check the HDFS status. + + a. .. _alm_23001__en-us_topic_0191813961_li103551920123512: + + Go to the MRS cluster details page and choose **Alarms**. + + b. On MRS Manager, check whether the "ALM-14000 HDFS Service Unavailable alarm" is reported. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`3.a `. + + c. .. _alm_23001__en-us_topic_0191813961_li167011853195320: + + Clear the alarm according to the handling suggestions of "ALM-14000 HDFS Service Unavailable". + + d. Check whether the "ALM-23001 Loader Service Unavailable" alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3.a `. + +#. Check the DBService status. + + a. .. _alm_23001__en-us_topic_0191813961_li1554455818353: + + Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Choose **DBService** to check whether the health status of DBService is normal. + + - If yes, go to :ref:`4.a `. + - If no, go to :ref:`3.c `. + + c. .. _alm_23001__en-us_topic_0191813961_li122981864542: + + Choose **More > Restart Service** to restart DBService. After DBService starts, check whether the "ALM-23001 Loader Service Unavailable" alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4.a `. + +#. Check the MapReduce status. + + a. .. _alm_23001__en-us_topic_0191813961_li14598145163614: + + Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Choose **MapReduce** and check whether the health status of MapReduce is normal. + + - If yes, go to :ref:`5.a `. + - If no, go to :ref:`4.c `. + + c. .. _alm_23001__en-us_topic_0191813961_li191227237549: + + Choose **More > Restart Service** to restart MapReduce. After MapReduce starts, check whether the "ALM-23001 Loader Service Unavailable" alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5.a `. + +#. Check the Yarn status. + + a. .. _alm_23001__en-us_topic_0191813961_li984194223716: + + Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Choose **Yarn** and check whether the health status of Yarn is normal. + + - If yes, go to :ref:`5.d `. + - If no, go to :ref:`5.c `. + + c. .. 
_alm_23001__en-us_topic_0191813961_li126731375547: + + Choose **More > Restart Service** to restart Yarn. After Yarn starts, check whether the "ALM-23001 Loader Service Unavailable" alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`5.d `. + + d. .. _alm_23001__en-us_topic_0191813961_li11673173775413: + + On MRS Manager, check whether the "ALM-18000 Yarn Service Unavailable" alarm is reported. + + - If yes, go to :ref:`5.e `. + - If no, go to :ref:`6.a `. + + e. .. _alm_23001__en-us_topic_0191813961_li6673837155415: + + Clear the alarm according to the handling suggestions of "ALM-18000 Yarn Service Unavailable". + + f. Check whether the "ALM-23001 Loader Service Unavailable" alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`6.a `. + +#. Check the network connections between Loader and its dependent components. + + a. .. _alm_23001__en-us_topic_0191813961_li13825217113813: + + Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Click **Loader**. + + c. Click **Instance**. The Sqoop instance list is displayed. + + d. .. _alm_23001__en-us_topic_0191813961_li2928194985415: + + Record the management IP addresses of all Sqoop instances. + + e. Log in to the hosts using the IP addresses obtained in :ref:`6.d `. Run the following commands to switch the user: + + **sudo su - root** + + **su - omm** + + f. Run the **ping** command to check whether the network connection between the hosts where the Sqoop instances reside and the dependent components is normal. (The dependent components include ZooKeeper, DBService, HDFS, MapReduce, and Yarn. The method to obtain the IP addresses of the dependent components is the same as that used to obtain the IP addresses of the Sqoop instances.) + + - If yes, go to :ref:`7 `. + - If no, go to :ref:`6.g `. + + g. .. _alm_23001__en-us_topic_0191813961_li10928124925412: + + Contact the network administrator to repair the network. + + h. Check whether the "ALM-23001 Loader Service Unavailable" alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`7 `. + +#. .. _alm_23001__en-us_topic_0191813961_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24000_flume_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24000_flume_service_unavailable.rst new file mode 100644 index 0000000..983422d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24000_flume_service_unavailable.rst @@ -0,0 +1,98 @@ +:original_name: alm_24000.html + +.. _alm_24000: + +ALM-24000 Flume Service Unavailable +=================================== + +Description +----------- + +The alarm module checks the Flume service status every 180 seconds. This alarm is generated if the Flume service is abnormal. + +This alarm is cleared after the Flume service recovers. 
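+
+The procedure below traces this alarm back to HDFS or LdapServer. As a quick manual probe of HDFS availability (a sketch only; the client path **/opt/client** is an assumption), run:
+
+.. code-block::
+
+   # Assumption: HDFS client installed under /opt/client; in a security cluster, run kinit <user> first.
+   source /opt/client/bigdata_env
+   hdfs dfs -ls /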
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24000 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Flume cannot work and data transmission is interrupted. + +Possible Causes +--------------- + +- HDFS is unavailable. +- LdapServer is unavailable. + +Procedure +--------- + +#. Check the HDFS status. + + a. Go to the MRS cluster details page and choose **Alarms**. + b. Check whether the ALM-14000 HDFS Service Unavailable alarm is generated. + + - If yes, clear the alarm according to the handling suggestions of "ALM-14000 HDFS Service Unavailable". + - If no, go to :ref:`2 `. + +#. .. _alm_24000__en-us_topic_0191813917_li56731580163419: + + Check the LdapServer status. + + Check whether the ALM-25000 LdapServer Service Unavailable alarm is generated. + + - If yes, clear the alarm according to the handling suggestions of "ALM-25000 LdapServer Service Unavailable". + - If no, go to :ref:`3.b `. + +#. Check whether the HDFS and LdapServer services are stopped. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. .. _alm_24000__en-us_topic_0191813917_li950355316374: + + In the service list on MRS Manager, check whether the HDFS and LdapServer services are stopped. + + - If yes, start the HDFS and LdapServer services and go to :ref:`3.c `. + - If no, go to :ref:`4 `. + + c. .. _alm_24000__en-us_topic_0191813917_li4163406916374: + + Check whether the "ALM-24000 Flume Service Unavailable" alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_24000__en-us_topic_0191813917_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24001_flume_agent_is_abnormal.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24001_flume_agent_is_abnormal.rst new file mode 100644 index 0000000..525ec9c --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24001_flume_agent_is_abnormal.rst @@ -0,0 +1,100 @@ +:original_name: alm_24001.html + +.. _alm_24001: + +ALM-24001 Flume Agent Is Abnormal +================================= + +Description +----------- + +This alarm is generated if the Flume agent monitoring module detects that the Flume agent process is abnormal. + +This alarm is cleared after the Flume agent process recovers. 
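+
+Before reviewing the configuration as described in the procedure below, it can help to confirm on the alarmed host whether the Flume agent process is running at all. This is a sketch; process names can differ slightly between versions.
+
+.. code-block::
+
+   # Run as user root on the host named in the alarm.
+   ps -ef | grep flume | grep -v grep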
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24001 Minor Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +Functions of the alarmed Flume agent instance are abnormal. Data transmission tasks of the instance are suspended. In real-time data transmission, data will be lost. + +Possible Causes +--------------- + +- The **JAVA_HOME** directory does not exist or the Java permission is incorrect. +- The permission of the Flume agent directory is incorrect. + +Procedure +--------- + +#. Check the Flume agent's configuration file. + + a. Log in to the host where the faulty node resides. Run the following command to switch to user **root**: + + **sudo su - root** + + b. Run the **cd** *Flume installation directory*\ **/fusioninsight-flume-1.6.0/conf/** command to go to Flume's configuration directory. + + c. Run the **cat ENV_VARS** command. Check whether the **JAVA_HOME** directory exists and whether the Flume agent user has execute permission of Java. + + - If yes, go to :ref:`2.a `. + - If no, go to :ref:`1.d `. + + d. .. _alm_24001__en-us_topic_0191813918_li5041523116491: + + Specify the correct JAVA_HOME directory and grant the Flume agent user with the execute permission of Java. Then go to :ref:`2.d `. + +#. Check the permission of the Flume agent directory. + + a. .. _alm_24001__en-us_topic_0191813918_li53200420164950: + + Log in to the host where the faulty node resides. Run the following command to switch to user **root**: + + **sudo su - root** + + b. Run the following command to access the installation directory of the Flume agent: + + **cd** *Flume agent installation directory* + + c. Run the **ls -al \* -R** command. Check whether the owner of all files is the Flume agent user. + + - If yes, go to :ref:`3 `. + - If no, run the **chown** command and change the owner of the files to the Flume agent user. Then go to :ref:`2.d `. + + d. .. _alm_24001__en-us_topic_0191813918_li22464349164950: + + Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_24001__en-us_topic_0191813918_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
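+
+The ownership check and correction in step 2 can be summarized by the following sketch. The installation directory **/opt/FlumeClient** and the user name **flume** are placeholders; use the actual Flume agent installation directory and run-as user of your deployment.
+
+.. code-block::
+
+   cd /opt/FlumeClient
+   # Every file should be owned by the Flume agent user.
+   ls -al * -R
+   # If the owner is wrong, hand the whole directory back to the Flume agent user.
+   chown -R flume /opt/FlumeClient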
+ +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24003_flume_client_connection_failure.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24003_flume_client_connection_failure.rst new file mode 100644 index 0000000..3ee2b09 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24003_flume_client_connection_failure.rst @@ -0,0 +1,110 @@ +:original_name: alm_24003.html + +.. _alm_24003: + +ALM-24003 Flume Client Connection Failure +========================================= + +Description +----------- + +The alarm module monitors the port connection status on the Flume server. This alarm is generated if the Flume server fails to receive a connection message from the Flume client in 3 consecutive minutes. + +This alarm is cleared after the Flume server receives a connection message from the Flume client. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24003 Major Yes +======== ============== ========== + +Parameters +---------- + +========== ============================================= +Parameter Description +========== ============================================= +ClientIP Specifies the IP address of the Flume client. +ServerIP Specifies the IP address of the Flume server. +ServerPort Specifies the port on the Flume server. +========== ============================================= + +Impact on the System +-------------------- + +The communication between the Flume client and server fails. The Flume client cannot send data to the Flume server. + +Possible Causes +--------------- + +- The network between the Flume client and server is faulty. +- The Flume client's process is abnormal. +- The Flume client is incorrectly configured. + +Procedure +--------- + +#. Check the network between the Flume client and server. + + a. Log in to the host where the alarmed Flume client resides. Run the following command to switch to user **root**: + + **sudo su - root** + + b. Run the **ping** *Flume server IP address* command to check whether the network between the Flume client and server is normal. + + - If yes, go to :ref:`2.a `. + - If no, go to :ref:`4 `. + +#. Check whether the Flume client's process is normal. + + a. .. _alm_24003__en-us_topic_0191813877_li33911624175511: + + Log in to the host where the alarmed Flume client resides. Run the following command to switch to user **root**: + + **sudo su - root** + + b. Run the **ps -ef|grep flume \|grep client** command to check whether the Flume client process exists. + + - If yes, go to :ref:`3.a `. + - If no, go to :ref:`4 `. + +#. Check the Flume client configuration. + + a. .. _alm_24003__en-us_topic_0191813877_li37860237175538: + + Log in to the host where the alarmed Flume client resides. Run the following command to switch to user **root**: + + **sudo su - root** + + b. Run the **cd** *Flume installation directory*\ **/fusioninsight-flume-1.6.0/conf/** command to go to Flume's configuration directory. + + c. Run the **cat properties.properties** command to query the current configuration file of the Flume client. + + d. 
Check whether the **properties.properties** file is correctly configured according to the configuration description of the Flume agent. + + - If yes, go to :ref:`3.e `. + - If no, go to :ref:`4 `. + + e. .. _alm_24003__en-us_topic_0191813877_li1644380175538: + + Modify the **properties.properties** configuration file. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_24003__en-us_topic_0191813877_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24004_flume_fails_to_read_data.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24004_flume_fails_to_read_data.rst new file mode 100644 index 0000000..45cb6d2 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24004_flume_fails_to_read_data.rst @@ -0,0 +1,143 @@ +:original_name: alm_24004.html + +.. _alm_24004: + +ALM-24004 Flume Fails to Read Data +================================== + +Description +----------- + +The alarm module monitors the Flume source status. This alarm is generated if the duration that Flume source fails to read data exceeds the threshold. + +Users can modify the threshold as required. + +This alarm is cleared if the source reads data successfully. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24004 Major Yes +======== ============== ========== + +Parameters +---------- + ++---------------+----------------------------------------------------------------+ +| Parameter | Description | ++===============+================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++---------------+----------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++---------------+----------------------------------------------------------------+ +| ComponentType | Specifies the component type for which the alarm is generated. | ++---------------+----------------------------------------------------------------+ +| ComponentName | Specifies the component name for which the alarm is generated. | ++---------------+----------------------------------------------------------------+ + +Impact on the System +-------------------- + +Data collection is stopped. + +Possible Causes +--------------- + +- The Flume source is faulty. +- The network is faulty. + +Procedure +--------- + +#. Check whether the Flume source is normal. + + a. Check whether the Flume source is the spoolDir type. + + - If yes, go to :ref:`1.b `. + - If no, go to :ref:`1.c `. + + b. .. _alm_24004__en-us_topic_0191813945_li57424576173633: + + Query the **spoolDir** directory and check whether all files have been sent. + + - If yes, no further action is required. + - If no, go to :ref:`1.e `. + + c. .. _alm_24004__en-us_topic_0191813945_li27889489173633: + + Check whether the Flume source is the Kafka type. 
+ + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`1.e `. + + d. .. _alm_24004__en-us_topic_0191813945_li35944619173633: + + Log in to the Kafka client and run the following commands to check whether all topic data configured for the Kafka source has been consumed. + + **cd /opt/client/Kafka/kafka/bin** + + **./kafka-consumer-groups.sh --bootstrap-server** *Kafka cluster IP address*\ **:21007** **--new-consumer --describe --group example-group1 --command-config** + + **../config/consumer.properties** + + - If yes, no further action is required. + - If no, go to :ref:`1.e `. + + e. .. _alm_24004__en-us_topic_0191813945_li1487713813414: + + Go to the cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + f. Choose **Flume** > **Instances**. + + g. Click the Flume instance of the faulty node and check whether the value of the **Source Speed Metrics** is 0. + + - If yes, go to :ref:`2.a `. + - If no, no further action is required. + +#. Check the status of the network between the Flume source and faulty node. + + a. .. _alm_24004__en-us_topic_0191813945_li39514043173729: + + Check whether the Flume source is the avro type. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`3 `. + + b. Log in to the host where the faulty node resides. Run the following command to switch to user **root**: + + **sudo su - root** + + c. .. _alm_24004__en-us_topic_0191813945_li52369777173729: + + Run the **ping** *Flume source IP address* command to check whether the Flume source can be pinged. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`2.d `. + + d. .. _alm_24004__en-us_topic_0191813945_li27478632173729: + + Contact the network administrator to repair the network. + + e. Wait for a while and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_24004__en-us_topic_0191813945_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24005_data_transmission_by_flume_is_abnormal.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24005_data_transmission_by_flume_is_abnormal.rst new file mode 100644 index 0000000..22d43d0 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-24005_data_transmission_by_flume_is_abnormal.rst @@ -0,0 +1,149 @@ +:original_name: alm_24005.html + +.. _alm_24005: + +ALM-24005 Data Transmission by Flume Is Abnormal +================================================ + +Description +----------- + +The alarm module monitors the capacity of Flume channels. This alarm is generated if the duration that a channel is full or the number of times that a source fails to send data to the channel exceeds the threshold. + +Users can set the threshold as required by modifying the **channelfullcount** parameter. + +This alarm is cleared after the Flume channel space is released. 
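+
+The channel settings that this alarm monitors are defined in the agent's **properties.properties** file, where the **channelfullcount** threshold mentioned above is typically set as well (an assumption; consult the Flume agent configuration description for your version). A quick way to review them on the alarmed node is sketched below; the placeholder must be replaced with the actual Flume installation directory, as in the procedures of the other Flume alarms.
+
+.. code-block::
+
+   # Run as user root on the host named in the alarm.
+   cd <Flume installation directory>/fusioninsight-flume-1.6.0/conf/
+   # Review the channel definitions and, if configured, the channelfullcount threshold.
+   grep -i channel properties.properties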
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +24005 Major Yes +======== ============== ========== + +Parameters +---------- + ++---------------+----------------------------------------------------------------+ +| Parameter | Description | ++===============+================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++---------------+----------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++---------------+----------------------------------------------------------------+ +| ComponentType | Specifies the component type for which the alarm is generated. | ++---------------+----------------------------------------------------------------+ +| ComponentName | Specifies the component name for which the alarm is generated. | ++---------------+----------------------------------------------------------------+ + +Impact on the System +-------------------- + +If the usage of the Flume channel continues to grow, the data transmission time increases. When the usage reaches 100%, the Flume agent process is suspended. + +Possible Causes +--------------- + +- The Flume sink is faulty. +- The network is faulty. + +Procedure +--------- + +#. Check whether the Flume sink is normal. + + a. Check whether the Flume sink is the HDFS type. + + - If yes, go to :ref:`1.b `. + - If no, go to :ref:`1.c `. + + b. .. _alm_24005__en-us_topic_0191813885_li35603802172029: + + On MRS Manager, check whether the ALM-14000 HDFS Service Unavailable alarm is reported and whether the HDFS service is stopped. + + - If the alarm is reported, clear it according to the handling suggestions of ALM-14000 HDFS Service Unavailable; if the HDFS service is stopped, start it. Then go to :ref:`1.g `. + - If no, go to :ref:`1.g `. + + c. .. _alm_24005__en-us_topic_0191813885_li17206137172029: + + Check whether the Flume sink is the HBase type. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`1.g `. + + d. .. _alm_24005__en-us_topic_0191813885_li23959037172029: + + On MRS Manager, check whether the ALM-19000 HBase Service Unavailable alarm is reported and whether the HBase service is stopped. + + - If the alarm is reported, clear it according to the handling suggestions of "ALM-19000 HBase Service Unavailable"; if the HBase service is stopped, start it. Then go to :ref:`1.g `. + - If no, go to :ref:`1.g `. + + e. Check whether the Flume sink is the Kafka type. + + - If yes, go to :ref:`1.f `. + - If no, go to :ref:`1.g `. + + f. .. _alm_24005__en-us_topic_0191813885_li13075641172029: + + On MRS Manager, check whether the ALM-38000 Kafka Service Unavailable alarm is reported and whether the Kafka service is stopped. + + - If the alarm is reported, clear it according to the handling suggestions of "ALM-38000 Kafka Service Unavailable"; if the Kafka service is stopped, start it. Then go to :ref:`1.g `. + - If no, go to :ref:`1.g `. + + g. .. _alm_24005__en-us_topic_0191813885_li1487713813414: + + Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + h. Choose **Flume** > **Instances**. + + i. Click the Flume instance of the faulty node and check whether the value of the **Sink Speed Metrics** is 0. + + - If yes, go to :ref:`2.a `. + - If no, no further action is required. + +#. 
Check the status of the network between the Flume sink and faulty node. + + a. .. _alm_24005__en-us_topic_0191813885_li60707704172341: + + Check whether the Flume sink is the Avro type. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`3 `. + + b. Log in to the host where the faulty node resides. Run the following command to switch to user **root**: + + **sudo su - root** + + c. .. _alm_24005__en-us_topic_0191813885_li31163561172341: + + Run the **ping** *Flume sink IP address* command to check whether the Flume sink can be pinged. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`2.d `. + + d. .. _alm_24005__en-us_topic_0191813885_li35581265172341: + + Contact the network administrator to repair the network. + + e. Wait for a while and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_24005__en-us_topic_0191813885_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-25000_ldapserver_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-25000_ldapserver_service_unavailable.rst new file mode 100644 index 0000000..b35e854 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-25000_ldapserver_service_unavailable.rst @@ -0,0 +1,119 @@ +:original_name: alm_25000.html + +.. _alm_25000: + +ALM-25000 LdapServer Service Unavailable +======================================== + +Description +----------- + +The system checks the LdapServer service status every 30 seconds. This alarm is generated when the active and standby LdapServer services are abnormal. + +This alarm is cleared when either of the LdapServer services restores. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +25000 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +When this alarm is generated, no operation can be performed for the KrbServer users and LdapServer users in the cluster. For example, users, user groups, or roles cannot be added, deleted, or modified, and user passwords cannot be changed on MRS Manager. The authentication for existing users in the cluster is not affected. + +Possible Causes +--------------- + +- The node where the LdapServer service locates is faulty. +- The LdapServer process is abnormal. + +Procedure +--------- + +#. Check whether the nodes where the two SlapdServer instances of the LdapServer service locate are faulty. + + a. Go to the MRS cluster details page and click **Components**. 
+ + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. .. _alm_25000__en-us_topic_0191813909_aalm-25000_mmccppss_id: + + Choose **LdapServer** > **Instances**. Go to the LdapServer instance page to obtain the host name of the node where the two SlapdServer instances reside. + + c. On the **Alarms** page of MRS Manager, check whether the alarm ALM-12006 Node Fault is generated. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`2.a `. + + d. .. _alm_25000__en-us_topic_0191813909_aalm-25000_mmccppss_step_4: + + Check whether the host name in the alarm information is the same as the actual host name in :ref:`1.b `. + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2.a `. + + e. .. _alm_25000__en-us_topic_0191813909_aalm-25000_mmccppss_alarm53003: + + Rectify the fault by following steps provided in ALM-12006 Node Fault. + + f. In the alarm list, check whether the alarm ALM-25000 LdapServer Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. Check whether the LdapServer process is in normal state. + + a. .. _alm_25000__en-us_topic_0191813909_li192616463126: + + Go to the cluster details page and choose **Alarms**. + + b. Check whether ALM-12007 Process Fault is generated. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`3 `. + + c. .. _alm_25000__en-us_topic_0191813909_aalm-25000_mmccppss_step_8: + + Check whether the service name and host name in the alarm are consistent with the LdapServer service and host names. + + - If yes, go to :ref:`2.d `. + - If no, go to :ref:`3 `. + + d. .. _alm_25000__en-us_topic_0191813909_alarm53004: + + Rectify the fault by following steps provided in ALM-12007 Process Fault. + + e. In the alarm list, check whether the alarm ALM-25000 LdapServer Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_25000__en-us_topic_0191813909_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-25004_abnormal_ldapserver_data_synchronization.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-25004_abnormal_ldapserver_data_synchronization.rst new file mode 100644 index 0000000..8bf50d0 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-25004_abnormal_ldapserver_data_synchronization.rst @@ -0,0 +1,140 @@ +:original_name: alm_25004.html + +.. _alm_25004: + +ALM-25004 Abnormal LdapServer Data Synchronization +================================================== + +Description +----------- + +This alarm is generated when LdapServer data on Manager is inconsistent. This alarm is cleared when the data becomes consistent. + +This alarm is generated when LdapServer data in the cluster is inconsistent with LdapServer data on Manager. This alarm is cleared when the data becomes consistent. 
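+
+Step 3 of the procedure verifies data consistency with **ldapsearch**. A standalone sketch of that check is shown below; the IP address and port are placeholders for the values obtained from the alarm location information and the LdapServer service configuration as described in the procedure.
+
+.. code-block::
+
+   # Run as user omm on the node named in the alarm.
+   # 192.168.0.30 and 21750 are placeholders for the LdapServer IP address and port.
+   ldapsearch -H ldaps://192.168.0.30:21750 -x -LLL -b dc=hadoop,dc=com
+   # Any error in the output indicates that the data on this LdapServer instance is damaged.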
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +25004 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +LdapServer data inconsistency occurs because LdapServer data on Manager or in the cluster is damaged. The LdapServer process with damaged data cannot provide services externally, and the authentication functions of Manager and the cluster are affected. + +Possible Causes +--------------- + +- The network of the node where the LdapServer process locates is faulty. +- The LdapServer process is abnormal. +- The OS restart damages data on LdapServer. + +Procedure +--------- + +#. Check whether the network where the LdapServer nodes reside is faulty. + + a. Go to the cluster details page and choose **Alarms**. + + b. Record the IP address of **HostName** in **Location** of the alarm as **IP1** (if multiple alarms exist, record the IP addresses as **IP1**, **IP2**, and **IP3** respectively). + + c. Contact O&M personnel and use PuTTY to log in to the node corresponding to **IP1**. Run the **ping** command on the node to check whether the IP address of the management plane of the active OMS node can be pinged. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`2.a `. + + d. .. _alm_25004__en-us_topic_0191813874_aalm-25004_mmccppss_step3: + + Contact O&M personnel to recover the network and check whether the alarm **ALM-25004 Abnormal LdapServer Data Synchronization** is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Check whether the LdapServer process is in normal state. + + a. .. _alm_25004__en-us_topic_0191813874_li4768141014141: + + Go to the cluster details page and choose **Alarms**. + + b. Check whether ALM-12004 OLdap Resource Is Abnormal is generated for LdapServer. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`2.e `. + + c. .. _alm_25004__en-us_topic_0191813874_aalm-25004_mmccppss_step5: + + Rectify the fault by following steps provided in **ALM-12004 OLdap Resource Is Abnormal**. + + d. Check whether the alarm ALM-25004 Abnormal LdapServer Data Synchronization is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.e `. + + e. .. _alm_25004__en-us_topic_0191813874_aalm-25004_mmccppss_step7: + + On the **Alarms** page of MRS Manager, check whether the alarm ALM-12007 Process Fault of LdapServer is generated. + + - If yes, go to :ref:`2.f `. + - If no, go to :ref:`3.a `. + + f. .. _alm_25004__en-us_topic_0191813874_step8: + + Rectify the fault by following steps provided in ALM-12007 Process Fault. + + g. Check whether the alarm ALM-25004 Abnormal LdapServer Data Synchronization is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3.a `. + +#. Check whether the OS restart damages data on LdapServer. + + a. .. _alm_25004__en-us_topic_0191813874_li1816316468144: + + Go to the cluster details page and choose **Alarms**. + + b. 
Record the IP address of **HostName** in **Location** of the alarm as **IP1** (if multiple alarms exist, record the IP addresses as **IP1**, **IP2**, and **IP3** respectively). Choose **Services** > **LdapServer** > **Service Configuration** and record the LdapServer port number as **PORT**. (If the IP address in the alarm location information is the IP address of the standby OMS node, the default port number is 21750.) + + c. Log in to node **IP1** as user **omm** and run the **ldapsearch -H ldaps://IP1:PORT -x -LLL -b dc=hadoop,dc=com** command (if the IP address is the IP address of the standby OMS node, run the **ldapsearch -H ldaps://IP1:PORT -x -LLL -b dc=hadoop,dc=com** command before running this command). Check whether error information is displayed in the command output. + + - If yes, go to :ref:`3.d `. + - If no, go to :ref:`4 `. + + d. .. _alm_25004__en-us_topic_0191813874_aalm-25004_mmccppss_step12: + + Recover the LdapServer and OMS nodes using backup data before the alarm is generated. For details, see section "Recovering Manager Data" in the *Administrator Guide*. + + .. note:: + + Use the OMS data and LdapServer data backed up at the same time to restore data. Otherwise, the service and operation may fail. To recover data when services run properly, you are advised to manually back up the latest management data and then recover the data. Otherwise, Manager data produced between the backup point in time and the recovery point in time will be lost. + + e. Check whether the alarm ALM-25004 Abnormal LdapServer Data Synchronization is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_25004__en-us_topic_0191813874_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-25500_krbserver_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-25500_krbserver_service_unavailable.rst new file mode 100644 index 0000000..1579bff --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-25500_krbserver_service_unavailable.rst @@ -0,0 +1,112 @@ +:original_name: alm_25500.html + +.. _alm_25500: + +ALM-25500 KrbServer Service Unavailable +======================================= + +Description +----------- + +The system checks the KrbServer service status every 30 seconds. This alarm is generated when the KrbServer service is abnormal. + +This alarm is cleared when the KrbServer service is in normal state. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +25500 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. 
+=========== ======================================================= + +Impact on the System +-------------------- + +When this alarm is generated, no operation can be performed for the KrbServer component in the cluster. The authentication of KrbServer in other components will be affected. The health status of components that depend on KrbServer in the cluster is **Bad**. + +Possible Causes +--------------- + +- The node where the KrbServer service locates is faulty. +- The OLdap service is unavailable. + +Procedure +--------- + +#. Check whether the node where the KrbServer service locates is faulty. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. .. _alm_25500__en-us_topic_0191813953_aalm-25500_mmccppss_id: + + Choose **KrbServer** > **Instances**. Go to the KrbServer instance page and view the host name of the node where the KrbServer service is deployed. + + c. On the **Alarms** page of MRS Manager, check whether the alarm ALM-12006 Node Fault is generated. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`2.a `. + + d. .. _alm_25500__en-us_topic_0191813953_aalm-25500_mmccppss_step_4: + + Check whether the host name in the alarm information is the same as the actual host name in :ref:`1.b `. + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2.a `. + + e. .. _alm_25500__en-us_topic_0191813953_aalm-25500_mmccppss_alarm53003: + + Rectify the fault by following steps provided in ALM-12006 Node Fault. + + f. In the alarm list, check whether the alarm ALM-25500 KrbServer Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. Check whether the OLdap service is unavailable. + + a. .. _alm_25500__en-us_topic_0191813953_li14191191521615: + + Go to the cluster details page and choose **Alarms**. + + b. Check whether ALM-12004 OLdap Resource Is Abnormal is generated. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`3 `. + + c. .. _alm_25500__en-us_topic_0191813953_aalm-25500_mmccppss_step_8: + + Rectify the fault by following steps provided in ALM-12004 OLdap Resource Is Abnormal. + + d. In the alarm list, check whether the alarm ALM-25500 KrbServer Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`3 `. + +#. .. _alm_25500__en-us_topic_0191813953_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26051_storm_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26051_storm_service_unavailable.rst new file mode 100644 index 0000000..5a28c56 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26051_storm_service_unavailable.rst @@ -0,0 +1,131 @@ +:original_name: alm_26051.html + +.. _alm_26051: + +ALM-26051 Storm Service Unavailable +=================================== + +Description +----------- + +The system checks the Storm service availability every 30 seconds. 
This alarm is generated if the Storm service becomes unavailable after all Nimbus nodes in a cluster become abnormal. + +This alarm is cleared after the Storm service recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +26051 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +- The cluster cannot provide the Storm service. +- Users cannot run new Storm tasks. + +Possible Causes +--------------- + +- The Kerberos component is faulty. +- ZooKeeper is faulty or suspended. +- The active and standby Nimbus nodes in the Storm cluster are abnormal. + +Procedure +--------- + +#. Check the Kerberos component status. For clusters without Kerberos authentication, skip this step and go to :ref:`2 `. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. .. _alm_26051__en-us_topic_0191813871_li4574896917592: + + Check whether the health status of the Kerberos service is **Good**. + + - If yes, go to :ref:`2.a `. + - If no, go to :ref:`1.c `. + + c. .. _alm_26051__en-us_topic_0191813871_li22276139175922: + + Rectify the fault by following instructions in ALM-25500 KrbServer Service Unavailable. + + d. Perform :ref:`1.b ` again. + +#. .. _alm_26051__en-us_topic_0191813871_li59618494175936: + + Check the ZooKeeper component status. + + a. .. _alm_26051__en-us_topic_0191813871_li384738318010: + + Check whether the health status of the ZooKeeper service is **Good**. + + - If yes, go to :ref:`3.a `. + - If no, go to :ref:`2.b `. + + b. .. _alm_26051__en-us_topic_0191813871_li398384891819: + + If the ZooKeeper service is stopped, start it. For other problems, follow the instructions in ALM-13000 ZooKeeper Service Unavailable. + + c. Perform :ref:`2.a ` again. + +#. Check the status of the active and standby Nimbus nodes. + + a. .. _alm_26051__en-us_topic_0191813871_li2005716918338: + + Choose **Components** > **Storm** > **Nimbus**. + + b. In **Role**, check whether only one active Nimbus node exists. + + - If yes, go to :ref:`4 `. + - If no, go to :ref:`3.c `. + + c. .. _alm_26051__en-us_topic_0191813871_li4603773018356: + + Select the two Nimbus instances and choose **More** > **Restart Instance**. Check whether the restart is successful. + + - If yes, go to :ref:`3.d `. + - If no, go to :ref:`4 `. + + d. .. _alm_26051__en-us_topic_0191813871_li632054418412: + + Log in to MRS Manager again and choose **Components** > **Storm** > **Nimbus**. Check whether the health status of Nimbus is **Good**. + + - If yes, go to :ref:`3.e `. + - If no, go to :ref:`4 `. + + e. .. _alm_26051__en-us_topic_0191813871_li5966586218421: + + Wait 30 seconds and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_26051__en-us_topic_0191813871_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. 
Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26052_number_of_available_supervisors_in_storm_is_lower_than_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26052_number_of_available_supervisors_in_storm_is_lower_than_the_threshold.rst new file mode 100644 index 0000000..96441f8 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26052_number_of_available_supervisors_in_storm_is_lower_than_the_threshold.rst @@ -0,0 +1,94 @@ +:original_name: alm_26052.html + +.. _alm_26052: + +ALM-26052 Number of Available Supervisors in Storm Is Lower Than the Threshold +============================================================================== + +Description +----------- + +The system checks the number of supervisors every 60 seconds and compares it with the threshold. This alarm is generated if the number of supervisors is lower than the threshold. + +To modify the threshold, users can choose **System** > **Threshold Configuration** on MRS Manager. + +This alarm is cleared if the number of supervisors is greater than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +26052 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +- Existing tasks in the cluster cannot be executed. +- The cluster can receive new Storm tasks but cannot execute them. + +Possible Causes +--------------- + +Supervisors are abnormal in the cluster. + +Procedure +--------- + +#. Check the supervisor status. + + a. Go to the cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Choose **Storm** > **Supervisor**. + + c. In **Role**, check whether the cluster has supervisor instances that are in the **Faulty** or **Recovering** state. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`2 `. + + d. .. 
_alm_26052__en-us_topic_0191813898_li65587069184020: + + Select the supervisor instances that are in the **Faulty** or **Recovering** state and choose **More** > **Restart Instance**. + + - If yes, go to :ref:`1.e `. + - If the restart fails, go to :ref:`2 `. + + e. .. _alm_26052__en-us_topic_0191813898_li52566748184020: + + Wait 30 seconds and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_26052__en-us_topic_0191813898_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26053_slot_usage_of_storm_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26053_slot_usage_of_storm_exceeds_the_threshold.rst new file mode 100644 index 0000000..ab81011 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26053_slot_usage_of_storm_exceeds_the_threshold.rst @@ -0,0 +1,124 @@ +:original_name: alm_26053.html + +.. _alm_26053: + +ALM-26053 Slot Usage of Storm Exceeds the Threshold +=================================================== + +Description +----------- + +The system checks the slot usage of Storm every 60 seconds and compares it with the threshold. This alarm is generated if the slot usage exceeds the threshold. + +To modify the threshold, users can choose **System** > **Threshold Configuration** on MRS Manager. + +This alarm is cleared if the slot usage is lower than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +26053 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Users cannot run new Storm tasks. + +Possible Causes +--------------- + +- Supervisors are abnormal in the cluster. +- Supervisors are normal but have poor processing capability. + +Procedure +--------- + +#. Check the supervisor status. + + a. Go to the cluster details page and click **Components**. + + .. 
note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Choose **Storm** > **Supervisor**. + + c. In **Role**, check whether the cluster has supervisor instances that are in the **Faulty** or **Recovering** state. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`2.a ` or :ref:`3.a `. + + d. .. _alm_26053__en-us_topic_0191813942_li6671657118374: + + Select the supervisor instances that are in the **Faulty** or **Recovering** state and choose **More** > **Restart Instance**. + + - If the restart is successful, go to :ref:`1.e `. + - If the restart fails, go to :ref:`4 `. + + e. .. _alm_26053__en-us_topic_0191813942_li5198268318374: + + Wait a moment and then check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2.a ` or :ref:`3.a `. + +#. Increase the number of slots for the supervisors. + + a. .. _alm_26053__en-us_topic_0191813942_li142406612228: + + Go to the cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Choose **Storm** > **Supervisor** > **Service Configuration**, and set **Type** to **All**. + + c. Increase the value of **supervisor.slots.ports** to increase the number of slots for each supervisor. Then restart the instances. + + d. Wait a moment and then check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. Expand the capacity of the supervisors. + + a. .. _alm_26053__en-us_topic_0191813942_li22838295183633: + + Add nodes. + + b. Wait a moment and then check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_26053__en-us_topic_0191813942_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26054_heap_memory_usage_of_storm_nimbus_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26054_heap_memory_usage_of_storm_nimbus_exceeds_the_threshold.rst new file mode 100644 index 0000000..e8cab46 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-26054_heap_memory_usage_of_storm_nimbus_exceeds_the_threshold.rst @@ -0,0 +1,88 @@ +:original_name: alm_26054.html + +.. _alm_26054: + +ALM-26054 Heap Memory Usage of Storm Nimbus Exceeds the Threshold +================================================================= + +Description +----------- + +The system checks the heap memory usage of Storm Nimbus every 30 seconds and compares it with the threshold. This alarm is generated if the heap memory usage exceeds the threshold (80% by default). + +To modify the threshold, users can choose **System** > **Threshold Configuration** > **Service** > **Storm** on MRS Manager. + +This alarm is cleared if the heap memory usage is lower than or equal to the threshold.
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +26054 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Frequent memory garbage collection or memory overflow may occur, affecting submission of Storm services. + +Possible Causes +--------------- + +The heap memory usage is high or the heap memory is improperly allocated. + +Procedure +--------- + +#. Check the heap memory usage. + + a. Go to the cluster details page and choose **Alarms**. + + b. Choose **ALM-26054 Heap Memory Usage of Storm Nimbus Exceeds the Threshold > Location**. Query the **HostName** of the alarmed instance. + + c. Choose **Components > Storm > Instances > Nimbus (corresponding to the HostName of the alarmed instance) > Customize > Heap Memory Usage of Nimbus**. + + d. Check whether the heap memory usage of Nimbus has reached the threshold (80%). + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_26054__en-us_topic_0191813940_li3532012320227: + + Adjust the heap memory. + + Choose **Components > Storm > Service Configuration**, and set **Type** to **All**. Choose **Nimbus** > **System**. Increase the value of **-Xmx** in **NIMBUS_GC_OPTS**. Click **Save Configuration**. Select **Restart the affected services or instances** and click **OK**. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_26054__en-us_topic_0191813940_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-27001_dbservice_is_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-27001_dbservice_is_unavailable.rst new file mode 100644 index 0000000..db15503 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-27001_dbservice_is_unavailable.rst @@ -0,0 +1,159 @@ +:original_name: alm_27001.html + +.. 
_alm_27001: + +ALM-27001 DBService Is Unavailable +================================== + +Description +----------- + +The alarm module checks the DBService status every 30 seconds. This alarm is generated when the system detects that DBService is unavailable. + +This alarm is cleared when DBService recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +27001 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The database service is unavailable and cannot provide data import and query functions for upper-layer services, which results in service exceptions. + +Possible Causes +--------------- + +- The floating IP address does not exist. +- There is no active DBServer instance. +- The active and standby DBServer processes are abnormal. + +Procedure +--------- + +#. Check whether the floating IP address exists in the cluster environment. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. Choose **DBService** > **Instances**. + + c. Check whether the active instance exists. + + - If yes, go to :ref:`1.d `. + - If no, go to :ref:`2.a `. + + d. .. _alm_27001__en-us_topic_0191813879_step111: + + Select the active DBServer instance and record the IP address. + + e. Log in to the host with the preceding IP address and run the **ifconfig** command to check whether the DBService floating IP address exists on the node. + + - If yes, go to :ref:`1.f `. + - If no, go to :ref:`2.a `. + + f. .. _alm_27001__en-us_topic_0191813879_checkfloatip: + + Run the **ping** *floating IP address* command to check whether the DBService floating IP address can be pinged. + + - If yes, go to :ref:`1.g `. + - If no, go to :ref:`2.a `. + + g. .. _alm_27001__en-us_topic_0191813879_findfloatip: + + Log in to the host where the DBService floating IP address is located and run the **ifconfig** *interface* **down** command to delete the floating IP address. + + h. Choose **Components** > **DBService** > **More** > **Restart Service** to restart DBService and check whether DBService is started successfully. + + - If yes, go to :ref:`1.i `. + - If no, go to :ref:`2.a `. + + i. .. _alm_27001__en-us_topic_0191813879_resumealarm1: + + Wait about 2 minutes and check whether the alarm is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`Step 13 `. + +#. Check the status of the active DBServer instance. + + a. .. _alm_27001__en-us_topic_0191813879_step88: + + Select the DBServer instance whose role status is abnormal and record the IP address. + + b. On the **Alarms** page, check whether ALM-12007 Process Fault occurs in the DBServer instance on the host that corresponds to the IP address. + + - If yes, go to :ref:`2.c `. + - If no, go to :ref:`4 `. + + c. .. _alm_27001__en-us_topic_0191813879_alarm27001: + + Rectify the fault by following steps provided in ALM-12007 Process Fault. + + d. 
Wait about 5 minutes and check whether the alarm is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. Check the status of the active and standby DBServers. + + a. .. _alm_27001__en-us_topic_0191813879_loginact: + + Log in to the host where the DBService floating IP address is located, run the **sudo su - root** and **su - omm** commands to switch to user **omm**, and run the **cd ${BIGDATA_HOME}/FusionInsight/dbservice/** command to go to the DBService installation directory. + + b. Run the **sh sbin/status-dbserver.sh** command to view the status of the active and standby HA processes of DBService. Determine whether the status can be viewed successfully. + + - If yes, go to :ref:`3.c `. + - If no, go to :ref:`4 `. + + c. .. _alm_27001__en-us_topic_0191813879_loginactive: + + Check whether the active and standby HA processes are abnormal. + + - If yes, go to :ref:`3.d `. + - If no, go to :ref:`4 `. + + d. .. _alm_27001__en-us_topic_0191813879_recoverdb: + + Choose **Components** > **DBService** > **More** > **Restart Service** to restart DBService and check whether DBService is started successfully. + + - If yes, go to :ref:`3.e `. + - If no, go to :ref:`4 `. + + e. .. _alm_27001__en-us_topic_0191813879_resumealarm: + + Wait about 2 minutes and check whether the alarm is cleared in the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_27001__en-us_topic_0191813879_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-27003_dbservice_heartbeat_interruption_between_the_active_and_standby_nodes.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-27003_dbservice_heartbeat_interruption_between_the_active_and_standby_nodes.rst new file mode 100644 index 0000000..644e6d5 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-27003_dbservice_heartbeat_interruption_between_the_active_and_standby_nodes.rst @@ -0,0 +1,91 @@ +:original_name: alm_27003.html + +.. _alm_27003: + +ALM-27003 DBService Heartbeat Interruption Between the Active and Standby Nodes +=============================================================================== + +Description +----------- + +This alarm is generated when the active or standby DBService node does not receive heartbeat messages from the peer node. + +This alarm is cleared when the heartbeat recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +27003 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------------+---------------------------------------------------------+ +| Parameter | Description | ++=========================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| Local DBService HA Name | Specifies a local DBService HA. | ++-------------------------+---------------------------------------------------------+ +| Peer DBService HA Name | Specifies a peer DBService HA. | ++-------------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +During the DBService heartbeat interruption, only one node can provide the service. If this node is faulty, no standby node is available for failover and the service is unavailable. + +Possible Causes +--------------- + +The link between the active and standby DBService nodes is abnormal. + +Procedure +--------- + +#. Check whether the network between the active and standby DBService servers is in normal state. + + a. Go to the cluster details page and choose **Alarms**. + + b. In the alarm list, locate the row that contains the alarm and view the IP address of the standby DBService server in the alarm details. + + c. Log in to the active DBService server. + + d. Run the **ping** *heartbeat IP address of the standby DBService* command to check whether the standby DBService server is reachable. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`1.e `. + + e. .. _alm_27003__en-us_topic_0191813956_alm-27002_2_mmccppss_step2: + + Contact the network administrator to check whether the network is faulty. + + - If yes, go to :ref:`1.f `. + - If no, go to :ref:`2 `. + + f. .. _alm_27003__en-us_topic_0191813956_alm-27002_2_mmccppss_s4: + + Rectify the network fault and check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_27003__en-us_topic_0191813956_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-27004_data_inconsistency_between_active_and_standby_dbservices.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-27004_data_inconsistency_between_active_and_standby_dbservices.rst new file mode 100644 index 0000000..6d7c396 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-27004_data_inconsistency_between_active_and_standby_dbservices.rst @@ -0,0 +1,154 @@ +:original_name: alm_27004.html + +.. _alm_27004: + +ALM-27004 Data Inconsistency Between Active and Standby DBServices +================================================================== + +Description +----------- + +The system checks the data synchronization status between the active and standby DBServices every 10 seconds. This alarm is generated when the synchronization status cannot be queried for six consecutive times or when the synchronization status is abnormal. + +This alarm is cleared when the synchronization is in normal state. 
+ +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +27004 Critical Yes +======== ============== ========== + +Parameters +---------- + ++-------------------------+---------------------------------------------------------+ +| Parameter | Description | ++=========================+=========================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------------+---------------------------------------------------------+ +| Local DBService HA Name | Specifies a local DBService HA. | ++-------------------------+---------------------------------------------------------+ +| Peer DBService HA Name | Specifies a peer DBService HA. | ++-------------------------+---------------------------------------------------------+ +| SYNC_PERSENT | Synchronization percentage. | ++-------------------------+---------------------------------------------------------+ + +Impact on the System +-------------------- + +When data is not synchronized between the active and standby DBServices, the data may be lost or abnormal if the active instance becomes abnormal. + +Possible Causes +--------------- + +- The network between the active and standby nodes is unstable. +- The standby DBService is abnormal. +- The disk space of the standby node is full. + +Procedure +--------- + +#. Check whether the network between the active and standby nodes is in normal state. + + a. Go to the cluster details page and choose **Alarms**. + + b. In the alarm list, locate the row that contains the alarm and view the IP address of the standby DBService node in the alarm details. + + c. Log in to the active DBService node. + + d. Run the **ping** *heartbeat IP address of the standby DBService* command to check whether the standby DBService node is reachable. + + - If yes, go to :ref:`2.a `. + - If no, go to :ref:`1.e `. + + e. .. _alm_27004__en-us_topic_0191813894_alm-27002_3_mmccppss_step2: + + Contact the O&M personnel to check whether the network is faulty. + + - If yes, go to :ref:`1.f `. + - If no, go to :ref:`2.a `. + + f. .. _alm_27004__en-us_topic_0191813894_alm-27002_3_mmccppss_s4: + + Rectify the network fault and check whether the alarm is cleared from the alarm list. + + - If yes, no further action is required. + - If no, go to :ref:`2.a `. + +#. Check whether the standby DBService is in normal state. + + a. .. _alm_27004__en-us_topic_0191813894_alm-27002_3_mmccppss_step6: + + Log in to the standby DBService node. + + b. Run the following commands to switch the user: + + **sudo su - root** + + **su - omm** + + c. Go to the **${DBSERVER_HOME}/sbin** directory and run the **./status-dbserver.sh** command to check whether the GaussDB resource status of the standby DBService is in normal state. In the command output, check whether the following information is displayed in the row where **ResName** is **gaussDB**: + + Example: + + .. code-block:: + + 10_10_10_231 gaussDB Standby_normal Normal Active_standby + + - If yes, go to :ref:`3.a `. + - If no, go to :ref:`4 `. + +#. Check whether the disk space of the standby node is insufficient. + + a. .. 
_alm_27004__en-us_topic_0191813894_alm-27002_3_mmccppss_step9: + + Log in to the standby DBService node. + + b. Run the following commands to switch the user: + + **sudo su - root** + + **su - omm** + + c. Go to the **${DBSERVER_HOME}** directory, and run the following commands to obtain the DBService data directory: + + **cd ${DBSERVER_HOME}** + + **source .dbservice_profile** + + **echo ${DBSERVICE_DATA_DIR}** + + d. Run the **df -h** command to check the system disk partition usage. + + e. Check whether the DBService data directory space is full. + + - If yes, go to :ref:`3.f `. + - If no, go to :ref:`4 `. + + f. .. _alm_27004__en-us_topic_0191813894_alm-27002_3_mmccppss_step14: + + Perform upgrade and expand capacity. + + g. After capacity expansion, wait 2 minutes and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_27004__en-us_topic_0191813894_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-28001_spark_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-28001_spark_service_unavailable.rst new file mode 100644 index 0000000..890d006 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-28001_spark_service_unavailable.rst @@ -0,0 +1,88 @@ +:original_name: alm_28001.html + +.. _alm_28001: + +ALM-28001 Spark Service Unavailable +=================================== + +Description +----------- + +The system checks the Spark service status every 30 seconds. This alarm is generated when the Spark service is unavailable. + +This alarm is cleared when the Spark service recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +28001 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The Spark tasks submitted by users fail to be executed. + +Possible Causes +--------------- + +- The KrbServer service is abnormal. +- The LdapServer service is abnormal. +- The ZooKeeper service is abnormal. +- The HDFS service is abnormal. +- The Yarn service is abnormal. +- The corresponding Hive service is abnormal. + +Procedure +--------- + +#. Check whether service unavailability alarms exist in services that Spark depends on. + + a. Go to the MRS cluster details page and choose **Alarms**. + + b. Check whether the following alarms exist in the alarm list: + + #. ALM-25500 KrbServer Service Unavailable + #. ALM-25000 LdapServer Service Unavailable + #. ALM-13000 ZooKeeper Service Unavailable + #. 
ALM-14000 HDFS Service Unavailable + #. ALM-18000 Yarn Service Unavailable + #. ALM-16004 Hive Service Unavailable + + - If yes, go to :ref:`1.c `. + - If no, go to :ref:`2 `. + + c. .. _alm_28001__en-us_topic_0191813883_li645282320039: + + Handle the alarms based on the troubleshooting methods provided in the alarm help. + + After the alarm is cleared, wait a few minutes and check whether the alarm ALM-28001 Spark Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_28001__en-us_topic_0191813883_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-38000_kafka_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-38000_kafka_service_unavailable.rst new file mode 100644 index 0000000..275cae8 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-38000_kafka_service_unavailable.rst @@ -0,0 +1,130 @@ +:original_name: alm_38000.html + +.. _alm_38000: + +ALM-38000 Kafka Service Unavailable +=================================== + +Description +----------- + +The system checks the Kafka service availability every 30 seconds. This alarm is generated if the Kafka service becomes unavailable. + +This alarm is cleared after the Kafka service recovers. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +38000 Critical Yes +======== ============== ========== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The cluster cannot provide the Kafka service and users cannot run new Kafka tasks. + +Possible Causes +--------------- + +- The KrbServer component is faulty. +- The ZooKeeper component is faulty or fails to respond. +- The Broker node in the Kafka cluster is abnormal. + +Procedure +--------- + +#. Check the KrbServer component status. For clusters without Kerberos authentication, skip this step and go to :ref:`2 `. + + a. Go to the MRS cluster details page and click **Components**. + + .. note:: + + For MRS 1.7.2 or earlier, log in to MRS Manager and click **Services**. + + b. .. _alm_38000__en-us_topic_0191813970_li1071286918299: + + Check whether the health status of the KrbServer service is **Good**. + + - If yes, go to :ref:`2.a `. + - If no, go to :ref:`1.c `. + + c. .. _alm_38000__en-us_topic_0191813970_li50060872182922: + + Rectify the fault by following instructions in ALM-25500 KrbServer Service Unavailable. + + d. Perform :ref:`1.b ` again. + +#. .. _alm_38000__en-us_topic_0191813970_li21507667181241: + + Check the ZooKeeper component status. + + a. ..
_alm_38000__en-us_topic_0191813970_li22712539182948: + + Check whether the health status of the ZooKeeper service is **Good**. + + - If yes, go to :ref:`3.a `. + - If no, go to :ref:`2.b `. + + b. .. _alm_38000__en-us_topic_0191813970_li35295745182948: + + If the ZooKeeper service is stopped, start it. For other problems, follow the instructions in ALM-13000 ZooKeeper Service Unavailable. + + c. Perform :ref:`2.a ` again. + +#. Check the Broker status. + + a. .. _alm_38000__en-us_topic_0191813970_li6551802183028: + + Choose **Components** > **Kafka** > **Broker**. + + b. In **Role**, check whether all instances are normal. + + - If yes, go to :ref:`3.d `. + - If no, go to :ref:`3.c `. + + c. .. _alm_38000__en-us_topic_0191813970_li7614495183028: + + Select all instances of Broker and choose **More** > **Restart Instance**. + + - If the restart is successful, go to :ref:`3.d `. + - If the restart fails, go to :ref:`4 `. + + d. .. _alm_38000__en-us_topic_0191813970_li54013684183028: + + Choose **Components > Kafka**. Check whether the health status of Kafka is **Good**. + + - If yes, go to :ref:`3.e `. + - If no, go to :ref:`4 `. + + e. .. _alm_38000__en-us_topic_0191813970_li11571314183028: + + Wait 30 seconds and check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`4 `. + +#. .. _alm_38000__en-us_topic_0191813970_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-38001_insufficient_kafka_disk_space.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-38001_insufficient_kafka_disk_space.rst new file mode 100644 index 0000000..7566ab6 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-38001_insufficient_kafka_disk_space.rst @@ -0,0 +1,171 @@ +:original_name: alm_38001.html + +.. _alm_38001: + +ALM-38001 Insufficient Kafka Disk Space +======================================= + +Description +----------- + +The system checks the Kafka disk usage every 60 seconds and compares it with the threshold. This alarm is generated if the disk usage exceeds the threshold. + +To modify the threshold, users can choose **System** > **Threshold Configuration** on MRS Manager. + +This alarm is cleared if the Kafka disk usage is lower than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +38001 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. 
| ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| PartitionName | Specifies the disk partition where the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Kafka fails to write data to the disks. + +Possible Causes +--------------- + +- The Kafka disk configurations (such as disk count and disk size) are insufficient for the data volume. +- The data retention period is long and historical data occupies large space. +- Services are improperly planned. As a result, data is unevenly distributed and some disks are full. + +Procedure +--------- + +#. Go to the MRS cluster details page and choose **Alarms**. + +#. .. _alm_38001__en-us_topic_0191813921_li13769123214531: + + In the alarm list, click the alarm and view the **HostName** and **PartitionName** of the alarm in **Location** of **Alarm Details**. + +#. On the **Hosts** page, click the host name obtained in :ref:`2 `. + +#. Check whether the **Disk** area contains the **PartitionName** of the alarm. + + - If yes, go to :ref:`5 `. + - If no, manually clear the alarm and no further action is required. + +#. .. _alm_38001__en-us_topic_0191813921_li15769133210538: + + In the **Disk** area, check whether the usage of the alarmed partition has reached 100%. + + - If yes, go to :ref:`6 `. + - If no, go to :ref:`8 `. + +#. .. _alm_38001__en-us_topic_0191813921_li16769832165314: + + In **Instance**, choose **Broker > Instance Configuration**. On the **Instance Configuration** page that is displayed, set **Type** to **All** and query the data directory parameter **log.dirs**. + +#. Choose **Components > Kafka > Instances**. On the **Kafka Instance** page that is displayed, stop the Broker instance corresponding to :ref:`2 `. Then log in to the alarmed node and manually delete the data directory in :ref:`6 `. After all subsequent operations are complete, start the Broker instance. + +#. .. _alm_38001__en-us_topic_0191813921_li1476912324536: + + Choose **Components > Kafka > Service Configuration**. The **Kafka Configuration** page is displayed. + +#. Check whether **disk.adapter.enable** is **true**. + + - If yes, go to :ref:`11 `. + - If no, change the value to **true** and go to :ref:`10 `. + +#. .. _alm_38001__en-us_topic_0191813921_li37691432185316: + + Check whether the **adapter.topic.min.retention.hours** parameter, indicating the minimum data retention period, is properly configured. + + - If yes, go to :ref:`12 `. + - If no, set it to a proper value and go to :ref:`12 `. + + .. note:: + + If the retention period cannot be adjusted for certain topics, the topics can be added to **disk.adapter.topic.blacklist**. + +#. .. _alm_38001__en-us_topic_0191813921_li19769532105319: + + Wait 10 minutes and check whether the disk usage is reduced. + + - If yes, wait until the alarm is cleared. + - If no, go to :ref:`12 `. + +#. .. 
_alm_38001__en-us_topic_0191813921_li076953295311: + + Go to the **Kafka Topic Monitor** page and query the data retention period configured for Kafka. Determine whether the retention period needs to be shortened based on service requirements and data volume. + + - If yes, go to :ref:`13 `. + - If no, go to :ref:`14 `. + +#. .. _alm_38001__en-us_topic_0191813921_li10769173213539: + + Find the topics with great data volumes based on the disk partition obtained in :ref:`2 `. Log in to the Kafka client and manually shorten the data retention period for these topics using the following command: + + **kafka-topics.sh --zookeeper** *ZooKeeper address:24002/kafka* **--alter --topic** *Topic name* **--config retention.ms=**\ *Retention period* + +#. .. _alm_38001__en-us_topic_0191813921_li1176913210535: + + Check whether partitions are properly configured for topics. For example, if the number of partitions for a topic with a large data volume is smaller than the number of disks, data may be unevenly distributed to the disks and the usage of some disks will reach the upper limit. + + .. note:: + + To identify topics with great data volumes, log in to the relevant nodes that are obtained in :ref:`2 `, go to the data directory (the directory before **log.dirs** in :ref:`6 ` is modified), and check the disk space occupied by the partitions of the topics. + + - If the partitions are improperly configured, go to :ref:`15 `. + - If the partitions are properly configured, go to :ref:`16 `. + +#. .. _alm_38001__en-us_topic_0191813921_li137701132145312: + + On the Kafka client, add partitions to the topics. + + **kafka-topics.sh --zookeeper** *ZooKeeper address:24002/kafka* **--alter --topic** *Topic name* **--partitions=**\ *Number of new partitions* + + .. note:: + + It is advised to set the number of new partitions to a multiple of the number of Kafka disks. + + This operation may not quickly clear the alarm. Data will be gradually balanced among the disks. + +#. .. _alm_38001__en-us_topic_0191813921_li9770103214530: + + Check whether the cluster capacity needs to be expanded. + + - If yes, add nodes to the cluster and go to :ref:`17 `. + - If no, go to :ref:`17 `. + +#. .. _alm_38001__en-us_topic_0191813921_li4770432185318: + + Wait a moment and then check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`18 `. + +#. .. _alm_38001__en-us_topic_0191813921_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-38002_heap_memory_usage_of_kafka_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-38002_heap_memory_usage_of_kafka_exceeds_the_threshold.rst new file mode 100644 index 0000000..4de0c10 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-38002_heap_memory_usage_of_kafka_exceeds_the_threshold.rst @@ -0,0 +1,84 @@ +:original_name: alm_38002.html + +.. 
_alm_38002: + +ALM-38002 Heap Memory Usage of Kafka Exceeds the Threshold +========================================================== + +Description +----------- + +The system checks the heap memory usage of Kafka every 30 seconds. This alarm is generated if the heap memory usage of Kafka exceeds the threshold (80%). + +This alarm is cleared if the heap memory usage is lower than the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +38002 Major Yes +======== ============== ========== + +Parameters +---------- + ++-------------------+-------------------------------------------------------------------------------------+ +| Parameter | Description | ++===================+=====================================================================================+ +| ServiceName | Specifies the service for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| RoleName | Specifies the role for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| HostName | Specifies the host for which the alarm is generated. | ++-------------------+-------------------------------------------------------------------------------------+ +| Trigger Condition | Generates an alarm when the actual indicator value exceeds the specified threshold. | ++-------------------+-------------------------------------------------------------------------------------+ + +Impact on the System +-------------------- + +Memory overflow may occur, causing service crashes. + +Possible Causes +--------------- + +The heap memory usage is high or the heap memory is improperly allocated. + +Procedure +--------- + +#. Check the heap memory usage. + + a. Go to the MRS cluster details page and choose **Alarms**. + + b. Choose **ALM-38002 Kafka Heap Memory Usage Exceeds the Threshold** > **Location**. Query the IP address of the alarmed instance. + + c. Choose **Components > Kafka > Instance > Broker (corresponding to the IP address of the alarmed instance) > Customize > Kafka Heap Memory Resource Percentage** to check the heap memory usage. + + d. Check whether the heap memory usage of Kafka has reached the threshold (80%). + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_38002__en-us_topic_0191813873_li1011493181634: + + Choose **Components** > **Kafka** > **Service Configuration** > **All** > **Broker** > **Environment Variables**. Increase the value of **KAFKA_HEAP_OPTS** as required. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_38002__en-us_topic_0191813873_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. 
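+
+As a reference for adjusting **KAFKA_HEAP_OPTS** in step 1.e, the following is only an illustrative sketch of what an increased value might look like; the 6 GB size is an assumed example, and the actual value must be chosen based on the memory available on the Broker node and the service load:
+
+.. code-block::
+
+   -Xmx6G -Xms6G
+
+Setting **-Xms** to the same value as **-Xmx** is a common choice because it prevents the heap from being resized after the Broker restarts.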
+ +Related Information +------------------- + +N/A diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43001_spark_service_unavailable.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43001_spark_service_unavailable.rst new file mode 100644 index 0000000..33da5f2 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43001_spark_service_unavailable.rst @@ -0,0 +1,88 @@ +:original_name: alm_43001.html + +.. _alm_43001: + +ALM-43001 Spark Service Unavailable +=================================== + +Description +----------- + +The system checks the Spark service status every 60 seconds. This alarm is generated when the Spark service is unavailable. + +This alarm is cleared when the Spark service recovers. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +43001 Critical Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +The Spark tasks submitted by users fail to be executed. + +Possible Causes +--------------- + +- The KrbServer service is abnormal. +- The LdapServer service is abnormal. +- ZooKeeper is abnormal. +- The HDFS service is abnormal. +- The Yarn service is abnormal. +- The corresponding Hive service is abnormal. + +Procedure +--------- + +#. Check whether service unavailability alarms exist in services that Spark depends on. + + a. Go to the cluster details page and choose **Alarms**. + + b. Check whether the following alarms exist in the alarm list: + + #. ALM-25500 KrbServer Service Unavailable + #. ALM-25000 LdapServer Service Unavailable + #. ALM-13000 ZooKeeper Service Unavailable + #. ALM-14000 HDFS Service Unavailable + #. ALM-18000 Yarn Service Unavailable + #. ALM-16004 Hive Service Unavailable + + - If yes, go to :ref:`1.c `. + - If no, go to :ref:`2 `. + + c. .. _alm_43001__en-us_topic_0191813893_li1257801171836: + + Handle the alarm according to the alarm help. + + After the alarm is cleared, wait a few minutes and check whether the alarm ALM-43001 Spark Service Unavailable is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_43001__en-us_topic_0191813893_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__.
+ +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43006_heap_memory_usage_of_the_jobhistory_process_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43006_heap_memory_usage_of_the_jobhistory_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..0ba6184 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43006_heap_memory_usage_of_the_jobhistory_process_exceeds_the_threshold.rst @@ -0,0 +1,80 @@ +:original_name: alm_43006.html + +.. _alm_43006: + +ALM-43006 Heap Memory Usage of the JobHistory Process Exceeds the Threshold +=========================================================================== + +Description +----------- + +The system checks the JobHistory process status every 30 seconds. The alarm is generated when the heap memory usage of the JobHistory process exceeds the threshold (90% of the maximum memory). + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +43006 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the available JobHistory process heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The heap memory of the JobHistory process is overused or the heap memory is inappropriately allocated. + +Procedure +--------- + +#. Check the heap memory usage. + + a. Go to the cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **43006** and view the IP address and role name of the instance in **Location**. + + c. Choose **Components** > **Spark** > **Instance** > **JobHistory** (IP address of the instance for which the alarm is generated) > **Customize** > **Heap Memory Statistics of the JobHistory Process**. Click **OK** to view the heap memory usage. + + d. Check whether the used heap memory of JobHistory reaches 90% of the maximum heap memory specified for JobHistory. + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_43006__en-us_topic_0191813968_li1011493181634: + + Choose **Components** > **Spark** > **Service Configuration**. Set **Type** to **All** and choose **JobHistory** > **Default**. Increase the value of **SPARK_DAEMON_MEMORY** as required. + + f. Click **Save Configuration** and select **Restart the affected services or instances**. Click **OK**. + + g. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_43006__en-us_topic_0191813968_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. 
Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43007_non-heap_memory_usage_of_the_jobhistory_process_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43007_non-heap_memory_usage_of_the_jobhistory_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..6082e51 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43007_non-heap_memory_usage_of_the_jobhistory_process_exceeds_the_threshold.rst @@ -0,0 +1,78 @@ +:original_name: alm_43007.html + +.. _alm_43007: + +ALM-43007 Non-Heap Memory Usage of the JobHistory Process Exceeds the Threshold +=============================================================================== + +Description +----------- + +The system checks the JobHistory process status every 30 seconds. The alarm is generated when the non-heap memory usage of the JobHistory process exceeds the threshold (90% of the maximum memory). + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +43007 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the available JobHistory process non-heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The non-heap memory of the JobHistory process is overused or the non-heap memory is inappropriately allocated. + +Procedure +--------- + +#. Check non-heap memory usage. + + a. Go to the cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **43007** and view the IP address and role name of the instance in **Location**. + + c. Choose **Components** > **Spark** > **Instance** > **JobHistory** (IP address of the instance for which the alarm is generated) > **Customize** > **Non-Heap Memory Statistics of the JobHistory Process**. Click **OK** to view the non-heap memory usage. + + d. Check whether the non-heap memory usage of JobHistory has reached the threshold (90% of the maximum memory). + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_43007__en-us_topic_0191813875_li1011493181634: + + Choose **Components** > **Spark** > **Service Configuration**. Set **Type** to **All** and choose **JobHistory** > **Default**. Increase the value of **-XX:MaxMetaspaceSize** in **SPARK_DAEMON_JAVA_OPTS** as required. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_43007__en-us_topic_0191813875_li572522141314: + + Collect fault information. + + a. 
On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43008_direct_memory_usage_of_the_jobhistory_process_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43008_direct_memory_usage_of_the_jobhistory_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..f3120fd --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43008_direct_memory_usage_of_the_jobhistory_process_exceeds_the_threshold.rst @@ -0,0 +1,78 @@ +:original_name: alm_43008.html + +.. _alm_43008: + +ALM-43008 Direct Memory Usage of the JobHistory Process Exceeds the Threshold +============================================================================= + +Description +----------- + +The system checks the JobHistory process status every 30 seconds. The alarm is generated when the direct memory usage of the JobHistory process exceeds the threshold (90% of the maximum memory). + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +43008 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the available JobHistory process direct memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct memory of the JobHistory process is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +#. Check the direct memory usage. + + a. Go to the cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **43008** and view the IP address and role name of the instance in **Location**. + + c. Choose **Components** > **Spark** > **Instance** > **JobHistory** (IP address of the instance for which the alarm is generated) > **Customize** > **Direct Memory Statistics of the JobHistory Process**. Click **OK** to view the direct memory usage. + + d. Check whether the direct memory usage of the JobHistory process has reached the threshold (90% of the maximum direct memory). + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_43008__en-us_topic_0191813884_li1011493181634: + + Choose **Components** > **Spark** > **Service Configuration**. Set **Type** to **All** and choose **JobHistory** > **Default**. Increase the value of **-XX:MaxDirectMemorySize** in **SPARK_DAEMON_JAVA_OPTS** as required. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. 
_alm_43008__en-us_topic_0191813884_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43009_jobhistory_gc_time_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43009_jobhistory_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..b6eb08d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43009_jobhistory_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,78 @@ +:original_name: alm_43009.html + +.. _alm_43009: + +ALM-43009 JobHistory GC Time Exceeds the Threshold +================================================== + +Description +----------- + +The system checks the GC time of the JobHistory process every 60 seconds. This alarm is generated when the detected GC time exceeds the threshold (12 seconds) for three consecutive times. You can change the threshold by choosing **System** > **Threshold Configuration** > **Service** > **Spark** > **JobHistory GC Time** > **Total JobHistory GC Time**. This alarm is cleared when the JobHistory GC time is shorter than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +43009 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the GC time exceeds the threshold, JobHistory may run in low performance. + +Possible Causes +--------------- + +The heap memory of the JobHistory process is overused or inappropriately allocated, causing frequent GC. + +Procedure +--------- + +#. Check the GC time. + + a. Go to the cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **43009** and view the IP address and role name of the instance in **Location**. + + c. Choose **Components** > **Spark** > **Instance** > **JobHistory** (IP address of the instance for which the alarm is generated) > **Customize** > **GC Time of the JobHistory Process**. Click **OK** to view the GC time. + + d. Check whether the GC time of the JobHistory process is longer than 12 seconds. + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_43009__en-us_topic_0191813930_li1011493181634: + + Choose **Components** > **Spark** > **Service Configuration**. Set **Type** to **All** and choose **JobHistory** > **Default**. Increase the value of the **SPARK_DAEMON_MEMORY** parameter as required. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. 
_alm_43009__en-us_topic_0191813930_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43010_heap_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43010_heap_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..e617011 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43010_heap_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold.rst @@ -0,0 +1,78 @@ +:original_name: alm_43010.html + +.. _alm_43010: + +ALM-43010 Heap Memory Usage of the JDBCServer Process Exceeds the Threshold +=========================================================================== + +Description +----------- + +The system checks the JDBCServer process status every 30 seconds. The alarm is generated when the heap memory usage of the JDBCServer process exceeds the threshold (90% of the maximum memory). + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +43010 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the available JDBCServer process heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The heap memory of the JDBCServer process is overused or the heap memory is inappropriately allocated. + +Procedure +--------- + +#. Check the heap memory usage. + + a. Go to the cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **43010** and view the IP address and role name of the instance in **Location**. + + c. Choose **Components** > **Spark** > **Instance** > **JDBCServer** (IP address of the instance for which the alarm is generated) > **Customize** > **Heap Memory Statistics of the JDBCServer Process**. Click **OK** to view the heap memory usage. + + d. Check whether the heap memory usage of JDBCServer has reached the threshold (90% of the maximum heap memory). + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_43010__en-us_topic_0191813931_li1011493181634: + + Choose **Components** > **Spark** > **Service Configuration**. Set **Type** to **All** and choose **JDBCServer** > **Tuning**. Increase the value of the **SPARK_DRIVER_MEMORY** parameter as required. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. 
_alm_43010__en-us_topic_0191813931_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43011_non-heap_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43011_non-heap_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..e062369 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43011_non-heap_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold.rst @@ -0,0 +1,78 @@ +:original_name: alm_43011.html + +.. _alm_43011: + +ALM-43011 Non-Heap Memory Usage of the JDBCServer Process Exceeds the Threshold +=============================================================================== + +Description +----------- + +The system checks the JDBCServer process status every 30 seconds. The alarm is generated when the non-heap memory usage of the JDBCServer process exceeds the threshold (90% of the maximum memory). + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +43011 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the available JDBCServer process non-heap memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The non-heap memory of the JDBCServer process is overused or the non-heap memory is inappropriately allocated. + +Procedure +--------- + +#. Check non-heap memory usage. + + a. Go to the cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **43011** and view the IP address and role name of the instance in **Location**. + + c. Choose **Components** > **Spark** > **Instance** > **JDBCServer** (IP address of the instance for which the alarm is generated) > **Customize** > **Non-heap Memory Statistics of the JDBCServer Process**. Click **OK** to view the non-heap memory usage. + + d. Check whether the non-heap memory usage of JDBCServer has reached the threshold (90% of the maximum non-heap memory). + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_43011__en-us_topic_0191813888_li1011493181634: + + Choose **Components** > **Spark** > **Service Configuration**. Set **Type** to **All** and choose **JDBCServer** > **Tuning**. Increase the value of **-XX:MaxMetaspaceSize** in **spark.driver.extraJavaOptions** as required. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. 
+ - If no, go to :ref:`2 `. + +#. .. _alm_43011__en-us_topic_0191813888_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43012_direct_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43012_direct_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold.rst new file mode 100644 index 0000000..8fb3367 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43012_direct_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold.rst @@ -0,0 +1,78 @@ +:original_name: alm_43012.html + +.. _alm_43012: + +ALM-43012 Direct Memory Usage of the JDBCServer Process Exceeds the Threshold +============================================================================= + +Description +----------- + +The system checks the JDBCServer process status every 30 seconds. The alarm is generated when the direct memory usage of the JDBCServer process exceeds the threshold (90% of the maximum memory). + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +43012 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the available JDBCServer process direct memory is insufficient, a memory overflow occurs and the service breaks down. + +Possible Causes +--------------- + +The direct memory of the JDBCServer process is overused or the direct memory is inappropriately allocated. + +Procedure +--------- + +#. Check the direct memory usage. + + a. Go to the cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **43012** and view the IP address and role name of the instance in **Location**. + + c. Choose **Components** > **Spark** > **Instance** > **JDBCServer** (IP address of the instance for which the alarm is generated) > **Customize** > **Direct Memory Statistics of the JDBCServer Process**. Click **OK** to view the direct memory usage. + + d. Check whether the direct memory usage of the JDBCServer process has reached the threshold (90% of the maximum direct memory). + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_43012__en-us_topic_0191813924_li1011493181634: + + Choose **Components** > **Spark** > **Service Configuration**. Set **Type** to **All** and choose **JDBCServer** > **Tuning**. Increase the value of **-XX:MaxDirectMemorySize** in **spark.driver.extraJavaOptions** as required. + + f. Check whether the alarm is cleared. 
+ + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_43012__en-us_topic_0191813924_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43013_jdbcserver_gc_time_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43013_jdbcserver_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..8364614 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-43013_jdbcserver_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,78 @@ +:original_name: alm_43013.html + +.. _alm_43013: + +ALM-43013 JDBCServer GC Time Exceeds the Threshold +================================================== + +Description +----------- + +The system checks the GC time of the JDBCServer process every 60 seconds. This alarm is generated when the detected GC time exceeds the threshold (12 seconds) for three consecutive times. You can change the threshold by choosing **System** > **Threshold Configuration** > **Service** > **Spark** > **JDBCServer GC Time** > **Total JDBCServer GC Time**. This alarm is cleared when the JDBCServer GC time is shorter than or equal to the threshold. + +Attribute +--------- + +======== ============== ===================== +Alarm ID Alarm Severity Automatically Cleared +======== ============== ===================== +43013 Major Yes +======== ============== ===================== + +Parameters +---------- + +=========== ======================================================= +Parameter Description +=========== ======================================================= +ServiceName Specifies the service for which the alarm is generated. +RoleName Specifies the role for which the alarm is generated. +HostName Specifies the host for which the alarm is generated. +=========== ======================================================= + +Impact on the System +-------------------- + +If the GC time exceeds the threshold, JDBCServer may run in low performance. + +Possible Causes +--------------- + +The heap memory of the JDBCServer process is overused or inappropriately allocated, causing frequent GC. + +Procedure +--------- + +#. Check the GC time. + + a. Go to the cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **43013** and view the IP address and role name of the instance in **Location**. + + c. Choose **Components** > **Spark** > **Instance** > **JDBCServer** (IP address of the instance for which the alarm is generated) > **Customize** > **GC Time of the JDBCServer Process**. Click **OK** to view the GC time. + + d. Check whether the GC time of the JDBCServer process is longer than 12 seconds. + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_43013__en-us_topic_0191813943_li1011493181634: + + Choose **Components** > **Spark** > **Service Configuration**. Set **Type** to **All** and choose **JDBCServer** > **Tuning**. Increase the value of the **SPARK_DRIVER_MEMORY** parameter as required. + + f. Check whether the alarm is cleared. 
+ + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_43013__en-us_topic_0191813943_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-44004_presto_coordinator_resource_group_queuing_tasks_exceed_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-44004_presto_coordinator_resource_group_queuing_tasks_exceed_the_threshold.rst new file mode 100644 index 0000000..ccf760f --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-44004_presto_coordinator_resource_group_queuing_tasks_exceed_the_threshold.rst @@ -0,0 +1,57 @@ +:original_name: alm_44004.html + +.. _alm_44004: + +ALM-44004 Presto Coordinator Resource Group Queuing Tasks Exceed the Threshold +============================================================================== + +Description +----------- + +This alarm is generated when the system detects that the number of queuing tasks in a resource group exceeds the threshold. The system queries the number of queuing tasks in a resource group through the JMX interface. You can choose **Components** > **Presto** > **Service Configuration** (switch **Basic** to **All**) > **Presto** > **resource-groups** to configure a resource group. You can choose **Components** > **Presto** > **Service Configuration** (switch **Basic** to **All**) > **Coordinator** > **Customize** > **resourceGroupAlarm** to configure the threshold of each resource group. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +44004 Major Yes +======== ============== ========== + +Parameter +--------- + +=========== ========================================= +Parameter Description +=========== ========================================= +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +=========== ========================================= + +Impact on the System +-------------------- + +If the number of queuing tasks in a resource group exceeds the threshold, a large number of tasks may be in the queuing state. The Presto task time exceeds the expected value. When the number of queuing tasks in a resource group exceeds the maximum number (**maxQueued**) of queuing tasks in the resource group, new tasks cannot be executed. + +Possible Causes +--------------- + +The resource group configuration is improper or too many tasks in the resource group are submitted. + +Procedure +--------- + +#. Choose **Components** > **Presto** > **Service Configuration** (switch **Basic** to **All**) > **Presto** > **resource-groups** to adjust the resource group configuration. +#. You can choose **Components** > **Presto** > **Service Configuration** (switch **Basic** to **All**) > **Coordinator** > **Customize** > **resourceGroupAlarm** to modify the threshold of each resource group. +#. Collect fault information. + + a. 
Log in to the cluster node based on the host name in the fault information and query the number of queuing tasks based on **Resource Group** in the additional information on the Presto client. + b. Log in to the cluster node based on the host name in the fault information, view the **/var/log/Bigdata/nodeagent/monitorlog/monitor.log** file, and search for resource group information to view the monitoring collection information of the resource group. + c. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-44005_presto_coordinator_process_gc_time_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-44005_presto_coordinator_process_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..4c04690 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-44005_presto_coordinator_process_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,78 @@ +:original_name: alm_44005.html + +.. _alm_44005: + +ALM-44005 Presto Coordinator Process GC Time Exceeds the Threshold +================================================================== + +Description +----------- + +The system collects GC time of the Presto Coordinator process every 30 seconds. This alarm is generated when the GC time exceeds the threshold (exceeds 5 seconds for three consecutive times). You can change the threshold by choosing **System** > **Configure Alarm Threshold** > **Service** > **Presto** > **Coordinator** > **Presto Process Garbage Collection Time** > **Garbage Collection Time of the Coordinator Process** on MRS Manager. This alarm is cleared when the Coordinator process GC time is less than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +44005 Major Yes +======== ============== ========== + +Parameter +--------- + +=========== ========================================= +Parameter Description +=========== ========================================= +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +=========== ========================================= + +Impact on the System +-------------------- + +If the GC time of the Coordinator process is too long, the Coordinator process running performance will be affected and the Coordinator process will even be unavailable. + +Possible Causes +--------------- + +The heap memory of the Coordinator process is overused or inappropriately allocated, causing frequent occurrence of the GC process. + +Procedure +--------- + +#. Check the GC time. + + a. Go to the cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **44005** and view the IP address and role name of the instance in **Location**. + + c. Choose **Components** > **Presto** > **Instances** > **Coordinator** (business IP address of the instance for which the alarm is generated) > **Customize** > **Presto Garbage Collection Time**. Click **OK** to view the GC time. + + d. 
Check whether the GC time of the Coordinator process is longer than 5 seconds. + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_44005__en-us_topic_0225312712_li1011493181634: + + Choose **Components** > **Presto** > **Service Configuration**, and switch **Basic** to **All**. Choose **Presto** > **Coordinator**. Increase the value of **-Xmx** (maximum heap memory) in the **JAVA_OPTS** parameter based on the site requirements. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_44005__en-us_topic_0225312712_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-44006_presto_worker_process_gc_time_exceeds_the_threshold.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-44006_presto_worker_process_gc_time_exceeds_the_threshold.rst new file mode 100644 index 0000000..a0ed338 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/alm-44006_presto_worker_process_gc_time_exceeds_the_threshold.rst @@ -0,0 +1,78 @@ +:original_name: alm_44006.html + +.. _alm_44006: + +ALM-44006 Presto Worker Process GC Time Exceeds the Threshold +============================================================= + +Description +----------- + +The system collects GC time of the Presto Worker process every 30 seconds. This alarm is generated when the GC time exceeds the threshold (exceeds 5 seconds for three consecutive times). You can change the threshold by choosing **System** > **Configure Alarm Threshold** > **Service** > **Presto** > **Worker** > **Presto Garbage Collection Time** > **Garbage Collection Time of the Worker Process** on MRS Manager. This alarm is cleared when the Worker process GC time is shorter than or equal to the threshold. + +Attribute +--------- + +======== ============== ========== +Alarm ID Alarm Severity Auto Clear +======== ============== ========== +44006 Major Yes +======== ============== ========== + +Parameter +--------- + +=========== ========================================= +Parameter Description +=========== ========================================= +ServiceName Service for which the alarm is generated. +RoleName Role for which the alarm is generated. +HostName Host for which the alarm is generated. +=========== ========================================= + +Impact on the System +-------------------- + +If the GC time of the Worker process is too long, the Worker process running performance will be affected and the Worker process will even be unavailable. + +Possible Causes +--------------- + +The heap memory of the Worker process is overused or inappropriately allocated, causing frequent occurrence of the GC process. + +Procedure +--------- + +#. Check the GC time. + + a. Go to the cluster details page and choose **Alarms**. + + b. Select the alarm whose **Alarm ID** is **44006**. Then check the IP address and role name of the instance in **Location**. + + c. 
Choose **Components** > **Presto** > **Instances** > **Worker** (business IP address of the instance for which the alarm is generated) > **Customize** > **Presto Garbage Collection Time**. Click **OK** to view the GC time. + + d. Check whether the GC time of the Worker process is longer than 5 seconds. + + - If yes, go to :ref:`1.e `. + - If no, go to :ref:`2 `. + + e. .. _alm_44006__en-us_topic_0225312713_li3841416113916: + + Choose **Components** > **Presto** > **Service Configuration**, and switch **Basic** to **All**, and choose **Presto** > **Worker** Increase the value of **-Xmx** (maximum heap memory) in the **JAVA_OPTS** parameter based on the site requirements. + + f. Check whether the alarm is cleared. + + - If yes, no further action is required. + - If no, go to :ref:`2 `. + +#. .. _alm_44006__en-us_topic_0225312713_li572522141314: + + Collect fault information. + + a. On MRS Manager, choose **System** > **Export Log**. + b. Contact technical support engineers for help. For details, see `technical support `__. + +Reference +--------- + +None diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/index.rst new file mode 100644 index 0000000..336a38b --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/index.rst @@ -0,0 +1,224 @@ +:original_name: mrs_01_0241.html + +.. _mrs_01_0241: + +Alarm Reference (Applicable to Versions Earlier Than MRS 3.x) +============================================================= + +- :ref:`ALM-12001 Audit Log Dump Failure ` +- :ref:`ALM-12002 HA Resource Is Abnormal ` +- :ref:`ALM-12004 OLdap Resource Is Abnormal ` +- :ref:`ALM-12005 OKerberos Resource Is Abnormal ` +- :ref:`ALM-12006 Node Fault ` +- :ref:`ALM-12007 Process Fault ` +- :ref:`ALM-12010 Manager Heartbeat Interruption Between the Active and Standby Nodes ` +- :ref:`ALM-12011 Data Synchronization Exception Between the Active and Standby Manager Nodes ` +- :ref:`ALM-12012 NTP Service Is Abnormal ` +- :ref:`ALM-12016 CPU Usage Exceeds the Threshold ` +- :ref:`ALM-12017 Insufficient Disk Capacity ` +- :ref:`ALM-12018 Memory Usage Exceeds the Threshold ` +- :ref:`ALM-12027 Host PID Usage Exceeds the Threshold ` +- :ref:`ALM-12028 Number of Processes in the D State on the Host Exceeds the Threshold ` +- :ref:`ALM-12031 User omm or Password Is About to Expire ` +- :ref:`ALM-12032 User ommdba or Password Is About to Expire ` +- :ref:`ALM-12033 Slow Disk Fault ` +- :ref:`ALM-12034 Periodic Backup Failure ` +- :ref:`ALM-12035 Unknown Data Status After Recovery Task Failure ` +- :ref:`ALM-12037 NTP Server Is Abnormal ` +- :ref:`ALM-12038 Monitoring Indicator Dump Failure ` +- :ref:`ALM-12039 GaussDB Data Is Not Synchronized ` +- :ref:`ALM-12040 Insufficient System Entropy ` +- :ref:`ALM-13000 ZooKeeper Service Unavailable ` +- :ref:`ALM-13001 Available ZooKeeper Connections Are Insufficient ` +- :ref:`ALM-13002 ZooKeeper Memory Usage Exceeds the Threshold ` +- :ref:`ALM-14000 HDFS Service Unavailable ` +- :ref:`ALM-14001 HDFS Disk Usage Exceeds the Threshold ` +- :ref:`ALM-14002 DataNode Disk Usage Exceeds the Threshold ` +- :ref:`ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold ` +- :ref:`ALM-14004 Number of Damaged HDFS Blocks Exceeds the 
Threshold ` +- :ref:`ALM-14006 Number of HDFS Files Exceeds the Threshold ` +- :ref:`ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold ` +- :ref:`ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold ` +- :ref:`ALM-14009 Number of Faulty DataNodes Exceeds the Threshold ` +- :ref:`ALM-14010 NameService Service Is Abnormal ` +- :ref:`ALM-14011 HDFS DataNode Data Directory Is Not Configured Properly ` +- :ref:`ALM-14012 HDFS JournalNode Data Is Not Synchronized ` +- :ref:`ALM-16000 Percentage of Sessions Connected to the HiveServer to the Maximum Number Allowed Exceeds the Threshold ` +- :ref:`ALM-16001 Hive Warehouse Space Usage Exceeds the Threshold ` +- :ref:`ALM-16002 Hive SQL Execution Success Rate Is Lower Than the Threshold ` +- :ref:`ALM-16004 Hive Service Unavailable ` +- :ref:`ALM-18000 Yarn Service Unavailable ` +- :ref:`ALM-18002 NodeManager Heartbeat Lost ` +- :ref:`ALM-18003 NodeManager Unhealthy ` +- :ref:`ALM-18004 NodeManager Disk Usability Ratio Is Lower Than the Threshold ` +- :ref:`ALM-18006 MapReduce Job Execution Timeout ` +- :ref:`ALM-19000 HBase Service Unavailable ` +- :ref:`ALM-19006 HBase Replication Sync Failed ` +- :ref:`ALM-25000 LdapServer Service Unavailable ` +- :ref:`ALM-25004 Abnormal LdapServer Data Synchronization ` +- :ref:`ALM-25500 KrbServer Service Unavailable ` +- :ref:`ALM-27001 DBService Is Unavailable ` +- :ref:`ALM-27003 DBService Heartbeat Interruption Between the Active and Standby Nodes ` +- :ref:`ALM-27004 Data Inconsistency Between Active and Standby DBServices ` +- :ref:`ALM-28001 Spark Service Unavailable ` +- :ref:`ALM-26051 Storm Service Unavailable ` +- :ref:`ALM-26052 Number of Available Supervisors in Storm Is Lower Than the Threshold ` +- :ref:`ALM-26053 Slot Usage of Storm Exceeds the Threshold ` +- :ref:`ALM-26054 Heap Memory Usage of Storm Nimbus Exceeds the Threshold ` +- :ref:`ALM-38000 Kafka Service Unavailable ` +- :ref:`ALM-38001 Insufficient Kafka Disk Space ` +- :ref:`ALM-38002 Heap Memory Usage of Kafka Exceeds the Threshold ` +- :ref:`ALM-24000 Flume Service Unavailable ` +- :ref:`ALM-24001 Flume Agent Is Abnormal ` +- :ref:`ALM-24003 Flume Client Connection Failure ` +- :ref:`ALM-24004 Flume Fails to Read Data ` +- :ref:`ALM-24005 Data Transmission by Flume Is Abnormal ` +- :ref:`ALM-12041 Permission of Key Files Is Abnormal ` +- :ref:`ALM-12042 Key File Configurations Are Abnormal ` +- :ref:`ALM-23001 Loader Service Unavailable ` +- :ref:`ALM-12357 Failed to Export Audit Logs to OBS ` +- :ref:`ALM-12014 Device Partition Lost ` +- :ref:`ALM-12015 Device Partition File System Read-Only ` +- :ref:`ALM-12043 DNS Parsing Duration Exceeds the Threshold ` +- :ref:`ALM-12045 Read Packet Dropped Rate Exceeds the Threshold ` +- :ref:`ALM-12046 Write Packet Dropped Rate Exceeds the Threshold ` +- :ref:`ALM-12047 Read Packet Error Rate Exceeds the Threshold ` +- :ref:`ALM-12048 Write Packet Error Rate Exceeds the Threshold ` +- :ref:`ALM-12049 Read Throughput Rate Exceeds the Threshold ` +- :ref:`ALM-12050 Write Throughput Rate Exceeds the Threshold ` +- :ref:`ALM-12051 Disk Inode Usage Exceeds the Threshold ` +- :ref:`ALM-12052 Usage of Temporary TCP Ports Exceeds the Threshold ` +- :ref:`ALM-12053 File Handle Usage Exceeds the Threshold ` +- :ref:`ALM-12054 The Certificate File Is Invalid ` +- :ref:`ALM-12055 The Certificate File Is About to Expire ` +- :ref:`ALM-18008 Heap Memory Usage of Yarn ResourceManager Exceeds the Threshold ` +- :ref:`ALM-18009 Heap Memory Usage of MapReduce JobHistoryServer Exceeds the 
Threshold ` +- :ref:`ALM-20002 Hue Service Unavailable ` +- :ref:`ALM-43001 Spark Service Unavailable ` +- :ref:`ALM-43006 Heap Memory Usage of the JobHistory Process Exceeds the Threshold ` +- :ref:`ALM-43007 Non-Heap Memory Usage of the JobHistory Process Exceeds the Threshold ` +- :ref:`ALM-43008 Direct Memory Usage of the JobHistory Process Exceeds the Threshold ` +- :ref:`ALM-43009 JobHistory GC Time Exceeds the Threshold ` +- :ref:`ALM-43010 Heap Memory Usage of the JDBCServer Process Exceeds the Threshold ` +- :ref:`ALM-43011 Non-Heap Memory Usage of the JDBCServer Process Exceeds the Threshold ` +- :ref:`ALM-43012 Direct Memory Usage of the JDBCServer Process Exceeds the Threshold ` +- :ref:`ALM-43013 JDBCServer GC Time Exceeds the Threshold ` +- :ref:`ALM-44004 Presto Coordinator Resource Group Queuing Tasks Exceed the Threshold ` +- :ref:`ALM-44005 Presto Coordinator Process GC Time Exceeds the Threshold ` +- :ref:`ALM-44006 Presto Worker Process GC Time Exceeds the Threshold ` +- :ref:`ALM-18010 Number of Pending Yarn Tasks Exceeds the Threshold ` +- :ref:`ALM-18011 Memory of Pending Yarn Tasks Exceeds the Threshold ` +- :ref:`ALM-18012 Number of Terminated Yarn Tasks in the Last Period Exceeds the Threshold ` +- :ref:`ALM-18013 Number of Failed Yarn Tasks in the Last Period Exceeds the Threshold ` +- :ref:`ALM-16005 Number of Failed Hive SQL Executions in the Last Period Exceeds the Threshold ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + alm-12001_audit_log_dump_failure + alm-12002_ha_resource_is_abnormal + alm-12004_oldap_resource_is_abnormal + alm-12005_okerberos_resource_is_abnormal + alm-12006_node_fault + alm-12007_process_fault + alm-12010_manager_heartbeat_interruption_between_the_active_and_standby_nodes + alm-12011_data_synchronization_exception_between_the_active_and_standby_manager_nodes + alm-12012_ntp_service_is_abnormal + alm-12016_cpu_usage_exceeds_the_threshold + alm-12017_insufficient_disk_capacity + alm-12018_memory_usage_exceeds_the_threshold + alm-12027_host_pid_usage_exceeds_the_threshold + alm-12028_number_of_processes_in_the_d_state_on_the_host_exceeds_the_threshold + alm-12031_user_omm_or_password_is_about_to_expire + alm-12032_user_ommdba_or_password_is_about_to_expire + alm-12033_slow_disk_fault + alm-12034_periodic_backup_failure + alm-12035_unknown_data_status_after_recovery_task_failure + alm-12037_ntp_server_is_abnormal + alm-12038_monitoring_indicator_dump_failure + alm-12039_gaussdb_data_is_not_synchronized + alm-12040_insufficient_system_entropy + alm-13000_zookeeper_service_unavailable + alm-13001_available_zookeeper_connections_are_insufficient + alm-13002_zookeeper_memory_usage_exceeds_the_threshold + alm-14000_hdfs_service_unavailable + alm-14001_hdfs_disk_usage_exceeds_the_threshold + alm-14002_datanode_disk_usage_exceeds_the_threshold + alm-14003_number_of_lost_hdfs_blocks_exceeds_the_threshold + alm-14004_number_of_damaged_hdfs_blocks_exceeds_the_threshold + alm-14006_number_of_hdfs_files_exceeds_the_threshold + alm-14007_hdfs_namenode_memory_usage_exceeds_the_threshold + alm-14008_hdfs_datanode_memory_usage_exceeds_the_threshold + alm-14009_number_of_faulty_datanodes_exceeds_the_threshold + alm-14010_nameservice_service_is_abnormal + alm-14011_hdfs_datanode_data_directory_is_not_configured_properly + alm-14012_hdfs_journalnode_data_is_not_synchronized + alm-16000_percentage_of_sessions_connected_to_the_hiveserver_to_the_maximum_number_allowed_exceeds_the_threshold + alm-16001_hive_warehouse_space_usage_exceeds_the_threshold + 
alm-16002_hive_sql_execution_success_rate_is_lower_than_the_threshold + alm-16004_hive_service_unavailable + alm-18000_yarn_service_unavailable + alm-18002_nodemanager_heartbeat_lost + alm-18003_nodemanager_unhealthy + alm-18004_nodemanager_disk_usability_ratio_is_lower_than_the_threshold + alm-18006_mapreduce_job_execution_timeout + alm-19000_hbase_service_unavailable + alm-19006_hbase_replication_sync_failed + alm-25000_ldapserver_service_unavailable + alm-25004_abnormal_ldapserver_data_synchronization + alm-25500_krbserver_service_unavailable + alm-27001_dbservice_is_unavailable + alm-27003_dbservice_heartbeat_interruption_between_the_active_and_standby_nodes + alm-27004_data_inconsistency_between_active_and_standby_dbservices + alm-28001_spark_service_unavailable + alm-26051_storm_service_unavailable + alm-26052_number_of_available_supervisors_in_storm_is_lower_than_the_threshold + alm-26053_slot_usage_of_storm_exceeds_the_threshold + alm-26054_heap_memory_usage_of_storm_nimbus_exceeds_the_threshold + alm-38000_kafka_service_unavailable + alm-38001_insufficient_kafka_disk_space + alm-38002_heap_memory_usage_of_kafka_exceeds_the_threshold + alm-24000_flume_service_unavailable + alm-24001_flume_agent_is_abnormal + alm-24003_flume_client_connection_failure + alm-24004_flume_fails_to_read_data + alm-24005_data_transmission_by_flume_is_abnormal + alm-12041_permission_of_key_files_is_abnormal + alm-12042_key_file_configurations_are_abnormal + alm-23001_loader_service_unavailable + alm-12357_failed_to_export_audit_logs_to_obs + alm-12014_device_partition_lost + alm-12015_device_partition_file_system_read-only + alm-12043_dns_parsing_duration_exceeds_the_threshold + alm-12045_read_packet_dropped_rate_exceeds_the_threshold + alm-12046_write_packet_dropped_rate_exceeds_the_threshold + alm-12047_read_packet_error_rate_exceeds_the_threshold + alm-12048_write_packet_error_rate_exceeds_the_threshold + alm-12049_read_throughput_rate_exceeds_the_threshold + alm-12050_write_throughput_rate_exceeds_the_threshold + alm-12051_disk_inode_usage_exceeds_the_threshold + alm-12052_usage_of_temporary_tcp_ports_exceeds_the_threshold + alm-12053_file_handle_usage_exceeds_the_threshold + alm-12054_the_certificate_file_is_invalid + alm-12055_the_certificate_file_is_about_to_expire + alm-18008_heap_memory_usage_of_yarn_resourcemanager_exceeds_the_threshold + alm-18009_heap_memory_usage_of_mapreduce_jobhistoryserver_exceeds_the_threshold + alm-20002_hue_service_unavailable + alm-43001_spark_service_unavailable + alm-43006_heap_memory_usage_of_the_jobhistory_process_exceeds_the_threshold + alm-43007_non-heap_memory_usage_of_the_jobhistory_process_exceeds_the_threshold + alm-43008_direct_memory_usage_of_the_jobhistory_process_exceeds_the_threshold + alm-43009_jobhistory_gc_time_exceeds_the_threshold + alm-43010_heap_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold + alm-43011_non-heap_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold + alm-43012_direct_memory_usage_of_the_jdbcserver_process_exceeds_the_threshold + alm-43013_jdbcserver_gc_time_exceeds_the_threshold + alm-44004_presto_coordinator_resource_group_queuing_tasks_exceed_the_threshold + alm-44005_presto_coordinator_process_gc_time_exceeds_the_threshold + alm-44006_presto_worker_process_gc_time_exceeds_the_threshold + alm-18010_number_of_pending_yarn_tasks_exceeds_the_threshold + alm-18011_memory_of_pending_yarn_tasks_exceeds_the_threshold + alm-18012_number_of_terminated_yarn_tasks_in_the_last_period_exceeds_the_threshold + 
alm-18013_number_of_failed_yarn_tasks_in_the_last_period_exceeds_the_threshold + alm-16005_number_of_failed_hive_sql_executions_in_the_last_period_exceeds_the_threshold diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/backing_up_metadata.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/backing_up_metadata.rst new file mode 100644 index 0000000..0743a09 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/backing_up_metadata.rst @@ -0,0 +1,64 @@ +:original_name: mrs_01_0553.html + +.. _mrs_01_0553: + +Backing Up Metadata +=================== + +Scenario +-------- + +To ensure the security of metadata either on a routine basis or before and after performing critical metadata operations (such as scale-out, scale-in, patch installation, upgrades, and migration), metadata must be backed up. The backup data can be used to recover the system if an exception occurs or if the operation has not achieved the expected result. This minimizes the adverse impact on services. Metadata includes data of OMS, LdapServer, DBService, and NameNode. MRS Manager data to be backed up includes OMS data and LdapServer data. + +By default, metadata backup is supported by the **default** task. This section describes how to create a backup task and back up metadata on MRS Manager. Both automatic backup tasks and manual backup tasks are supported. + +Prerequisites +------------- + +- A standby cluster for backing up data has been created, and the network is connected. The inbound rules of the two security groups on the peer cluster have been added to the two security groups in each cluster to allow all access requests of all protocols and ports of all ECSs in the security groups. +- The backup type, period, policy, and other specifications have been planned based on the service requirements and you have checked whether *Data storage path*\ **/LocalBackup/** has sufficient space on the active and standby management nodes. + +Procedure +--------- + +#. Create a backup task. + + a. On MRS Manager, choose **System** > **Back Up Data**. + b. Click **Create Backup Task**. + +#. Configure a backup policy. + + a. Set **Task Name** to the name of the backup task. + + b. Set **Backup Mode** to the type of the backup task. **Periodic** indicates that the backup task is periodically executed. **Manual** indicates that the backup task is manually executed. + + To create a periodic backup task, set the following parameters: + + - **Started**: indicates the time when the task is started for the first time. + - **Period**: indicates the task execution interval. The options include **By hour** and **By day**. + - **Backup Policy**: indicates the volume of data to be backed up in each task execution. The options include **Full backup at the first time and incremental backup later**, **Full backup every time**, and **Full backup once every n times**. If you select **Full backup once every n times**, you need to specify the value of **n**. + +#. Select backup sources. + + In the **Configuration** area, select **OMS** and **LdapServer** under **Metadata**. + +#. Set backup parameters. + + a. Set **Path Type** of **OMS** and **LdapServer** to a backup directory type. 
+ + The following backup directory types are supported: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node and the standby management node automatically synchronizes the backup files. By default, the backup files are stored in *Data storage path*\ **/LocalBackup/**. If you select **LocalDir**, you need to set the maximum number of copies to specify the number of backup files that can be retained in the backup directory. + - **LocalHDFS**: indicates that the backup files are stored in the HDFS directory of the current cluster. If you select **LocalHDFS**, set the following parameters: + + - **Target Path**: indicates the HDFS directory for storing the backup files. The save path cannot be an HDFS hidden directory, such as a snapshot or recycle bin directory, or a default system directory. + - **Max Number of Backup Copies**: indicates the number of backup files that can be retained in the backup directory. + - **Target Instance Name**: indicates the NameService name of the backup directory. The default value is **hacluster**. + + b. Click **OK**. + +#. Execute the backup task. + + In the **Operation** column of the created task in the backup task list, click **Back Up Now** if **Backup Mode** is set to **Periodic** or click **Start** if **Backup Mode** is set to **Manual** to execute the backup task. + + After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The format of the subdirectory name is *Backup task name*\ **\_**\ *Task creation time*, and the subdirectory is used to save data source backup files. The format of the backup file name is *Version_Data source_Task execution time*\ **.tar.gz**. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/index.rst new file mode 100644 index 0000000..d38979c --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/index.rst @@ -0,0 +1,22 @@ +:original_name: mrs_01_0550.html + +.. _mrs_01_0550: + +Backup and Restoration +====================== + +- :ref:`Introduction ` +- :ref:`Backing Up Metadata ` +- :ref:`Restoring Metadata ` +- :ref:`Modifying a Backup Task ` +- :ref:`Viewing Backup and Restoration Tasks ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + introduction + backing_up_metadata + restoring_metadata + modifying_a_backup_task + viewing_backup_and_restoration_tasks diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/introduction.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/introduction.rst new file mode 100644 index 0000000..8f27f1f --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/introduction.rst @@ -0,0 +1,85 @@ +:original_name: mrs_01_0551.html + +.. _mrs_01_0551: + +Introduction +============ + +Purpose +------- + +MRS Manager provides backup and restoration for user data and system data. The backup function is provided based on components to back up Manager data (including OMS data and LdapServer data), Hive user data, component metadata saved in DBService, and HDFS metadata.
+ +Backup and restoration tasks are performed in the following scenarios: + +- Routine backup is performed to ensure the data security of the system and components. +- If the system is faulty, the data backup can be used to recover the system. +- If the active cluster is completely faulty, a mirror cluster identical to the active cluster needs to be created. You can use the backup data to restore the active cluster. + +.. table:: **Table 1** Backing up metadata + + +-------------+-------------------------------------------------------------------------------------------------------------------------+ + | Backup Type | Backup Content | + +=============+=========================================================================================================================+ + | OMS | Database data (excluding alarm data) and configuration data in the cluster management system to be backed up by default | + +-------------+-------------------------------------------------------------------------------------------------------------------------+ + | LdapServer | User information, including the username, password, key, password policy, and group information | + +-------------+-------------------------------------------------------------------------------------------------------------------------+ + | DBService | Metadata of the components (Hive) managed by DBService | + +-------------+-------------------------------------------------------------------------------------------------------------------------+ + | NameNode | HDFS metadata. | + +-------------+-------------------------------------------------------------------------------------------------------------------------+ + +Principles +---------- + +**Task** + +Before backup or restoration, you need to create a backup or restoration task and set task parameters, such as the task name, backup data source, and type of backup file save path. Data backup and restoration can be performed by executing backup and restoration tasks. When the Manager is used to recover the data of HDFS, HBase, Hive, and NameNode, no cluster can be accessed. + +Each backup task can back up data of different data sources and generates an independent backup file for each data source. All the backup files generated in each backup task form a backup file set, which can be used in restoration tasks. Backup data can be stored on Linux local disks, local cluster HDFS, and standby cluster HDFS. The backup task provides the full backup or incremental backup policies. HDFS and Hive backup tasks support the incremental backup policy, while OMS, LdapServer, DBService, and NameNode backup tasks support only the full backup policy. + +.. note:: + + Task execution rules: + + - If a task is being executed, the task cannot be executed repeatedly and other tasks cannot be started at the same time. + - The interval at which a periodical task is automatically executed must be greater than 120s; otherwise, the task is postponed and will be executed in the next period. Manual tasks can be executed at any interval. + - When a periodic task is to be automatically executed, the current time cannot be 120s later than the task start time; otherwise, the task is postponed and executed in the next period. + - When a periodic task is locked, it cannot be automatically executed and needs to be manually unlocked. + - Before an OMS, LdapServer, DBService, or NameNode backup task starts, ensure that the LocalBackup partition on the active management node has more than 20 GB available space. 
Otherwise, the backup task cannot be started. + - When you are planning backup and restoration tasks, select the data to be backed up or restored strictly based on the service logic, data store structure, and database or table association. The system creates a default periodic backup task **default** whose execution interval is 24 hours to perform full backup of OMS, LdapServer, DBService, and NameNode data to the Linux local disk. + +Specifications +-------------- + +.. table:: **Table 2** Backup and restoration feature specifications + + ======================================================= ============== + Item Specifications + ======================================================= ============== + Maximum number of backup or restoration tasks 100 + Number of concurrent running tasks 1 + Maximum number of waiting tasks 199 + Maximum size of backup files on a Linux local disk (GB) 600 + ======================================================= ============== + +.. table:: **Table 3** Specifications of the **default** task + + +---------------------------------+--------------------------------------------------------------------------------+------------+-----------+----------+ + | Item | OMS | LdapServer | DBService | NameNode | + +=================================+================================================================================+============+===========+==========+ + | Backup period | 1 hour | | | | + +---------------------------------+--------------------------------------------------------------------------------+------------+-----------+----------+ + | Maximum number of copies | 2 | | | | + +---------------------------------+--------------------------------------------------------------------------------+------------+-----------+----------+ + | Maximum size of a backup file | 10 MB | 20 MB | 100 MB | 1.5 GB | + +---------------------------------+--------------------------------------------------------------------------------+------------+-----------+----------+ + | Maximum size of disk space used | 20 MB | 40 MB | 200 MB | 3 GB | + +---------------------------------+--------------------------------------------------------------------------------+------------+-----------+----------+ + | Save path of backup data | *Data save path*\ **/LocalBackup/** of the active and standby management nodes | | | | + +---------------------------------+--------------------------------------------------------------------------------+------------+-----------+----------+ + +.. note:: + + The backup data of the **default** task must be periodically transferred and saved outside the cluster based on the enterprise O&M requirements. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/modifying_a_backup_task.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/modifying_a_backup_task.rst new file mode 100644 index 0000000..3b6b2a8 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/modifying_a_backup_task.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_0558.html + +.. _mrs_01_0558: + +Modifying a Backup Task +======================= + +Scenario +-------- + +This section describes how to modify the parameters of a created backup task on MRS Manager to meet changing service requirements. The parameters of restoration tasks can be viewed but not modified. 
+ +Impact on the System +-------------------- + +After a backup task is modified, the new parameters take effect when the task is executed next time. + +Prerequisites +------------- + +- A backup task has been created. +- A new backup task policy has been planned based on the actual situation. + +Procedure +--------- + +#. On MRS Manager, choose **System** > **Back Up Data**. +#. In the task list, locate a specified task, click **Modify** in the **Operation** column to go to the configuration modification page. +#. Modify the following parameters on the displayed page: + + - Manual backup: + + - Target Path + - Max Number of Backup Copies + + - Periodic backup: + + - Started + - Period + - Target Path + - Max Number of Backup Copies + + .. note:: + + - When **Path Type** is set to **LocalHDFS**, **Target Path** is valid for modifying a backup task. + - After you change the value of **Target Path** for a backup task, full backup is performed by default when the task is executed for the first time. + +#. Click **OK**. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/restoring_metadata.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/restoring_metadata.rst new file mode 100644 index 0000000..37408a5 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/restoring_metadata.rst @@ -0,0 +1,113 @@ +:original_name: mrs_01_0555.html + +.. _mrs_01_0555: + +Restoring Metadata +================== + +Scenario +-------- + +You need to restore metadata in the following scenarios: A user modifies or deletes data unexpectedly, data needs to be retrieved, system data becomes abnormal or does not achieve the expected result, all modules are faulty, and data is migrated to a new cluster. + +This section describes how to restore metadata on MRS Manager. Only manual restoration tasks are supported. + +.. important:: + + - Data restoration can be performed only when the system version is consistent with that during data backup. + - To restore the data when services are normal, manually back up the latest management data first and then restore the data. Otherwise, the data that is generated after the data backup and before the data restoration will be lost. + - Use the OMS data and LdapServer data backed up at the same time to restore data. Otherwise, the service and operation may fail. + - By default, MRS clusters use DBService to store Hive metadata. + +Impact on the System +-------------------- + +- After the data is restored, the data generated between the backup time and restoration time is lost. +- After the data is restored, the configuration of the components that depend on DBService may expire and these components need to be restarted. + +Prerequisites +------------- + +- The data in the OMS and LdapServer backup files has been backed up at the same time. +- The status of the OMS resources and the LdapServer instances is normal. If the status is abnormal, data restoration cannot be performed. +- The status of the cluster hosts and services is normal. If the status is abnormal, data restoration cannot be performed. +- The cluster host topologies during data restoration and data backup are the same. If the topologies are different, data restoration cannot be performed and you need to back up data again. +- The services added to the cluster during data restoration and data backup are the same. 
If the services are different, data restoration cannot be performed and you need to back up data again. +- The status of the active and standby DBService instances is normal. If the status is abnormal, data restoration cannot be performed. +- The upper-layer applications depending on the MRS cluster have been stopped. +- On MRS Manager, you have stopped all the NameNode role instances whose data is to be recovered. Other HDFS role instances are running properly. After data is recovered, the NameNode role instances need to be restarted and cannot be accessed before the restart. +- You have checked whether NameNode backup files have been stored in the *Data save path*\ **/LocalBackup/** directory on the active management node. + +Procedure +--------- + +#. Check the location of backup data. + + a. On MRS Manager, choose **System** > **Back Up Data**. + b. In the row where the specified backup task resides, choose **More** > **View History** in the **Operation** column to display the historical execution records of the backup task. In the window that is displayed, select a success record and click **View Backup Path** in the corresponding column to view its backup path information. Find the following information: + + - **Backup Object**: indicates the backup data source. + - **Backup Path**: indicates the full path where backup files are stored. + + c. Select the correct path, and manually copy the full path of backup files in **Backup Path**. + +#. Create a restoration task. + + a. On MRS Manager, choose **System** > **Recovery Management**. + b. On the page that is displayed, click **Create Restoration Task**. + c. Set **Task Name** to the name of the restoration task. + +#. Select restoration sources. + + In **Configuration**, select the metadata component whose data is to be restored. + +#. Set the restoration parameters. + + a. Set **Path Type** to a backup directory type. + b. The settings vary according to backup directory types: + + - **LocalDir**: indicates that the backup files are stored on the local disk of the active management node. If you select **LocalDir**, you need to set **Source Path** to specify the full path of the backup file. For example, *Data storage path*\ **/LocalBackup/**\ *Backup task name*\ **\_**\ *Task creation time*\ **/**\ *Data source*\ **\_**\ *Task execution time*\ **/**\ *Version number*\ **\_**\ *Data source*\ **\_**\ *Task execution time*\ **.tar.gz**. + - **LocalHDFS**: indicates that the backup files are stored in the HDFS directory of the current cluster. If you select **LocalHDFS**, set the following parameters: + + - **Source Path**: indicates the full HDFS path of a backup file, for example, *Backup path/Backup task name_Task creation time/Version_Data source_Task execution time*\ **.tar.gz**. + - **Source Instance Name**: indicates the name of NameService corresponding to the backup directory when a restoration task is being executed. The default value is **hacluster**. + + c. Click **OK**. + +#. Execute the restoration task. + + In the restoration task list, locate the row where the created task resides, and click **Start** in the **Operation** column. + + - After the restoration is successful, the progress bar is displayed in green. + - After the restoration is successful, the restoration task cannot be executed again. + - If the restoration task fails during the first execution, rectify the fault and try to execute the task again by clicking **Start**. + +#. Determine what metadata has been restored.
+ + - If the OMS and LdapServer metadata is restored, go to :ref:`7 `. + - If DBService data is restored, no further action is required. + - Restore NameNode data. On MRS Manager, choose **Services > HDFS > More > Restart Service**. The task is complete. + +#. .. _mrs_01_0555__en-us_topic_0035271577_li3654235411916: + + Restarting Manager for the recovered data to take effect + + a. In MRS Manager, Choose **LdapServer** > **More** > **Restart Service** and click **OK**. Wait until the LdapServer service is restarted successfully. + + b. Log in to the active management node. For details, see :ref:`Determining Active and Standby Management Nodes of Manager `. + + c. Run the following command to restart OMS: + + **sh ${BIGDATA_HOME}/om-0.0.1/sbin/restart-oms.sh** + + The command has been executed successfully if the following information is displayed: + + .. code-block:: + + start HA successfully. + + d. On MRS Manager, choose **KrbServer > More > Synchronize Configuration**. Do not select Restart the services and instances whose configuration has expired. Click **OK** and wait until the KrbServer service configuration is synchronized and restarted successfully. + + e. Choose **Services > More > Synchronize Configuration**. Do not select Restart the services and instances whose configuration has expired. Click **OK** and wait until the cluster is configured and synchronized successfully. + + f. Choose **Services > More > Stop Cluster**. After the cluster is stopped, choose **Services > More> Start Cluster**. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/viewing_backup_and_restoration_tasks.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/viewing_backup_and_restoration_tasks.rst new file mode 100644 index 0000000..d14da99 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/backup_and_restoration/viewing_backup_and_restoration_tasks.rst @@ -0,0 +1,51 @@ +:original_name: mrs_01_0559.html + +.. _mrs_01_0559: + +Viewing Backup and Restoration Tasks +==================================== + +Scenario +-------- + +This section describes how to view created backup and restoration tasks and check their running status on MRS Manager. + +Procedure +--------- + +#. On MRS Manager, click **System**. + +#. Click **Back Up Data** or **Restore Data**. + +#. In the task list, obtain the previous execution result in the **Task Progress** column. Green indicates that the task is executed successfully, and red indicates that the execution fails. + +#. In the **Operation** column of a specified task in the task list, choose **More** > **View History** to view the historical record of backup and restoration execution. + + In the displayed window, click **View** in the **Details** column. The task execution logs and paths are displayed. + +Related Tasks +------------- + +- Modifying a backup task + + For details, see :ref:`Modifying a Backup Task `. + +- Viewing a restoration task + + In the **Operation** column of the specified task in the task list, click **View Details** to view the restoration task. You can only view but cannot modify the parameters of a restoration task. + +- Executing a backup or restoration task + + In the task list, locate a specified task and click **Start** in the **Operation** column to start a backup or restoration task that is ready or fails to be executed. Executed restoration tasks cannot be repeatedly executed. 
+ +- Stopping backup tasks + + In the task list, locate a specified task and click **More** > **Stop** in the **Operation** column to stop a backup task that is running. + +- Deleting a backup or restoration task + + In the **Operation** column of the specified task in the task list, choose **More** > **Delete** to delete the backup or restoration task. After a task is deleted, the backup data is retained by default. + +- Suspending a backup task + + In the **Operation** column of the specified task in the task list, choose **More** > **Suspend** to suspend the backup task. Only periodic backup tasks can be suspended. Suspended backup tasks are no longer executed automatically. When you suspend a backup task that is being executed, the task execution stops. To cancel the suspension status of a task, click **More** > **Resume**. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/checking_running_tasks.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/checking_running_tasks.rst new file mode 100644 index 0000000..ce79e95 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/checking_running_tasks.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0105.html + +.. _mrs_01_0105: + +Checking Running Tasks +====================== + +Scenario +-------- + +When you perform operations on MRS Manager to trigger a task, the task execution process and progress are displayed. After the task window is closed, you need to open the task window by using the task management function. + +MRS Manager reserves 10 latest tasks by default, for example, restarting services, synchronizing service configurations, and performing health check. + +Procedure +--------- + +#. On MRS Manager, click |image1| to open the task list. + + You can view the following information in the task list: **Name**, **Status**, **Progress**, **Start Time** and **End Time**. + +#. Click the target task name to view the detailed information about the running task. + +.. |image1| image:: /_static/images/en-us_image_0000001295738144.jpg diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/configuring_the_number_of_health_check_reports_to_be_reserved.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/configuring_the_number_of_health_check_reports_to_be_reserved.rst new file mode 100644 index 0000000..edff2ad --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/configuring_the_number_of_health_check_reports_to_be_reserved.rst @@ -0,0 +1,25 @@ +:original_name: mrs_01_0277.html + +.. _mrs_01_0277: + +Configuring the Number of Health Check Reports to Be Reserved +============================================================= + +Scenario +-------- + +Health check reports of MRS clusters, services, and hosts may vary with the time and scenario. You can modify the number of health check reports to be reserved on MRS Manager for later comparison. + +This setting is valid for health check reports of clusters, services, and hosts. Report files are saved in **$BIGDATA_DATA_HOME/Manager/healthcheck** on the active management node by default and are automatically synchronized to the standby management node. 
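+
+The report files accumulate under this directory, so before raising the number of retained reports it can be useful to confirm how much space they currently occupy. The following commands are a minimal, illustrative check only (assuming the default save path described above), run as user **omm** on the active management node:
+
+.. code-block::
+
+   # Space currently occupied by the health check reports
+   du -sh ${BIGDATA_DATA_HOME}/Manager/healthcheck
+   # Remaining space on the partition that holds the reports
+   df -h ${BIGDATA_DATA_HOME}/Manager/healthcheck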
+ +Prerequisites +------------- + +Users have specified service requirements and planned the save time and health check frequency, and the disk space of the active and standby management nodes is sufficient. + +Procedure +--------- + +#. Choose **System** > **Check Health Status** > **Configure Health Check**. +#. Set **Max. Number of Health Check Reports** to the number of health check reports to be reserved. The value ranges from 1 to 100. The default value is 50. +#. Click **OK** to save the settings. The **Health check configuration saved successfully** is displayed in the upper right corner. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/dbservice_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/dbservice_health_check_indicators.rst new file mode 100644 index 0000000..3ec7c78 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/dbservice_health_check_indicators.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0279.html + +.. _mrs_01_0279: + +DBService Health Check Indicators +================================= + +Service Health Check +-------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the DBService service status is normal. If the status is abnormal, the service is unhealthy. + +**Handling method**: If the indicator is abnormal, rectify the fault by referring to ALM-27001. + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist on the host. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/flume_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/flume_health_check_indicators.rst new file mode 100644 index 0000000..ac2fcff --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/flume_health_check_indicators.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0280.html + +.. _mrs_01_0280: + +Flume Health Check Indicators +============================= + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the Flume service status is normal. If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, rectify the fault by referring to ALM-24000. + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist on the host. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. 
diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/hbase_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/hbase_health_check_indicators.rst new file mode 100644 index 0000000..6fe9437 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/hbase_health_check_indicators.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_0281.html + +.. _mrs_01_0281: + +HBase Health Check Indicators +============================= + +Normal RegionServer Count +------------------------- + +**Indicator**: Normal RegionServer Count + +**Description**: This indicator is used to check the number of RegionServers that are running properly in an HBase cluster. + +**Recovery Guide**: If the indicator is abnormal, check whether the status of RegionServer is normal. If the status is abnormal, resolve the problem and check that the network is normal. + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the HBase service status is normal. If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, check whether the status of HMaster and RegionServer is normal. If the status is abnormal, resolve the problem. Then, check whether the status of the ZooKeeper service is faulty. On the HBase client, check whether the data in the HBase table can be correctly read and locate the data reading failure cause. Handle the alarm following instructions in the alarm processing document. + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/hdfs_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/hdfs_health_check_indicators.rst new file mode 100644 index 0000000..4e72c9d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/hdfs_health_check_indicators.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_0284.html + +.. _mrs_01_0284: + +HDFS Health Check Indicators +============================ + +Average Packet Sending Time +--------------------------- + +**Indicator**: Average Packet Sending Time + +**Description**: This indicator is used to collect statistics on the average time for the DataNode in the HDFS to execute SendPacket each time. If the average time is greater than 2,000,000 ns, the DataNode is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, check whether the network speed of the cluster is normal and whether the memory or CPU usage is too high. Check whether the HDFS load in the cluster is high. + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the HDFS service status is normal. If a node is faulty, the host is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, check whether the health status of the KrbServer, LdapServer and ZooKeeper services are faulty. 
If yes, rectify the fault. Then, check whether the file writing failure is caused by HDFS SafeMode ON. Use the client to check whether data cannot be written into HDFS and locate the cause of the HDFS data writing failure. Handle the alarm following instructions in the alarm processing document. + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/hive_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/hive_health_check_indicators.rst new file mode 100644 index 0000000..7b882b4 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/hive_health_check_indicators.rst @@ -0,0 +1,42 @@ +:original_name: mrs_01_0285.html + +.. _mrs_01_0285: + +Hive Health Check Indicators +============================ + +Maximum Number of Sessions Allowed by HiveServer +------------------------------------------------ + +**Indicator**: Maximum Number of Sessions Allowed by HiveServer + +**Description**: This indicator is used to check the maximum number of sessions that can be connected to Hive. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. + +Number of Sessions Connected to HiveServer +------------------------------------------ + +**Indicator**: Number of Sessions Connected to HiveServer + +**Description**: This indicator is used to check the number of Hive connections. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the Hive service status is normal. If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist on the host. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/host_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/host_health_check_indicators.rst new file mode 100644 index 0000000..397187f --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/host_health_check_indicators.rst @@ -0,0 +1,359 @@ +:original_name: mrs_01_0282.html + +.. _mrs_01_0282: + +Host Health Check Indicators +============================ + +Swap Usage +---------- + +**Indicator**: Swap Usage + +**Description**: Swap usage of the system. The value is calculated using the following formula: Swap usage = Used swap size/Total swap size. Assume that the current threshold is set to 75.0%. 
If the swap usage exceeds the threshold, the system is unhealthy. + +**Recovery Guide**: + +#. Check the swap usage of the node. + + Log in to the unhealthy node and run the **free -m** command to check the total swap space and used swap space. If the swap space usage exceeds the threshold, go to :ref:`2 `. + +#. .. _mrs_01_0282__en-us_topic_0036074522_li6576720111231: + + If the swap usage exceeds the threshold, you are advised to expand the system capacity, for example, by adding nodes. + +Host File Handle Usage +---------------------- + +**Indicator**: Host File Handle Usage + +**Description**: This indicator indicates the file handle usage in the system. Host file handle usage = Number of used handles/Total number of handles. If the usage exceeds the threshold, the system is unhealthy. + +**Recovery Guide**: + +#. Check the file handle usage of the host. + + Log in to the unhealthy node and run the **cat /proc/sys/fs/file-nr** command. In the command output, the first and third columns indicate the number of used handles and the total number of handles, respectively. If the usage exceeds the threshold, go to :ref:`2 `. + +#. .. _mrs_01_0282__en-us_topic_0036074522_li3870258111327: + + If the file handle usage of the host exceeds the threshold, you are advised to check the system and analyze the file handle usage. + +NTP Offset +---------- + +**Indicator**: NTP Offset + +**Description**: This indicator indicates the NTP time offset. If the time offset exceeds the threshold, the system is unhealthy. + +**Recovery Guide**: + +#. Check the NTP time offset. + + Log in to the unhealthy node and run the **/usr/sbin/ntpq -np** command to view the information. In the command output, the **Offset** column indicates the time offset. If the time offset is greater than the threshold, go to :ref:`2 `. + +#. .. _mrs_01_0282__en-us_topic_0036074522_li4827260411417: + + If the indicator is abnormal, check whether the clock source configuration is correct. Contact O&M personnel. + +Average Load +------------ + +**Indicator**: Average Load + +**Description**: Average system load, indicating the average number of processes in the running queue in a specified period. The system average load is calculated using the load value obtained by the **uptime** command. Calculation method: (Load of 1 minute + Load of 5 minutes + Load of 15 minutes)/(3 x Number of CPUs). Assume that the current threshold is set to 2. If the average load exceeds 2, the system is unhealthy. + +**Recovery Guide**: + +#. Log in to the unhealthy node and run the **uptime** command. The last three columns in the command output indicate the load in 1 minute, 5 minutes, and 15 minutes, respectively. If the average system load exceeds the threshold, go to :ref:`2 `. + +#. .. _mrs_01_0282__en-us_topic_0036074522_li1216080711528: + + If the system average load exceeds the threshold, you are advised to perform system capacity expansion, such as adding nodes. + +D State Process +--------------- + +**Indicator**: D State Process + +**Description**: This indicator indicates processes in uninterruptible sleep, that is, processes in the D state. A process that is in the D state is waiting for I/O, such as disk I/O and network I/O, and experiences an I/O exception. If any process in the D state exists in the system, the system is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm by referring to ALM-12028.
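+
+The swap usage, file handle usage, and average load described above can also be calculated directly on the node being checked. The following is a rough, illustrative sketch only; the 75.0% swap threshold and the load divisor of 3 x number of CPUs follow the descriptions in this section and may differ from the thresholds configured on your cluster:
+
+.. code-block::
+
+   # Swap usage = used swap size / total swap size
+   free -m | awk '/^Swap:/ {if ($2 > 0) printf "Swap usage: %.1f%%\n", $3/$2*100}'
+
+   # Host file handle usage = used handles (column 1) / total handles (column 3)
+   awk '{printf "File handle usage: %.1f%%\n", $1/$3*100}' /proc/sys/fs/file-nr
+
+   # Average load = (1-minute load + 5-minute load + 15-minute load) / (3 x number of CPUs)
+   read l1 l5 l15 _ < /proc/loadavg
+   echo "Average load: $(echo "($l1 + $l5 + $l15) / (3 * $(nproc))" | bc -l)"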
+ +Hardware Status +--------------- + +**Indicator**: Hardware Status + +**Description**: This indicator is used to check the system hardware status, including the CPU, memory, disk, power supply, and fan. This indicator obtains related hardware information using **ipmitool sdr elist**. If the hardware status is abnormal, the hardware is unhealthy. + +**Recovery Guide**: + +#. Log in to the node where the check result is unhealthy. Run the **ipmitool sdr elist** command to check system hardware status. The last column in the command output indicates the hardware status. If the status is included in the following fault description table, the check result is unhealthy. + + +-----------------------------------+--------------------------------------------+ + | Module | Symptom | + +===================================+============================================+ + | Processor | IERR | + | | | + | | Thermal Trip | + | | | + | | FRB1/BIST failure | + | | | + | | FRB2/Hang in POST failure | + | | | + | | FRB3/Processor startup/init failure | + | | | + | | Configuration Error | + | | | + | | SM BIOS Uncorrectable CPU-complex Error | + | | | + | | Disabled | + | | | + | | Throttled | + | | | + | | Uncorrectable machine check exception | + +-----------------------------------+--------------------------------------------+ + | Power Supply | Failure detected | + | | | + | | Predictive failure | + | | | + | | Power Supply AC lost | + | | | + | | AC lost or out-of-range | + | | | + | | AC out-of-range, but present | + | | | + | | Config Error: Vendor Mismatch | + | | | + | | Config Error: Revision Mismatch | + | | | + | | Config Error: Processor Missing | + | | | + | | Config Error: Power Supply Rating Mismatch | + | | | + | | Config Error: Voltage Rating Mismatch | + | | | + | | Config Error | + +-----------------------------------+--------------------------------------------+ + | Power Unit | 240VA power down | + | | | + | | Interlock power down | + | | | + | | AC lost | + | | | + | | Soft-power control failure | + | | | + | | Failure detected | + | | | + | | Predictive failure | + +-----------------------------------+--------------------------------------------+ + | Memory | Uncorrectable ECC | + | | | + | | Parity | + | | | + | | Memory Scrub Failed | + | | | + | | Memory Device Disabled | + | | | + | | Correctable ECC logging limit reached | + | | | + | | Configuration Error | + | | | + | | Throttled | + | | | + | | Critical Overtemperature | + +-----------------------------------+--------------------------------------------+ + | Drive Slot | Drive Fault | + | | | + | | Predictive Failure | + | | | + | | Parity Check In Progress | + | | | + | | In Critical Array | + | | | + | | In Failed Array | + | | | + | | Rebuild In Progress | + | | | + | | Rebuild Aborted | + +-----------------------------------+--------------------------------------------+ + | Battery | Low | + | | | + | | Failed | + +-----------------------------------+--------------------------------------------+ + +#. If the indicator is abnormal, contact O&M personnel. + +Host Name +--------- + +**Indicator**: Host Name + +**Description**: This indicator is used to check whether the host name is set. If the host name is not set, the system is unhealthy. If the indicator is abnormal, you are advised to set the host name properly. + +**Recovery Guide**: + +#. Log in to the node where the check result is unhealthy. + +#. 
Run the hostname host name command to change the host name to ensure that the host name is consistent with the planned host name. + + **hostname**\ *host name* For example, to change the host name to **Bigdata-OM-01**, run the **hostname Bigdata-OM-01** command. + +#. Modify the host name configuration file. + + Run the **vi /etc/HOSTNAME** command to edit the file. Change the file content to **Bigdata-OM-01**. Save the file, and exit. + +Umask +----- + +**Indicator**: Umask + +**Description**: This indicator is used to check whether the umask setting of user **omm** is correct. If Umask is not 0077, the system is unhealthy. + +**Recovery Guide**: + +#. If the indicator is abnormal, you are advised to set umask of user **omm** to 0077. Log in to the unhealthy node and run the **su - omm** command to switch to user **omm**. +#. Run the **vi ${BIGDATA_HOME}/.om_profile** command and change the value of **umask** to **0077**. Save and exit. + +OMS HA Status +------------- + +**Indicator**: OMS HA Status + +**Description**: This indicator is used to check whether the OMS two-node cluster resources are normal. You can run the **${CONTROLLER_HOME}/sbin/status-oms.sh** command to view the detailed information about the status of the OMS two-node cluster resources. If any module is abnormal, the OMS is unhealthy. + +**Recovery Guide**: + +#. Log in to the active management node and run the **su - omm** command to switch to user **omm**. Run the **${CONTROLLER_HOME}/sbin/status-oms.sh** command to check the OMS status. + +#. If floatip, okerberos, and oldap are abnormal, handle the problems by referring to ALM-12002, ALM-12004, and ALM-12005 respectively. + +#. If other resources are abnormal, you are advised to view the logs of the faulty modules. + + If controller resources are abnormal, view **/var/log/Bigdata/controller/controller.log** of the faulty node. + + If CEP resources are abnormal, view **/var/log/Bigdata/omm/oms/cep/cep.log** of the faulty node. + + If AOS resources are abnormal, view **/var/log/Bigdata/controller/aos/aos.log** of the faulty node. + + If feed_watchdog resources are abnormal, view **/var/log/Bigdata/watchdog/watchdog.log** of the abnormal node. + + If HTTPD resources are abnormal, view **/var/log/Bigdata/httpd/error_log** of the abnormal node. + + If FMS resources are abnormal, view **/var/log/Bigdata/omm/oms/fms/fms.log** of the abnormal node. + + If PMS resources are abnormal, view **/var/log/Bigdata/omm/oms/pms/pms.log** of the abnormal node. + + If IAM resources are abnormal, view **/var/log/Bigdata/omm/oms/iam/iam.log** of the abnormal node. + + If the GaussDB resource is abnormal, check the **/var/log/Bigdata/omm/oms/db/omm_gaussdba.log** of the abnormal node. + + If NTP resources are abnormal, view **/var/log/Bigdata/omm/oms/ha/scriptlog/ha_ntp.log** of the abnormal node. + + If Tomcat resources are abnormal, view **/var/log/Bigdata/tomcat/catalina.log** of the abnormal node. + +#. If the fault cannot be rectified based on the logs, contact O&M personnel and send the collected fault logs. + +Checking the Installation Directory and Data Directory +------------------------------------------------------ + +**Indicator**: Installation Directory and Data Directory Check + +**Description**: This indicator checks the **lost+found** directory in the root directory of the disk partition where the installation directory (**/opt/Bigdata** by default) is located. If the directory contains the files of user **omm**, there are exceptions. 
When a node is abnormal, related files are stored in the **lost+found** directory. This indicator is used to check whether files are lost in such scenarios. Check the installation directory (for example, **/opt/Bigdata**) and data directory (for example, **/srv/BigData**). If any files of non-omm users exist in the two directories, the system is unhealthy. + +**Recovery Guide**: + +#. Log in to the unhealthy node and run the **su - omm** command to switch to user **omm**. Check whether files or folders of user **omm** exist in the **lost+found** directory. + + If the **omm** user file exists, you are advised to restore it and check again. If the **omm** user file does not exist, go to :ref:`2 `. + +#. .. _mrs_01_0282__en-us_topic_0036074522_li557697581195: + + Check whether files or folders of other users exist in the installation directory and data directory. If the files and folders are manually generated temporary files, you are advised to delete them and check again. + +CPU Usage +--------- + +**Indicator**: CPU Usage + +**Description**: This indicator is used to check whether the CPU usage exceeds the threshold. If the CPU usage exceeds the threshold, the system is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm by referring to ALM-12016. + +Memory Usage +------------ + +**Indicator**: Memory Usage + +**Description**: This indicator is used to check whether the memory usage exceeds the threshold. If the memory usage exceeds the threshold, the system is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm by referring to ALM-12018. + +Host Disk Usage +--------------- + +**Indicator**: Host Disk Usage + +**Description**: This indicator is used to check whether the host disk usage exceeds the threshold. If the disk usage exceeds the threshold, the system is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm by referring to ALM-12017. + +Host Disk Write Rate +-------------------- + +**Indicator**: Host Disk Write Rate + +**Description**: This indicator is used to check the disk write rate of a host. The write rate of the host disk may vary according to the service scenario. Therefore, the value of this indicator is for reference only. You need to determine whether the indicator is normal based on the actual service scenario. + +**Recovery Guide**: Determine whether the current disk write rate is normal based on the service scenario. + +Host Disk Read Rate +------------------- + +**Indicator**: Host Disk Read Rate + +**Description**: This indicator is used to check the disk read rate of a host. The read rate of the host disk may vary by service scenario. Therefore, the value of this indicator is for reference only. You need to determine whether the indicator is normal based on the actual service scenario. + +**Recovery Guide**: Determine whether the current disk read rate is normal based on the service scenario. + +Host Service Plane Network Status +--------------------------------- + +**Indicator**: Host Service Plane Network Status + +**Description**: This indicator is used to check the connectivity of the service plane network of the cluster host. If the hosts are disconnected, the cluster is unhealthy.
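+
+Connectivity on the service plane can be probed with a simple **ping** between cluster nodes. The following is an illustrative example only; **192.168.0.12** is a placeholder for the service plane IP address of a peer node and must be replaced with the actual address in your environment:
+
+.. code-block::
+
+   # Send three probe packets to the service plane IP address of a peer node
+   ping -c 3 192.168.0.12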
+ +**Recovery Guide**: If the single-plane networking is used, check the IP address of the single plane. For a dual-plane network, the operation procedure is as follows: + +#. Check the network connectivity between the service plane IP addresses of the active and standby management nodes. + + If the network is abnormal, go to :ref:`3 `. + + If the network is normal, go to :ref:`2 `. + +#. .. _mrs_01_0282__en-us_topic_0036074522_li12524768111343: + + Check the network connectivity between the IP address of the active management node and the IP address of the abnormal node in the cluster. + +#. .. _mrs_01_0282__en-us_topic_0036074522_li45614056111343: + + If the network is disconnected, contact O&M personnel to rectify the network fault to ensure that the network meets service requirements. + +Host Status +----------- + +**Indicator**: Host Status + +**Description**: This indicator is used to check whether the host status is normal. If a node is faulty, the host is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, rectify the fault by referring to ALM-12006. + +Alarm Check +----------- + +**Indicator**: Alarm Check + +**Description**: This indicator is used to check whether alarms exist on the host. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/index.rst new file mode 100644 index 0000000..e67fd20 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/index.rst @@ -0,0 +1,52 @@ +:original_name: mrs_01_0271.html + +.. _mrs_01_0271: + +Health Check Management +======================= + +- :ref:`Performing a Health Check ` +- :ref:`Viewing and Exporting a Health Check Report ` +- :ref:`Configuring the Number of Health Check Reports to Be Reserved ` +- :ref:`Managing Health Check Reports ` +- :ref:`DBService Health Check Indicators ` +- :ref:`Flume Health Check Indicators ` +- :ref:`HBase Health Check Indicators ` +- :ref:`Host Health Check Indicators ` +- :ref:`HDFS Health Check Indicators ` +- :ref:`Hive Health Check Indicators ` +- :ref:`Kafka Health Check Indicators ` +- :ref:`KrbServer Health Check Indicators ` +- :ref:`LdapServer Health Check Indicators ` +- :ref:`Loader Health Check Indicators ` +- :ref:`MapReduce Health Check Indicators ` +- :ref:`OMS Health Check Indicators ` +- :ref:`Spark Health Check Indicators ` +- :ref:`Storm Health Check Indicators ` +- :ref:`Yarn Health Check Indicators ` +- :ref:`ZooKeeper Health Check Indicators ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + performing_a_health_check + viewing_and_exporting_a_health_check_report + configuring_the_number_of_health_check_reports_to_be_reserved + managing_health_check_reports + dbservice_health_check_indicators + flume_health_check_indicators + hbase_health_check_indicators + host_health_check_indicators + hdfs_health_check_indicators + hive_health_check_indicators + kafka_health_check_indicators + krbserver_health_check_indicators + ldapserver_health_check_indicators + loader_health_check_indicators + mapreduce_health_check_indicators + oms_health_check_indicators + spark_health_check_indicators + storm_health_check_indicators + yarn_health_check_indicators + zookeeper_health_check_indicators diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/kafka_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/kafka_health_check_indicators.rst new file mode 100644 index 0000000..2ef4466 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/kafka_health_check_indicators.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_0288.html + +.. _mrs_01_0288: + +Kafka Health Check Indicators +============================= + +Number of Available Broker Nodes +-------------------------------- + +**Indicator**: Number of Brokers + +**Description**: This indicator is used to check the number of available Broker nodes in a cluster. If the number of available Broker nodes in a cluster is less than 2, the cluster is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, go to the Kafka service instance page and click the host name of the unavailable Broker instance. View the host health status in the **Overview** area. If the host health status is **Good**, rectify the fault by referring to the alarm handling suggestions in **Process Fault**. If the status is not **Good**, rectify the fault by referring to the handling procedure of the **Node Fault** alarm. + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the Kafka service status is normal. If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, rectify the fault by referring to the alarm "Kafka Service Unavailable". + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/krbserver_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/krbserver_health_check_indicators.rst new file mode 100644 index 0000000..231b2ec --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/krbserver_health_check_indicators.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_0289.html + +.. 
_mrs_01_0289: + +KrbServer Health Check Indicators +================================= + +KerberosAdmin Service Availability +---------------------------------- + +**Indicator**: KerberosAdmin Service Availability + +**Description**: The system checks the KerberosAdmin service status. If the check result is abnormal, the KerberosAdmin service is unavailable. + +**Recovery Guide**: If the indicator check result is abnormal, the possible cause is that the node where the KerberosAdmin service is located is faulty or the SlapdServer service is unavailable. During the KerberosAdmin service recovery, try the following operations: + +#. Check whether the node where the KerberosAdmin service locates is faulty. +#. Check whether the SlapdServer service is unavailable. + +KerberosServer Service Availability +----------------------------------- + +**Indicator**: KerberosServer Service Availability + +**Description**: The system checks the KerberosServer service status. If the check result is abnormal, the KerberosServer service is unavailable. + +**Recovery Guide**: If the indicator check result is abnormal, the possible cause is that the node where the KerberosServer service is located is faulty or the SlapdServer service is unavailable. During the KerberosServer service recovery, try the following operations: + +#. Check whether the node where the KerberosServer service locates is faulty. +#. Check whether the SlapdServer service is unavailable. + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: The system checks the KrbServer service status. If the check result is abnormal, the KrbServer service is unavailable. + +**Recovery Guide**: If the indicator check result is abnormal, the possible cause is that the node where the KrbServer service resides is faulty or the LdapServer service is unavailable. For details, see the handling procedure of ALM-25500. + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check the alarm information about the KrbServer service. If any alarms exist, the KrbServer service may be abnormal. + +**Recovery Guide**: If this indicator check result is abnormal, see the related alarm document to handle the alarms. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/ldapserver_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/ldapserver_health_check_indicators.rst new file mode 100644 index 0000000..2d95e03 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/ldapserver_health_check_indicators.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_0291.html + +.. _mrs_01_0291: + +LdapServer Health Check Indicators +================================== + +SlapdServer Service Availability +-------------------------------- + +**Indicator**: SlapdServer Service Availability + +**Description**: The system checks the SlapdServer service status. If the status is abnormal, the SlapdServer service is unavailable. + +**Recovery Guide**: If the indicator check result is abnormal, the possible cause is that the node where the SlapdServer service is located is faulty or the SlapdServer process is faulty. During the SlapdServer service recovery, try the following operations: + +#. Check whether the node where the SlapdServer service locates is faulty. For details, see ALM-12006. +#. 
Check whether the SlapdServer process is normal. For details, see ALM-12007. + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check the alarm information about the LdapServer service. If the status is abnormal, the LdapServer service is unavailable. + +**Recovery Guide**: If the indicator check result is abnormal, the possible cause is that the node where the active LdapServer service resides is faulty or the active LdapServer process is faulty. For details, see ALM-25000. + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check the alarm information about the LdapServer service. If any alarms exist, the LdapServer service may be abnormal. + +**Recovery Guide**: If this indicator check result is abnormal, see the related alarm document to handle the alarms. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/loader_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/loader_health_check_indicators.rst new file mode 100644 index 0000000..07b1993 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/loader_health_check_indicators.rst @@ -0,0 +1,78 @@ +:original_name: mrs_01_0292.html + +.. _mrs_01_0292: + +Loader Health Check Indicators +============================== + +ZooKeeper Health Status +----------------------- + +**Indicator**: ZooKeeper health status + +**Description**: This indicator is used to check whether the ZooKeeper health status is normal. If the status is abnormal, the ZooKeeper service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. + +HDFS Health Status +------------------ + +**Indicator**: HDFS health status + +**Description**: This indicator is used to check whether the HDFS health status is normal. If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. + +DBService Health Status +----------------------- + +**Indicator**: DBService Health Status + +**Description**: This indicator is used to check whether the DBService health status is normal. If the status is abnormal, the DBService service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. + +Yarn Health Status +------------------ + +**Indicator**: Yarn health status + +**Description**: This indicator is used to check whether the Yarn health status is normal. If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. + +MapReduce Health Status +----------------------- + +**Indicator**: MapReduce Health Status + +**Description**: This indicator is used to check whether the MapReduce health status is normal. If the status is abnormal, the MapReduce service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. + +Loader Process Status +--------------------- + +**Indicator**: Loader Process Status + +**Description**: This indicator is used to check whether the Loader process is normal. 
If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the Loader service status is normal. If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist for loader. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/managing_health_check_reports.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/managing_health_check_reports.rst new file mode 100644 index 0000000..efe7396 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/managing_health_check_reports.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_0278.html + +.. _mrs_01_0278: + +Managing Health Check Reports +============================= + +Scenario +-------- + +On MRS Manager, users can manage historical health check reports, for example, viewing, downloading, and deleting historical health check reports. + +Procedure +--------- + +- Download a specified health check report. + + #. Choose **System** > **Check Health Status**. + #. Locate the row that contains the target health check report and click **Download** to download the report file. + +- Download specified health check reports in batches. + + #. Choose **System** > **Check Health Status**. + #. Select multiple health check reports and click **Download File** to download them. + +- Delete a specified health check report. + + #. Choose **System** > **Check Health Status**. + #. Locate the row that contains the target health check report and click **Delete** to delete the report file. + +- Delete specified health check reports in batches. + + #. Choose **System** > **Check Health Status**. + #. Select multiple health check reports and click **Delete File** to delete them. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/mapreduce_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/mapreduce_health_check_indicators.rst new file mode 100644 index 0000000..cdde7c3 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/mapreduce_health_check_indicators.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0293.html + +.. _mrs_01_0293: + +MapReduce Health Check Indicators +================================= + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the MapReduce service status is normal. If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. 
+ +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/oms_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/oms_health_check_indicators.rst new file mode 100644 index 0000000..25a8451 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/oms_health_check_indicators.rst @@ -0,0 +1,195 @@ +:original_name: mrs_01_0294.html + +.. _mrs_01_0294: + +OMS Health Check Indicators +=========================== + +OMS Status Check +---------------- + +**Indicator**: OMS Status Check + +**Description**: The OMS status check includes the HA status check and resource status check. The HA status includes **active**, **standby**, and **NULL**, indicating the active node, standby node, and unknown, respectively. The resource status includes normal, abnormal, and NULL. If the HA status is NULL, the HA status is unhealthy. If the resource status is NULL or abnormal, the resource status is unhealthy. + +.. table:: **Table 1** OMS status description + + +-----------------------------------+--------------------------------------------------------+ + | Name | Description | + +===================================+========================================================+ + | HA state | **active**: indicates the active node. | + | | | + | | **standby**: indicates the standby node. | + | | | + | | **NULL**: unknown | + +-----------------------------------+--------------------------------------------------------+ + | Resource status | **normal**: All resources are normal. | + | | | + | | **abnormal**: indicates that abnormal resources exist. | + | | | + | | **NULL**: unknown | + +-----------------------------------+--------------------------------------------------------+ + +**Recovery Guide**: + +#. Log in to the active management node and run the **su - omm** command to switch to user **omm**. Run the **${CONTROLLER_HOME}/sbin/status-oms.sh** command to check the status of OMS. +#. If the HA status is NULL, the system may be restarting. NULL is an intermediate state, and the HA status will automatically change to a normal state. +#. If the resource status is abnormal, certain component resources of FusionInsight Manager are abnormal. Check whether the status of components such as acs, aos, cep, controller, feed_watchdog, fms, gaussDB, httpd, iam, ntp, okerberos, oldap, pms, and tomcat is normal. +#. If any Manager component resource is abnormal, see Manager component status check to rectify the fault. + +Manager Component Status Check +------------------------------ + +**Indicator**: Manager Component Status Check + +**Description**: This indicator is used to check the running status and HA status of Manager components. The resource running status includes **Normal** and **Abnormal**, and the resource HA status includes **Normal** and **Exception**. Manager components include Acs, Aos, Cep, Controller, feed_watchdog, Floatip, Fms, GaussDB, HeartBeatCheck, httpd, IAM, NTP, Okerberos, OLDAP, PMS, and Tomcat. If the running status or the HA status is not Normal, the check result is unhealthy. + +.. 
table:: **Table 2** Manager status description + + +-----------------------------------+-----------------------------------------------------------------------+ + | Name | Description | + +===================================+=======================================================================+ + | Resource running status: | **Normal**: The system is running properly. | + | | | + | | **Abnormal**: The running is abnormal. | + | | | + | | **Stopped**: The task is stopped. | + | | | + | | **Unknown**: The status is unknown. | + | | | + | | **Starting**: The process is being started. | + | | | + | | **Stopping**: The task is being stopped. | + | | | + | | **Active_normal**: The active node is running properly. | + | | | + | | **Standby_normal**: The standby node is running properly. | + | | | + | | **Raising_active**: The node is being promoted to be the active node. | + | | | + | | **Lowing_standby**: The node is being set to be the standby node. | + | | | + | | **No_action**: the action does not exist. | + | | | + | | **Repairing**: The disk is being repaired. | + | | | + | | **NULL**: unknown | + +-----------------------------------+-----------------------------------------------------------------------+ + | Resource HA status | **Normal**: the status is normal. | + | | | + | | **Exception**: indicates a fault. | + | | | + | | **Non_steady**: indicates the non-steady state. | + | | | + | | **Unknown**: unknown | + | | | + | | **NULL**: unknown | + +-----------------------------------+-----------------------------------------------------------------------+ + +**Recovery Guide**: + +#. Log in to the active management node and run the **su - omm** command to switch to user **omm**. Run the **${CONTROLLER_HOME}/sbin/status-oms.sh** command to check the status of OMS. + +#. If floatip, okerberos, and oldap are abnormal, handle the problems by referring to ALM-12002, ALM-12004, and ALM-12005 respectively. + +#. If other resources are abnormal, you are advised to view the logs of the faulty modules. + + If controller resources are abnormal, view **/var/log/Bigdata/controller/controller.log** of the faulty node. + + If CEP resources are abnormal, view **/var/log/Bigdata/omm/oms/cep/cep.log** of the faulty node. + + If AOS resources are abnormal, view **/var/log/Bigdata/controller/aos/aos.log** of the faulty node. + + If feed_watchdog resources are abnormal, view **/var/log/Bigdata/watchdog/watchdog.log** of the abnormal node. + + If HTTPD resources are abnormal, view **/var/log/Bigdata/httpd/error_log** of the abnormal node. + + If FMS resources are abnormal, view **/var/log/Bigdata/omm/oms/fms/fms.log** of the abnormal node. + + If PMS resources are abnormal, view **/var/log/Bigdata/omm/oms/pms/pms.log** of the abnormal node. + + If IAM resources are abnormal, view **/var/log/Bigdata/omm/oms/iam/iam.log** of the abnormal node. + + If the GaussDB resource is abnormal, check the **/var/log/Bigdata/omm/oms/db/omm_gaussdba.log** of the abnormal node. + + If NTP resources are abnormal, view **/var/log/Bigdata/omm/oms/ha/scriptlog/ha_ntp.log** of the abnormal node. + + If Tomcat resources are abnormal, view **/var/log/Bigdata/tomcat/catalina.log** of the abnormal node. + +#. If the fault cannot be rectified based on the logs, contact O&M personnel and send the collected fault logs. + +OMA Running Status +------------------ + +**Indicator**: OMA Running Status + +**Description**: This indicator is used to check the running status of the OMA. The status can be **Running** or **Stopped**. 
If the OMA is **Stopped**, the OMA is unhealthy. + +**Recovery Guide**: + +#. Log in to the unhealthy node and run the **su - omm** command to switch to user **omm**. + +#. Run **${OMA_PATH}/restart_oma_app** to manually start the OMA and check again. If the check result is still unhealthy, go to :ref:`3 `. + +#. .. _mrs_01_0294__en-us_topic_0035251772_li62867080113130: + + If manually starting the OMA cannot resolve the problem, you are advised to check the OMA logs in **/var/log/Bigdata/omm/oma/omm_agent.log**. + +#. If the fault cannot be rectified based on the logs, contact O&M personnel and send the collected fault logs. + +SSH Trust Between Each Node and the Active Management Node +---------------------------------------------------------- + +**Indicator**: SSH Trust Between Each Node and the Active Management Node + +**Description**: This indicator is used to check whether the SSH mutual trust is normal. If you can switch to another node through SSH from the active OMS node as user omm without entering a password, SSH communication is normal. Otherwise, SSH communication is abnormal. In addition, if you can switch to another node through SSH from the active OMS node but fail to switch to the active OMS node from the other nodes, SSH communication is abnormal. + +**Recovery Guide**: + +#. If the indicator check result is abnormal, the SSH trust relationships between the nodes and the active management node are abnormal. In this case, check whether the **/home/omm** directory is owned by user **omm**. If the directory is owned by another user, the SSH trust relationship may be abnormal. You are advised to run **chown omm:wheel** to correct the ownership and check again. If the ownership of the **/home/omm** directory is correct, go to :ref:`2 `. + +#. .. _mrs_01_0294__en-us_topic_0035251772_li39886844113155: + + The SSH trust relationship exception may cause heartbeat exceptions between Controller and NodeAgent, resulting in node fault alarms. In this case, rectify the fault by referring to the handling procedure of ALM-12006. + +Process Running Time +-------------------- + +**Indicator**: Running Time of NodeAgent, Controller, and Tomcat + +**Description**: This indicator is used to check the running time of the NodeAgent, Controller, and Tomcat processes. If the time is less than half an hour (1,800s), the process may have been restarted. You are advised to check the process after half an hour. If multiple check results indicate that the process runs for less than half an hour, the process is abnormal. + +**Recovery Guide**: + +#. Log in to the unhealthy node and run the **su - omm** command to switch to user **omm**. + +#. Run the following command to check the PID based on the process name: + + **ps -ef \| grep NodeAgent** + +#. Run the following command to check the process startup time based on the PID: + + **ps -p pid -o lstart** + +#. Check whether the process start time is normal. If the process restarts repeatedly, go to :ref:`5 `. + +#. .. _mrs_01_0294__en-us_topic_0035251772_li38659710113226: + + View the related logs and analyze restart causes. + + If the runtime of NodeAgent is abnormal, check **/var/log/Bigdata/nodeagent/agentlog/agent.log**. + + If the Controller running time is abnormal, check the **/var/log/Bigdata/controller/controller.log** file. + + If the Tomcat running time is abnormal, check the **/var/log/Bigdata/tomcat/web.log** file. + +#. 
If the fault cannot be rectified based on the logs, contact O&M personnel and send the collected fault logs. + +Account and Password Expiration Check +------------------------------------- + +**Indicator**: Account and Password Expiration Check + +**Description**: This indicator checks the two operating system users **omm** and **ommdba** of MRS. For OS users, both the account and password expiration time must be checked. If the validity period of the account or password is not greater than 15 days, the account is abnormal. + +**Recovery Guide**: If the validity period of the account or password is less than or equal to 15 days, contact O&M personnel. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/performing_a_health_check.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/performing_a_health_check.rst new file mode 100644 index 0000000..d429e52 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/performing_a_health_check.rst @@ -0,0 +1,58 @@ +:original_name: mrs_01_0274.html + +.. _mrs_01_0274: + +Performing a Health Check +========================= + +Scenario +-------- + +To ensure that cluster parameters, configurations, and monitoring are correct and that the cluster can run stably for a long time, you can perform a health check during routine maintenance. + +.. note:: + + A system health check includes MRS Manager, service-level, and host-level health checks: + + - MRS Manager health checks focus on whether the unified management platform can provide management functions. + - Service-level health checks focus on whether components can provide services properly. + - Host-level health checks focus on whether host indicators are normal. + + The system health check includes three types of check items: health status, related alarms, and customized monitoring indicators for each check object. The health check results are not always the same as the **Health Status** on the portal. + +Procedure +--------- + +- Manually perform the health check for all services. + + #. Click **Services** and select the target service. + #. Choose **More** > **Start Service Health Check** to start the health check for the service. + + .. note:: + + - The cluster health check includes Manager, service, and host status checks. + - To perform cluster health checks, you can also choose **System** > **Check Health Status** > **Start Cluster Health Check** on MRS Manager. + - To export the health check result, click **Export Report** in the upper left corner. + +- Manually perform the health check for a service. + + #. Click **Services**. In the services list, click the desired service name. + #. Choose **More** > **Start Service Health Check** to start the health check for the service. + +- Manually perform the health check for a host. + + #. Click **Hosts**. + #. Select the check box of the host for which you want to check the health status. + #. Choose **More** > **Start Host Health Check** to start the health check for the host. + +- Automatically perform a health check. + + #. Click **System**. + + #. Click **Check Health Status** under **Maintenance**. + + #. Click **Configure Health Check** to configure automatic health check items. + + **Periodic Health Check**: specifies whether to enable automatic health check. The **Periodic Health Check** function is disabled by default. 
You can click to enable the function and select **Daily**, **Weekly**, or **Monthly** based on management requirements. + + #. Click **OK** to save the settings. The message **Health check configuration saved successfully** is displayed in the upper right corner. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/spark_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/spark_health_check_indicators.rst new file mode 100644 index 0000000..83e151c --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/spark_health_check_indicators.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0530.html + +.. _mrs_01_0530: + +Spark Health Check Indicators +============================= + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the Spark service status is normal. If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, rectify the fault by referring to ALM-28001. + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/storm_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/storm_health_check_indicators.rst new file mode 100644 index 0000000..a8ba08d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/storm_health_check_indicators.rst @@ -0,0 +1,42 @@ +:original_name: mrs_01_0531.html + +.. _mrs_01_0531: + +Storm Health Check Indicators +============================= + +Number of Working Nodes +----------------------- + +**Indicator**: Number of Supervisors + +**Description**: This indicator is used to check the number of available Supervisors in a cluster. If the number of available Supervisors in a cluster is less than 1, the cluster is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, go to the Streaming service instance page and click the host name of the unavailable Supervisor instance. View the host health status in the **Overview** area. If the host health status is **Good**, rectify the fault by referring to ALM-12007 Process Faults. If the status is not **Good**, rectify the fault by referring to the handling procedure of ALM-12006 Node Faults. + +Number of Idle Slots +-------------------- + +**Indicator**: Number of Idle Slots + +**Description**: This indicator is used to check the number of idle slots in a cluster. If the number of idle slots in a cluster is less than 1, the cluster is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, go to the Storm service instance page and check the health status of the Supervisor instance. If the health status of all Supervisor instances is **Good**, you need to expand the capacity of the Core node in the cluster. If not, rectify the fault by referring to ALM-12007 Process Faults. 
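+The two indicators above can also be cross-checked from the command line. In open-source Storm, the Storm UI REST API reports the supervisor and slot counts in its cluster summary. The host and port below are placeholders; whether the API is reachable, and whether authentication is required, depends on the cluster configuration.
+
+.. code-block:: console
+
+   # Query the Storm UI cluster summary (replace <storm-ui-host>:<port> with the actual address).
+   curl -s http://<storm-ui-host>:<port>/api/v1/cluster/summary
+   # The returned JSON includes "supervisors" (working nodes) and "slotsFree" (idle slots).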
+ +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the Storm service status is normal. If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, rectify the fault by referring to the alarm "ALM-26051 Storm Service Unavailable". + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/viewing_and_exporting_a_health_check_report.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/viewing_and_exporting_a_health_check_report.rst new file mode 100644 index 0000000..844d18e --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/viewing_and_exporting_a_health_check_report.rst @@ -0,0 +1,37 @@ +:original_name: mrs_01_0275.html + +.. _mrs_01_0275: + +Viewing and Exporting a Health Check Report +=========================================== + +Scenario +-------- + +You can view the health check result in MRS Manager and export the health check results for further analysis. + +.. note:: + + A system health check includes MRS Manager, service-level, and host-level health checks: + + - MRS Manager health checks focus on whether the unified management platform can provide management functions. + - Service-level health checks focus on whether components can provide services properly. + - Host-level health checks focus on whether host indicators are normal. + + The system health check includes three types of check items: health status, related alarms, and customized monitoring indicators for each check object. The health check results are not always the same as the **Health Status** on the portal. + +Prerequisites +------------- + +You have performed a health check. + +Procedure +--------- + +#. Click **Services**. +#. Choose **More > View Cluster Health Check Report** to view the health check report of a cluster. +#. Click **Export Report** on the health check report pane to export the report and view detailed information about check items. + + .. note:: + + For details about how to rectify the faults of the check items, see :ref:`DBService Health Check Indicators ` to :ref:`ZooKeeper Health Check Indicators `. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/yarn_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/yarn_health_check_indicators.rst new file mode 100644 index 0000000..40f95ec --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/yarn_health_check_indicators.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0532.html + +.. _mrs_01_0532: + +Yarn Health Check Indicators +============================ + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the Yarn service status is normal. If the number of NodeManager nodes cannot be obtained, the system is unhealthy. 
+ +**Recovery Guide**: If this indicator is abnormal, you can handle the alarm by referring to the alarm handling guide and make sure that the network is normal. + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/zookeeper_health_check_indicators.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/zookeeper_health_check_indicators.rst new file mode 100644 index 0000000..9e8e597 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/health_check_management/zookeeper_health_check_indicators.rst @@ -0,0 +1,42 @@ +:original_name: mrs_01_0533.html + +.. _mrs_01_0533: + +ZooKeeper Health Check Indicators +================================= + +Average ZooKeeper Request Processing Latency +-------------------------------------------- + +**Indicator**: Average ZooKeeper Service Request Processing Latency + +**Description**: This indicator is used to check the average delay for the ZooKeeper service to process requests. If the average delay is greater than 300 ms, the ZooKeeper service is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, check whether the network speed of the cluster is normal and whether the memory or CPU usage is too high. + +ZooKeeper Connections Usage +--------------------------- + +**Indicator**: ZooKeeper Connections Usage + +**Description**: This indicator is used to check whether the ZooKeeper memory usage exceeds 80%. If the memory usage exceeds the threshold, the ZooKeeper service is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, you are advised to increase the memory available for the ZooKeeper service. The method of increasing the memory is as follows: Increase the value of **-Xmx** in the **GC_OPTS** configuration item in the ZooKeeper service. After the modification, restart the ZooKeeper service for the configuration to take effect. + +Service Health Status +--------------------- + +**Indicator**: Service Status + +**Description**: This indicator is used to check whether the ZooKeeper service status is normal. If the status is abnormal, the service is unhealthy. + +**Recovery Guide**: If the indicator is abnormal, check whether the KrbServer and LdapServer services are healthy. If they are not, rectify their faults. Then log in to the ZooKeeper client and check whether data can be written to ZooKeeper. If writing fails, find the failure cause based on the error message and handle the fault accordingly. Rectify the fault by following the procedure for handling ALM-13000. + +Alarm Check +----------- + +**Indicator**: Alarm Information + +**Description**: This indicator is used to check whether alarms exist. If alarms exist, the service is unhealthy. + +**Recovery Guide**: If this indicator is abnormal, you can rectify the fault by referring to the alarm handling guide. 
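+The latency and connection figures used by the indicators on this page can also be checked manually with ZooKeeper's four-letter-word commands. The following is only a sketch: it assumes network access to a ZooKeeper instance on the default client port (2181) and that the **stat** and **mntr** commands are enabled on the server.
+
+.. code-block:: console
+
+   # Print server statistics, including the average request latency and the current client connections.
+   echo stat | nc <zookeeper-instance-ip> 2181
+   # The mntr output lists the same data as key-value pairs, for example zk_avg_latency and zk_num_alive_connections.
+   echo mntr | nc <zookeeper-instance-ip> 2181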
diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/index.rst new file mode 100644 index 0000000..89ae4a2 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/index.rst @@ -0,0 +1,46 @@ +:original_name: mrs_01_0648.html + +.. _mrs_01_0648: + +MRS Manager Operation Guide (Applicable to 2.x and Earlier Versions) +==================================================================== + +- :ref:`Introduction to MRS Manager ` +- :ref:`Checking Running Tasks ` +- :ref:`Monitoring Management ` +- :ref:`Alarm Management ` +- :ref:`Alarm Reference (Applicable to Versions Earlier Than MRS 3.x) ` +- :ref:`Object Management ` +- :ref:`Log Management ` +- :ref:`Health Check Management ` +- :ref:`Static Service Pool Management ` +- :ref:`Tenant Management ` +- :ref:`Backup and Restoration ` +- :ref:`Security Management ` +- :ref:`Permissions Management ` +- :ref:`MRS Multi-User Permission Management ` +- :ref:`Patch Operation Guide ` +- :ref:`Restoring Patches for the Isolated Hosts ` +- :ref:`Rolling Restart ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + introduction_to_mrs_manager + checking_running_tasks + monitoring_management/index + alarm_management/index + alarm_reference_applicable_to_versions_earlier_than_mrs_3.x/index + object_management/index + log_management/index + health_check_management/index + static_service_pool_management/index + tenant_management/index + backup_and_restoration/index + security_management/index + permissions_management/index + mrs_multi-user_permission_management/index + patch_operation_guide/index + restoring_patches_for_the_isolated_hosts + rolling_restart diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/introduction_to_mrs_manager.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/introduction_to_mrs_manager.rst new file mode 100644 index 0000000..addd72d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/introduction_to_mrs_manager.rst @@ -0,0 +1,103 @@ +:original_name: mrs_01_0101.html + +.. _mrs_01_0101: + +Introduction to MRS Manager +=========================== + +Overview +-------- + +MRS manages and analyzes massive data and helps you rapidly obtain desired data from structured and unstructured data. The structure of open-source components is complex. The installation, configuration, and management processes are time- and labor-consuming. MRS Manager is a unified enterprise-level cluster management platform and provides the following functions: + +- Cluster monitoring enables you to quickly view the health status of hosts and services. +- Graphical metric monitoring and customization enable you to quickly obtain key information about the system. +- Service property configurations can meet service performance requirements. +- With cluster, service, and role instance functions, you can start or stop services and clusters in one click. + +Introduction to the MRS Manager GUI +----------------------------------- + +MRS Manager provides a unified cluster management platform, facilitating rapid and easy O&M for clusters. For details about how to access MRS Manager, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + +:ref:`Table 1 ` describes the functions of each operation entry. + +.. _mrs_01_0101__en-us_topic_0035209593_table13549662121428: + +.. 
table:: **Table 1** Functions of each entry on the operation bar + + +-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Function | + +===========+==================================================================================================================================================================================================================================================================================================================================+ + | Dashboard | Displays the status of all services, main monitoring indicators of each service, and host status in charts, such as bar charts, line charts, and tables. You can customize a dashboard for the key monitoring indicators and drag it to any position on the interface. The system dashboard page supports automatic data update. | + +-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Services | Provides the service monitoring, operation, and configuration guidance, which helps you manage services in a unified manner. | + +-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Hosts | Provides guidance on how to monitor, operate, and configure hosts, helping you manage hosts in a unified manner. | + +-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Alarms | Supports alarm query and provides guidance on alarm handling, helping you identify and rectify product faults and potential risks in a timely manner to ensure normal system operation. | + +-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Audit | Allows authorized users to query and export audit logs, helping you to view all user activities and operations. | + +-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant | Provides a unified tenant management platform. 
| + +-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | System | Provides monitoring, alarm configuration management, and backup management. | + +-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Go to the **System** tab page, and switch to another function pages through shortcuts. See :ref:`Table 2 `. + +The following is an example of quick redirection through shortcuts: + +#. On MRS Manager, click **System**. + +#. On the **System** tab page, click a function link. The function page is displayed. + + For example, in the **Backup and Restoration** area, click **Back Up Data**. The page for backing up data is displayed. + +#. Move the cursor to the left border of the browser window. The **System** black shortcut menu is displayed. After you move the cursor out of the menu, the menu is collapsed. + +#. In the shortcut menu that is displayed, you can click a function link to go to the corresponding function page. + + For example, choose **Maintenance > Export Log**. The page for exporting logs is displayed. + +.. _mrs_01_0101__en-us_topic_0035209593_table5212148312126: + +.. table:: **Table 2** Shortcut menus on the **System** tab page + + ====================== ======================================= + Menu Function Link + ====================== ======================================= + Backup and Restoration Back Up Data + \ Restore Data + Maintenance Export Log + \ Export Audit Log + \ Check Health Status + Monitoring and Alarm Configure Syslog + \ Configure Alarm Threshold + \ Configure SNMP + \ Configure Monitoring Metric Dump + \ Configure Resource Contribution Ranking + Permission Manage User + \ Manage User Group + \ Manage Role + \ Configure Password Policy + \ Change OMS Database Password + Patch Manage Patch + ====================== ======================================= + +Reference +--------- + +MapReduce Service (MRS) is a data analysis service on the public cloud. It is used to manage and analyze massive sets of data. + +MRS uses MRS Manager to manage big data components, such as components in the Hadoop ecosystem. Therefore, some concepts on the MRS Console on the public cloud must be different from those on MRS Manager. For details, see :ref:`Table 3 `. + +.. _mrs_01_0101__en-us_topic_0035209593_table39303837105524: + +.. 
table:: **Table 3** Difference Comparison + + +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+ + | Concept | Public Cloud MRS | MRS Manager | + +===================+=============================================================================================================================================================+====================================================================================+ + | MapReduce Service | Indicates the data analysis cloud service on the public cloud, called MRS. This service includes components such as Hive, Spark, Yarn, HDFS, and ZooKeeper. | Provides a unified management platform for big data components in tenant clusters. | + +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+ diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/about_logs.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/about_logs.rst new file mode 100644 index 0000000..b71601a --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/about_logs.rst @@ -0,0 +1,641 @@ +:original_name: mrs_01_1226.html + +.. _mrs_01_1226: + +About Logs +========== + +Log Description +--------------- + +MRS cluster logs are stored in the **/var/log/Bigdata** directory. The following table lists the log types. + +.. table:: **Table 1** Log types + + +------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Type | Description | + +==================+===================================================================================================================================================================================================+ + | Installation log | Installation logs record information about FusionInsight Manager, cluster, and service installation to help users locate installation errors. | + +------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Run logs | Run logs record the running track information, debugging information, status changes, potential problems, and error information generated during the running of services. | + +------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Audit logs | Audit logs record information about users' activities and operation instructions, which can be used to locate fault causes in security events and determine who are responsible for these faults. 
| + +------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +The following table lists the MRS log directories. + +.. table:: **Table 2** Log directories + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | File Directory | Log Content | + +===================================+======================================================================================================================================================================+ + | /var/log/Bigdata/audit | Component audit log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/controller | Log collecting script log. | + | | | + | | Controller process log. | + | | | + | | Controller monitoring log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/dbservice | DBService log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/flume | Flume log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/hbase | HBase log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/hdfs | HDFS log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/hive | Hive log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/httpd | HTTPD log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/hue | Hue log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/kerberos | Kerberos log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/ldapclient | LDAP client log. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/ldapserver | LDAP server log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/loader | Loader log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/logman | logman script log management log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/mapreduce | MapReduce log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/nodeagent | NodeAgent log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/okerberos | OMS Kerberos log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/oldapserver | OMS LDAP log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/omm | **oms**: complex event processing log, alarm service log, HA log, authentication and authorization management log, and monitoring service run log of the omm server. | + | | | + | | **oma**: installation log and run log of the omm agent. | + | | | + | | **core**: dump log generated when the omm agent and the HA process are suspended. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/spark | Spark log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/sudo | Log generated when the **sudo** command is executed by user **omm**. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/timestamp | Time synchronization management log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/tomcat | Tomcat log. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/yarn | Yarn log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/zookeeper | ZooKeeper log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/kafka | Kafka log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/storm | Storm log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | /var/log/Bigdata/patch | Patch log. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Run logs +-------- + +:ref:`Table 3 ` describes the running information recorded in run logs. + +.. _mrs_01_1226__t6d0bc48e23fc402ba1643b1dcd9f77c4: + +.. table:: **Table 3** Running information + + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Run Log | Description | + +=================================+===========================================================================================================================================================================+ + | Installation preparation log | Records information about preparations for the installation, such as the detection, configuration, and feedback operation information. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Process startup log | Records information about the commands executed during the process startup. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Process startup exception log | Records information about exceptions during process startup, such as dependent service errors and insufficient resources. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Process run log | Records information about the process running track information and debugging information, such as function entries and exits as well as cross-module interface messages. 
| + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Process running exception log | Records errors that cause process running errors, for example, the empty input objects or encoding or decoding failure. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Process running environment log | Records information about the process running environment, such as resource status and environment variables. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Script logs | Records information about the script execution process. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Resource reclamation log | Records information about the resource reclaiming process. | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Uninstallation clearing logs | Records information about operations performed during service uninstallation, such as directory deletion and execution time | + +---------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Audit logs +---------- + +Audit information recorded in audit logs includes FusionInsight Manager audit information and component audit information. + +.. 
table:: **Table 4** Audit information of FusionInsight Manager + + +-----------------------+------------------------+-------------------------------------------------------------+ + | Audit Log | Operation Type | Operation | + +=======================+========================+=============================================================+ + | Manager audit log | User management | Creating a user | + | | | | + | | | Modifying a user | + | | | | + | | | Deleting a user | + | | | | + | | | Creating a user group | + | | | | + | | | Modifying a user group | + | | | | + | | | Deleting a user group | + | | | | + | | | Adding a role | + | | | | + | | | Modifying a role | + | | | | + | | | Deleting a role | + | | | | + | | | Changing a password policy | + | | | | + | | | Changing a password | + | | | | + | | | Resetting a password | + | | | | + | | | User login | + | | | | + | | | User logout | + | | | | + | | | Unlocking the screen | + | | | | + | | | Downloading the authentication credential | + | | | | + | | | Unauthorized operation | + | | | | + | | | Unlocking a user account | + | | | | + | | | Locking a user account | + | | | | + | | | Locking the screen | + | | | | + | | | Exporting user information | + | | | | + | | | Exporting a user group | + | | | | + | | | Exporting a role | + +-----------------------+------------------------+-------------------------------------------------------------+ + | | Tenant management | Saving the static configuration | + | | | | + | | | Adding a tenant | + | | | | + | | | Deleting a tenant | + | | | | + | | | Associating a service with a tenant | + | | | | + | | | Deleting a service from a tenant | + | | | | + | | | Configuring resources | + | | | | + | | | Creating resources | + | | | | + | | | Deleting resources | + | | | | + | | | Adding a resource pool | + | | | | + | | | Modifying a resource pool | + | | | | + | | | Deleting a resource pool | + | | | | + | | | Restoring tenant data | + +-----------------------+------------------------+-------------------------------------------------------------+ + | | Cluster management | Starting a cluster | + | | | | + | | | Stopping a cluster | + | | | | + | | | Saving configurations | + | | | | + | | | Synchronizing cluster configurations | + | | | | + | | | Customizing cluster monitoring indicators | + | | | | + | | | Saving monitoring thresholds | + | | | | + | | | Downloading a client configuration file | + | | | | + | | | Configuring the northbound API | + | | | | + | | | Configuring the northbound SNMP API | + | | | | + | | | Creating a threshold template | + | | | | + | | | Deleting a threshold template | + | | | | + | | | Applying a threshold template | + | | | | + | | | Saving cluster monitoring configuration data | + | | | | + | | | Exporting configuration data | + | | | | + | | | Importing cluster configuration data | + | | | | + | | | Exporting an installation template | + | | | | + | | | Modifying a threshold template | + | | | | + | | | Canceling the application of a threshold template | + | | | | + | | | Masking alarms | + | | | | + | | | Sending an alarm | + | | | | + | | | Changing the OMS database password | + | | | | + | | | Changing the component database password | + | | | | + | | | Starting the health check of a cluster | + | | | | + | | | Updating the health check configuration | + | | | | + | | | Exporting cluster health check results | + | | | | + | | | Importing a certificate file | + | | | | + | | | Deleting historical health check reports | + | | | | + | | | Exporting historical health 
check reports | + | | | | + | | | Customizing report monitoring indicators | + | | | | + | | | Exporting report monitoring data | + | | | | + | | | Customizing monitoring indicators for static resource pools | + | | | | + | | | Exporting monitoring data of a static resource pool | + +-----------------------+------------------------+-------------------------------------------------------------+ + | | Service management | Starting a service | + | | | | + | | | Stopping a service | + | | | | + | | | Synchronizing service configurations | + | | | | + | | | Refreshing a service queue | + | | | | + | | | Customizing service monitoring indicators | + | | | | + | | | Restarting a service | + | | | | + | | | Exporting service monitoring data | + | | | | + | | | Importing service configuration data | + | | | | + | | | Starting the health check of a service | + | | | | + | | | Exporting service health check results | + | | | | + | | | Configuring the service | + | | | | + | | | Uploading a configuration file | + | | | | + | | | Downloading a configuration file | + +-----------------------+------------------------+-------------------------------------------------------------+ + | | Instance management | Synchronizing instance configurations | + | | | | + | | | Commissioning an instance | + | | | | + | | | Decommissioning an instance | + | | | | + | | | Starting an instance | + | | | | + | | | Stopping an instance | + | | | | + | | | Customizing instance monitoring indicators | + | | | | + | | | Restarting an instance | + | | | | + | | | Exporting instance monitoring data | + | | | | + | | | Importing instance configuration data | + +-----------------------+------------------------+-------------------------------------------------------------+ + | | Host management | Setting a node rack | + | | | | + | | | Starting all roles | + | | | | + | | | Stopping all roles | + | | | | + | | | Isolating a host | + | | | | + | | | Canceling host isolation | + | | | | + | | | Customizing host monitoring indicators | + | | | | + | | | Exporting host monitoring data | + | | | | + | | | Starting the health check of a host | + | | | | + | | | Exporting the health check result of a host | + +-----------------------+------------------------+-------------------------------------------------------------+ + | | Maintenance management | Exporting alarms | + | | | | + | | | Clearing alarms | + | | | | + | | | Exporting events | + | | | | + | | | Clearing alarms in batches | + | | | | + | | | Clearing alarm through SNMP | + | | | | + | | | Adding a trap target through SNMP | + | | | | + | | | Deleting a trap target through SNMP | + | | | | + | | | Checking alarms through SNMP | + | | | | + | | | Synchronizing alarms through SNMP | + | | | | + | | | Modifying audit dump configurations | + | | | | + | | | Exporting audit logs | + | | | | + | | | Collecting log files | + | | | | + | | | Downloading log files | + | | | | + | | | Uploading a file | + | | | | + | | | Deleting an uploaded file | + | | | | + | | | Creating a backup task | + | | | | + | | | Executing a backup task | + | | | | + | | | Stopping a backup task | + | | | | + | | | Deleting a backup task | + | | | | + | | | Modifying a backup task | + | | | | + | | | Locking a backup task | + | | | | + | | | Unlocking a backup task | + | | | | + | | | Creating a restoration task | + | | | | + | | | Executing a backup restoration task | + | | | | + | | | Stopping a restoration task | + | | | | + | | | Retrying a restoration task | + | | | | + | | | Deleting a restoration task 
| + +-----------------------+------------------------+-------------------------------------------------------------+ + +.. table:: **Table 5** Component audit information + + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | Audit Log | Operation Type | Operation | + +=======================+============================================+================================================================================================+ + | DBService audit log | Maintenance management | Performing backup restoration operations | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | HBase audit log | Data definition language (DDL) statement | Creating a table | + | | | | + | | | Deleting a table | + | | | | + | | | Modifying a table | + | | | | + | | | Adding a column family | + | | | | + | | | Modifying a column family | + | | | | + | | | Deleting a column family | + | | | | + | | | Enabling a table | + | | | | + | | | Disabling a table | + | | | | + | | | Modify the user information | + | | | | + | | | Changing a password | + | | | | + | | | User login | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | Data manipulation language (DML) statement | Putting data (to the **hbase:meta**, **\_ctmeta\_**, and **hbase:acl** tables) | + | | | | + | | | Deleting data (from the **hbase:meta**, **\_ctmeta\_**, and **hbase:acl** tables) | + | | | | + | | | Checking and putting data (to the **hbase:meta**, **\_ctmeta\_**, and **hbase:acl** tables) | + | | | | + | | | Checking and deleting data (from the **hbase:meta**, **\_ctmeta\_**, and **hbase:acl** tables) | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | Permission control | Assigning permissions to a user | + | | | | + | | | Canceling permission assigning | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | Hive audit logs | Metadata operation | Defining metadata, such as creating databases and tables | + | | | | + | | | Deleting metadata, such as deleting databases and tables | + | | | | + | | | Modifying metadata, such as adding columns and renaming tables | + | | | | + | | | Importing and exporting metadata | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | Data maintenance | Loading data to a table | + | | | | + | | | Inserting data into a table | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | Permissions management | Creating or deleting roles | + | | | | + | | | Granting/Reclaiming roles | + | | | | + | | | Granting/Reclaiming permissions | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | HDFS audit log | Permissions management | Managing permissions on files or 
folders | + | | | | + | | | Managing permissions on owner information files or folders | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | File operation | Creating a folder | + | | | | + | | | Creating a file | + | | | | + | | | Opening a file | + | | | | + | | | Appending file content | + | | | | + | | | Changing a file name | + | | | | + | | | Deleting a file or folder | + | | | | + | | | Setting time property of a file | + | | | | + | | | Setting the number of file copies | + | | | | + | | | Merging files | + | | | | + | | | Checking the file system | + | | | | + | | | File links | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | MapReduce audit log | Application running | Starting a Container request | + | | | | + | | | Stopping a Container request | + | | | | + | | | After Container request is completed, the status of the request is displayed as succeeded. | + | | | | + | | | After Container request is completed, the status of the request is displayed as failed. | + | | | | + | | | After Container request is completed, the status of the request is displayed as suspended. | + | | | | + | | | Submitting a task | + | | | | + | | | Ending a task | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | LdapServer audit log | Maintenance management | Adding an operating system user | + | | | | + | | | Adding a user group | + | | | | + | | | Adding a user to user group | + | | | | + | | | Deleting a user | + | | | | + | | | Deleting a group | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | KrbServer audit log | Maintenance management | Changing the password of a Kerberos account | + | | | | + | | | Adding a Kerberos account | + | | | | + | | | Deleting a Kerberos account | + | | | | + | | | Authenticating a user | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | Loader audit log | Security management | User login | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | Metadata management | Querying connector information | + | | | | + | | | Querying a framework | + | | | | + | | | Querying step information | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | Managing data source connections | Querying a data source connection | + | | | | + | | | Adding a data source connection | + | | | | + | | | Updating a data source connection | + | | | | + | | | Deleting a data source connection | + | | | | + | | | Activating a data source connection | + | | | | + | | | Disabling a data source connection | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | Job management | Querying a job | + | | | | + | | | Creating a Job | + | | 
| | + | | | Updating a Job | + | | | | + | | | Deleting a job | + | | | | + | | | Activating a job | + | | | | + | | | Disabling a job | + | | | | + | | | Querying all execution records of a job | + | | | | + | | | Querying the latest execution record of a job | + | | | | + | | | Submitting a job | + | | | | + | | | Stopping a job | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | Hue audit log | Service startup | Starting Hue | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | User operation | User login | + | | | | + | | | User logout | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | Task operation | Creating a job | + | | | | + | | | Modifying a job | + | | | | + | | | Deleting a job | + | | | | + | | | Submitting a task | + | | | | + | | | Saving a task | + | | | | + | | | Updating the status of a task | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | ZooKeeper audit log | Permissions management | Setting the access permission to Znode | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | Znode operation | Creating a Znode | + | | | | + | | | Deleting a Znode | + | | | | + | | | Configuring Znode data | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | Storm audit log | Nimbus | Submitting a topology | + | | | | + | | | Stopping a topology | + | | | | + | | | Reallocating a topology | + | | | | + | | | Deactivating a topology | + | | | | + | | | Activating a topology | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + | | UI | Stopping a topology | + | | | | + | | | Reallocating a topology | + | | | | + | | | Deactivating a topology | + | | | | + | | | Activating a topology | + +-----------------------+--------------------------------------------+------------------------------------------------------------------------------------------------+ + +MRS audit logs are stored in the database. You can view and export audit logs on the **Audit** page. + +The following table lists the directories to store component audit logs. Audit log files of some components are stored in **/var/log/Bigdata/audit**, such as HDFS, HBase, MapReduce, Hive, Hue, Yarn, Storm, and ZooKeeper. The component audit logs are automatically compressed and backed up to **/var/log/Bigdata/audit/bk** at 03: 00 every day. A maximum of latest 90 compressed backup files are retained, and the backup time cannot be changed. + +Audit log files of other components are stored in the component log directory. + +.. 
table:: **Table 6** Directory for storing component audit logs + + +-----------------------------------+-------------------------------------------------------------------------+ + | Component | Audit Log Directory | + +===================================+=========================================================================+ + | DBService | /var/log/Bigdata/audit/dbservice/dbservice_audit.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | HDFS | /var/log/Bigdata/audit/hdfs/nn/hdfs-audit-namenode.log | + | | | + | | /var/log/Bigdata/audit/hdfs/dn/hdfs-audit-datanode.log | + | | | + | | /var/log/Bigdata/audit/hdfs/jn/hdfs-audit-journalnode.log | + | | | + | | /var/log/Bigdata/audit/hdfs/zkfc/hdfs-audit-zkfc.log | + | | | + | | /var/log/Bigdata/audit/hdfs/httpfs/hdfs-audit-httpfs.log | + | | | + | | /var/log/Bigdata/audit/hdfs/router/hdfs-audit-router.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | MapReduce | /var/log/Bigdata/audit/mapreduce/jobhistory/mapred-audit-jobhistory.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Hive | /var/log/Bigdata/audit/hive/hiveserver/hive-audit.log | + | | | + | | /var/log/Bigdata/audit/hive/metastore/metastore-audit.log | + | | | + | | /var/log/Bigdata/audit/hive/webhcat/webhcat-audit.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Loader | /var/log/Bigdata/loader/audit/default.audit | + +-----------------------------------+-------------------------------------------------------------------------+ + | Hue | /var/log/Bigdata/audit/hue/hue-audits.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | ZooKeeper | /var/log/Bigdata/audit/zookeeper/quorumpeer/zk-audit-quorumpeer.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Spark | /var/log/Bigdata/audit/spark/jdbcserver/jdbcserver-audit.log | + | | | + | | /var/log/Bigdata/audit/spark/jobhistory/jobhistory-audit.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Yarn | /var/log/Bigdata/audit/yarn/rm/yarn-audit-resourcemanager.log | + | | | + | | /var/log/Bigdata/audit/yarn/nm/yarn-audit-nodemanager.log | + +-----------------------------------+-------------------------------------------------------------------------+ + | Storm | /var/log/Bigdata/audit/storm/nimbus/audit.log | + | | | + | | /var/log/Bigdata/audit/storm/ui/audit.log | + +-----------------------------------+-------------------------------------------------------------------------+ diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/configuring_audit_log_exporting_parameters.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/configuring_audit_log_exporting_parameters.rst new file mode 100644 index 0000000..fde10ec --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/configuring_audit_log_exporting_parameters.rst @@ -0,0 +1,63 @@ +:original_name: mrs_01_0270.html + +.. 
_mrs_01_0270: + +Configuring Audit Log Exporting Parameters +========================================== + +Scenario +-------- + +If MRS audit logs are stored in the system for a long time, the disk space of the data directory may be insufficient. Therefore, you can set export parameters to automatically export audit logs to a specified directory on the OBS server timely, facilitating audit log management. + +.. note:: + + Audit logs exported to the OBS server include service audit logs and management audit logs. + + - Service audit logs are automatically compressed and stored in the **/var/log/Bigdata/audit/bk/** directory on the active management node at 03:00 every day. The file name format is <*yyyy-MM-dd_HH-mm-ss*>\ **.tar.gz**. By default, a maximum of seven log files can be stored. If more than seven log files are stored, the system automatically deletes the log files generated seven days ago. + - The data range of management audit logs exported to OBS each time is from the last date when the logs are successfully exported to OBS to the date when the task is executed. When the number of management audit logs reaches 100,000, the system automatically dumps the first 90,000 audit logs to a local file and retains 10,000 audit logs in the database. The dumped log files are saved in the **${BIGDATA_DATA_HOME}/dbdata_om/dumpData/iam/operatelog** directory on the active management node. The file name format is **OperateLog_store**\ *\_YY_MM_DD_HH_MM_SS*\ **.csv**. A maximum of 50 historical audit log files can be saved. + +Prerequisites +------------- + +- You have obtained the access key ID (AK) and secret access key (SK) of the account. +- A parallel file system has been created in OBS. + +Procedure +--------- + +#. On MRS Manager, click **System**. +#. Choose **Export Audit Log** under **Maintenance**. + + .. table:: **Table 1** Parameters for exporting audit logs + + +-----------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------+ + | Parameter | Value | Description | + +=======================+===========================================+====================================================================================================+ + | Export Audit Log | - |image1| | (Mandatory) Specifies whether to enable the audit log export function. | + | | - |image2| | | + | | | - |image3|: enables audit log exporting. | + | | | | + | | | - |image4|: disables audit log exporting. | + +-----------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------+ + | Start Time | 7/24/2017 09:00:00 (example value) | (Mandatory) Specifies the start time for exporting audit logs. | + +-----------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------+ + | Period (days) | 1 day (example value) | (Mandatory) Specifies the interval for exporting audit logs. The interval ranges from 1 to 5 days. | + +-----------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------+ + | Bucket | mrs-bucket (example value) | (Mandatory) Specifies the name of the OBS file system to which audit logs are exported. 
| + +-----------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------+ + | OBS path | **/opt/omm/oms/auditLog** (example value) | (Mandatory) Specifies the OBS path to which audit logs are exported. | + +-----------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------+ + | AK | *XXX* (example value) | (Mandatory) Specifies the user's access key ID. | + +-----------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------+ + | SK | *XXX* (example value) | (Mandatory) Specifies the user's secret access key. | + +-----------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------+ + + .. note:: + + Audit logs are stored in **service_auditlog** and **manager_auditlog** on OBS, which are used to store service audit logs and management audit logs, respectively. + +.. |image1| image:: /_static/images/en-us_image_0000001349257205.png +.. |image2| image:: /_static/images/en-us_image_0000001295738112.png +.. |image3| image:: /_static/images/en-us_image_0000001349137625.png +.. |image4| image:: /_static/images/en-us_image_0000001296217548.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/exporting_service_logs.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/exporting_service_logs.rst new file mode 100644 index 0000000..da71d68 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/exporting_service_logs.rst @@ -0,0 +1,45 @@ +:original_name: mrs_01_0267.html + +.. _mrs_01_0267: + +Exporting Service Logs +====================== + +Scenario +-------- + +This section describes how to export logs generated by each service role from MRS Manager. + +Prerequisites +------------- + +- You have obtained the access key ID (AK) and secret access key (SK) of the account. +- A parallel file system has been created in OBS. + +Procedure +--------- + +#. On MRS Manager, click **System**. + +#. Click **Export Log** under **Maintenance**. + +#. Set a service for **Service**. Set **Host** to the IP address of the host where the service is deployed. Select the corresponding time for **Start Time** and **End Time**. + +#. In **Export To**, select a path for saving logs. This parameter is available only for clusters with Kerberos authentication enabled. + + - **Local PC**: indicates that logs are saved to the local environment. Then go to :ref:`8 `. + - **OBS**: indicates that logs are saved to OBS. This is the default option. Then go to :ref:`5 `. + +#. .. _mrs_01_0267__en-us_topic_0035209626_li22688946162748: + + Set **OBS Path** to the path for storing service logs on OBS. + + The value must be a complete path and cannot start with a slash (**/**). The path can be nonexistent and will be automatically created by the system. The full path of OBS can contain a maximum of 900 bytes. + +#. In **Bucket**, enter the name of the created OBS file system. + +#. Set **AK** and **SK** to the access key ID and secret access key of the user. + +#. .. _mrs_01_0267__en-us_topic_0035209626_li58318105171043: + + Click **OK**. 
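+
+The OBS path rules above (a complete path, no leading slash, at most 900 bytes, and a path that may be created automatically if it does not exist) can be checked locally before you click **OK**. The following is a minimal sketch using only the Python standard library; the function name ``check_obs_export_path`` and the sample path are illustrative only and are not part of MRS Manager.
+
+.. code-block:: python
+
+   def check_obs_export_path(path: str) -> str:
+       """Validate an OBS path against the rules described in this section (sketch only)."""
+       if path.startswith("/"):
+           raise ValueError("The OBS path cannot start with a slash (/).")
+       if len(path.encode("utf-8")) > 900:
+           raise ValueError("The full OBS path can contain a maximum of 900 bytes.")
+       # The path does not need to exist in advance; it is created automatically on export.
+       return path
+
+   # Example with a hypothetical path:
+   check_obs_export_path("mrs/service-logs/2017-07-24")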
diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/index.rst new file mode 100644 index 0000000..d74ef02 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/index.rst @@ -0,0 +1,22 @@ +:original_name: mrs_01_0264.html + +.. _mrs_01_0264: + +Log Management +============== + +- :ref:`About Logs ` +- :ref:`Manager Log List ` +- :ref:`Viewing and Exporting Audit Logs ` +- :ref:`Exporting Service Logs ` +- :ref:`Configuring Audit Log Exporting Parameters ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + about_logs + manager_log_list + viewing_and_exporting_audit_logs + exporting_service_logs + configuring_audit_log_exporting_parameters diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/manager_log_list.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/manager_log_list.rst new file mode 100644 index 0000000..8194416 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/manager_log_list.rst @@ -0,0 +1,367 @@ +:original_name: mrs_01_1227.html + +.. _mrs_01_1227: + +Manager Log List +================ + +Log Description +--------------- + +**Log path**: The default storage path of Manager log files is **/var/log/Bigdata/**\ *Manager component*. + +- ControllerService: **/var/log/Bigdata/controller/** (operation & maintenance system (OMS) installation and run logs) +- Httpd: **/var/log/Bigdata/httpd** (httpd installation and run logs) +- logman: **/var/log/Bigdata/logman** (log packaging tool logs) +- NodeAgent: **/var/log/Bigdata/nodeagent** (NodeAgent installation and run logs) +- okerberos: **/var/log/Bigdata/okerberos** (okerberos installation and run logs) +- oldapserver: **/var/log/Bigdata/oldapserver** (oldapserver installation and run logs) +- MetricAgent: **/var/log/Bigdata/metric_agent** (MetricAgent run logs) +- omm: **/var/log/Bigdata/omm** (omm installation and run logs) +- timestamp: **/var/log/Bigdata/timestamp** (NodeAgent startup time logs) +- tomcat: **/var/log/Bigdata/tomcat** (Web process logs) +- Patch: **/var/log/Bigdata/patch** (patch installation log) +- Sudo: **/var/log/Bigdata/sudo** (sudo script execution log) +- OS: **/var/log/**\ *message file* (OS system log) +- OS Performance: **/var/log/osperf** (OS performance statistics log) +- OS Statistics: **/var/log/osinfo/statistics** (OS parameter configuration log) + +**Log archiving rule**: + +The automatic compression and archiving function is enabled for Manager logs. By default, when the size of a log file exceeds 10 MB, the log file is automatically compressed. A compressed log file is named in the following format: <*Original log name*>-<*yyyy-mm-dd_hh-mm-ss*>.[*ID*].\ **log.zip**. A maximum of the 20 latest compressed files are retained. + +.. 
table:: **Table 1** Manager logs + + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | Type | Log File Name | Description | + +====================+============================================================================================================================+=====================================================================================================================================+ + | Controller run log | controller.log | Log that records component installation, upgrade, patch installation, configuration, monitoring, alarms, and routine O&M operations | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | controller_client.log | Run log of the Representational State Transfer (REST) API | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | acs.log | ACS run log file | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | acs_spnego.log | spnego user log in ACS | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | aos.log | AOS run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | plugin.log | AOS plug-in log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | backupplugin.log | Log that records the backup and restoration operations | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | controller_config.log | Configuration run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | controller_nodesetup.log | Controller loading task log | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | controller_root.log | System log of the Controller process | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | controller_trace.log | Log that records the remote procedure call (RPC) communication between Controller and NodeAgent | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | controller_monitor.log | Monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | controller_fsm.log | State machine log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | controller_alarm.log | Controller alarm log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | controller_backup.log | Controller backup and recovery log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | install.log, distributeAdapterFiles.log, install_os_optimization.log | OMS installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | oms_ctl.log | OMS startup and stop log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | installntp.log | NTP installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | modify_manager_param.log | Manager parameter modification log | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | backup.log | OMS backup script run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | supressionAlarm.log | Alarm script run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | om.log | OM certificate generation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | backupplugin_ctl.log | Startup log of the backup and restoration plug-in process | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | getLogs.log | Run log of the collection log script | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | backupAuditLogs.log | Run log of the audit log backup script | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | certStatus.log | Log that records regular certificate checks | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | distribute.log | Certificate distribution log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ficertgenetrate.log | Certificate replacement logs, including logs of level-2 certificates, CAS certificates, and httpd certificates | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | genPwFile.log | Log that records the generation of certificate password files | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | modifyproxyconf.log | Log that records the modification of the HTTPD proxy configuration | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | importTar.log | Log that records the process of importing certificates into the trust library | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | Httpd | install.log | Httpd installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | access_log, error_log | Httpd run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | logman | logman.log | Log packaging tool log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | NodeAgent | install.log, install_os_optimization.log | NodeAgent installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | installntp.log | NTP installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | start_ntp.log | NTP startup log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ntpChecker.log | NTP check log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ntpMonitor.log | NTP monitoring log | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | heartbeat_trace.log | Log that records heartbeats between NodeAgent and Controller | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | alarm.log | Alarm log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | monitor.log | Monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | nodeagent_ctl.log, start-agent.log | NodeAgent startup log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | agent.log | NodeAgent run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | cert.log | Certificate log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | agentplugin.log | Agent plug-in running status monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | omaplugin.log | OMA plug-in run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | diskhealth.log | Disk health check log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | supressionAlarm.log | Alarm script run log | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | updateHostFile.log | Host list update log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | collectLog.log | Run log of the node log collection script | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | host_metric_collect.log | Host index collection run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | checkfileconfig.log | Run log file of file permission check | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | entropycheck.log | Entropy check run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | timer.log | Log of periodic node scheduling | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | pluginmonitor.log | Component monitoring plug-in log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | agent_alarm_py.log | Log that records alarms upon insufficient NodeAgent file permission | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | okerberos | addRealm.log, modifyKerberosRealm.log | Domain handover log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | checkservice_detail.log | Okerberos health check log | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | genKeytab.log | keytab generation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | KerberosAdmin_genConfigDetail.log | Run log that records the generation of kadmin.conf when starting the kadmin process | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | KerberosServer_genConfigDetail.log | Run log that records the generation of krb5kdc.conf when starting the krb5kdc process | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | oms-kadmind.log | Run log of the kadmin process | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | oms_kerberos_install.log, postinstall_detail.log | Okerberos installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | oms-krb5kdc.log | Run log of the krbkdc process | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | start_detail.log | Okerberos startup log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | realmDataConfigProcess.log | Log rollback for domain handover failure | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | stop_detail.log | Okerberos stop log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | oldapserver | ldapserver_backup.log | 
Oldapserver backup log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ldapserver_chk_service.log | Oldapserver health check log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ldapserver_install.log | Oldapserver installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ldapserver_start.log | Oldapserver startup log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ldapserver_status.log | Log that records the status of the Oldapserver process | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ldapserver_stop.log | Oldapserver stop log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ldapserver_wrap.log | Oldapserver service management log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ldapserver_uninstall.log | Oldapserver uninstallation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | restart_service.log | Oldapserver restart log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ldapserver_unlockUser.log | Log that records information about unlocking LDAP users and managing accounts | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | omm | omsconfig.log | OMS configuration log | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | check_oms_heartbeat.log | OMS heartbeat log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | monitor.log | OMS monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ha_monitor.log | HA_Monitor operation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ha.log | HA operation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | fms.log | Alarm log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | fms_ha.log | HA alarm monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | fms_script.log | Alarm control log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | config.log | Alarm configuration log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | iam.log | IAM log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | iam_script.log | IAM control log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | iam_ha.log | IAM HA 
monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | config.log | IAM configuration log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | operatelog.log | IAM operation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | heartbeatcheck_ha.log | OMS heartbeat HA monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | install_oms.log | OMS installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | pms_ha.log | HA monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | pms_script.log | Monitoring control log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | config.log | Monitoring configuration log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | plugin.log | Monitoring plug-in run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | pms.log | Monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ha.log | HA run log | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | cep_ha.log | CEP HA monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | cep_script.log | CEP control log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | cep.log | CEP log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | config.log | CEP configuration log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | omm_gaussdba.log | GaussDB HA monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | gaussdb-.log | GaussDB run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | gs_ctl-.log | GaussDB control log archive log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | gs_ctl-current.log | GaussDB control log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | gs_guc-current.log | GaussDB operation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | encrypt.log | Omm encryption log | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | omm_agent_ctl.log | OMA control log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | oma_monitor.log | OMA monitoring log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | install_oma.log | OMA installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | config_oma.log | OMA configuration log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | omm_agent.log | OMA run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | acs.log | ACS resource log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | aos.log | AOS resource log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | controller.log | Controller resource log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | feed_watchdog.log | feed_watchdog resource log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | floatip.log | Floating IP address resource log | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | ha_ntp.log | NTP resource log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | httpd.log | Httpd resource log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | okerberos.log | Okerberos resource log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | oldap.log | OLdap resource log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | tomcat.log | Tomcat resource log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | send_alarm.log | Run log of the HA alarm sending script of the management node | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | timestamp | restart_stamp | NodeAgent start time log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | tomcat | cas.log, localhost_access_cas_log.log | CAS run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | catalina.log, catalina.out, host-manager.log, localhost.log, manager.log | Tomcat run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | localhost_access_web_log.log | Log that records the access to REST APIs of FusionInsight Manager | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | web.log | Run log of the web process | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | northbound_ftp_sftp.log, snmp.log | Northbound log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | watchdog | watchdog.log, feed_watchdog.log | watchdog run log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | patch | oms_installPatch.log | OMS patch installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | agent_installPatch.log | Agent patch installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | agent_uninstallPatch.log | Agent patch uninstallation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | NODE_AGENT_restoreFile.log | Agent patch restoration log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | NODE_AGENT_updateFile.log | Agent patch update log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | OMA_restoreFile.log | OMA patch restoration file log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | OMA_updateFile.log | OMA patch update file log | + 
+--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | CONTROLLER_restoreFile.log | CONTROLLER patch restoration file log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | CONTROLLER_updateFile.log | CONTROLLER patch update file log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | OMS_restoreFile.log | OMS patch restoration file log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | oms_uninstallPatch.log | OMS patch uninstallation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | OMS_updateFile.log | OMS patch update file log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | | createStackConf.log, decompress.log, decompress_OMS.log, distrExtractPatchOnOMS.log, slimReduction.log, switch_adapter.log | Patch installation log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | sudo | sudo.log | Sudo script execution log | + +--------------------+----------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + +Log Levels +---------- + +:ref:`Table 2 ` describes the log levels provided by Manager. The priorities of log levels are FATAL, ERROR, WARN, INFO, and DEBUG in descending order. Logs whose levels are higher than or equal to the specified level are printed. The number of printed logs decreases as the specified log level increases. + +.. _mrs_01_1227__tce0bb52db5fc4d53a43987beff277cb7: + +.. 
table:: **Table 2** Log levels + + +-------+----------------------------------------------------------------------------------------------------------------------------------+ + | Level | Description | + +=======+==================================================================================================================================+ + | FATAL | Logs of this level record fatal error information about the current event processing that may result in a system crash. | + +-------+----------------------------------------------------------------------------------------------------------------------------------+ + | ERROR | Logs of this level record error information about the current event processing, which indicates that system running is abnormal. | + +-------+----------------------------------------------------------------------------------------------------------------------------------+ + | WARN | Abnormal information about the current event processing. These abnormalities will not result in system faults. | + +-------+----------------------------------------------------------------------------------------------------------------------------------+ + | INFO | Normal running status information about the system and events. | + +-------+----------------------------------------------------------------------------------------------------------------------------------+ + | DEBUG | Logs of this level record the system information and system debugging information. | + +-------+----------------------------------------------------------------------------------------------------------------------------------+ + +Log Formats +----------- + +The following table lists the Manager log formats. + +.. table:: **Table 3** Log formats + + +------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Type | Component | Format | Example | + +====================================================================================+====================================================================================+========================================================================================================================================================+=============================================================================================================================================================================+ + | Controller, Httpd, logman, NodeAgent, okerberos, oldapserver, omm, tomcat, upgrade | Controller, Httpd, logman, NodeAgent, okerberos, oldapserver, omm, tomcat, upgrade | <*yyyy-MM-dd HH:mm:ss,SSS*>|<*Log level*>|<*Name of the thread that generates the log*>|<*Message in the log*>|<*Location where the log event occurs*> | 2015-06-30 00:37:09,067 INFO [pool-1-thread-1] Completed Discovering Node. 
com.XXX.hadoop.om.controller.tasks.nodesetup.DiscoverNodeTask.execute(DiscoverNodeTask.java:299) | + +------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/viewing_and_exporting_audit_logs.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/viewing_and_exporting_audit_logs.rst new file mode 100644 index 0000000..18e113b --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/log_management/viewing_and_exporting_audit_logs.rst @@ -0,0 +1,54 @@ +:original_name: mrs_01_0265.html + +.. _mrs_01_0265: + +Viewing and Exporting Audit Logs +================================ + +Scenario +-------- + +This section describes how to view and export audit logs on MRS Manager. The audit logs can be used to trace security events, locate fault causes, and determine responsibilities. + +The system record the following log information: + +- User activity information, such as user login and logout, system user information modification, and system user group information modification +- User operation instruction information, such as cluster startup, stop, and software upgrade. + +Procedure +--------- + +- Viewing audit logs + + #. On MRS Manager, click **Audit** to view the default audit logs. + + If the audit content of an audit log contains more than 256 characters, click the expand button of the audit log to expand the audit details. Click **Log File** to download the complete file and view the information. + + - By default, records are sorted in descending order by the **Occurred** column. You can click **Operation Type**, **Severity**, **Occurred**, **User**, **Host**, **Service**, **Instance**, or **Operation Result** to change the sorting mode. + - All alarms of the same severity can be filtered by **Severity**. The results include cleared and uncleared alarms. + + Exported audit logs contain the following information: + + - **Sno**: indicates the number of audit logs generated by MRS Manager. The number is incremented by 1 when a new audit log is generated. + - **Operation Type**: indicates the operation type of a user operation. There are nine scenarios: **Alarm**, **Auditlog**, **Backup And Restoration**, **Cluster**, **Collect Log**, **Host**, **Service**, **Tenant** and **User_Manager**. **User_Manager** is supported only in clusters with Kerberos authentication enabled. Each scenario contains different operation types. For example, **Alarm** includes **Export alarms**; **Cluster** includes **Start cluster**, and **Tenant** include **Add tenant**. + - **Severity**: indicates the security level of each audit log, including **Critical**, **Major**, **Minor** and **Informational**. + - **Start Time**: indicates the time when the operation starts. The time is GMT+01:00 or GMT+02:00. + - **End Time**: indicates the time when the operation ends. The time is GMT+01:00 or GMT+02:00. + - **User IP Address**: indicates the IP address used by a user to perform operations. 
+ - **User**: indicates the name of the user who performs the operation. + - **Host**: indicates the node where the user operation is performed. The information is not saved if the operation does not involve a node. + - **Service**: indicates the service in the cluster where the user operation is performed. The information is not saved if the operation does not involve a service. + - **Instance**: indicates the role instance in the cluster where the user operation is performed. The information is not saved if the operation does not involve a role instance. + - **Operation Result**: indicates the operation result, including **Successful**, **Failed** and **Unknown**. + - **Content**: indicates execution information of the user operation. + + #. Click **Advanced Search**. In the search area, set search criteria and click **Search** to view audit logs of the specified type. Click **Reset** to clear the search criteria. + + .. note:: + + **Start Time** and **End Time** specify the start time and end time of the time range. You can search for alarms generated within the time range. + +- Exporting audit logs + + #. In the audit log list, click **Export All** to export all logs. + #. In the audit log list, select the check box of a log and click **Export** to export the log. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/configuring_monitoring_metric_dumping.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/configuring_monitoring_metric_dumping.rst new file mode 100644 index 0000000..e2b83c2 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/configuring_monitoring_metric_dumping.rst @@ -0,0 +1,58 @@ +:original_name: mrs_01_0235.html + +.. _mrs_01_0235: + +Configuring Monitoring Metric Dumping +===================================== + +You can configure interconnection parameters on MRS Manager to save monitoring metric data to a specified FTP server using the FTP or SFTP protocol. In this way, MRS clusters can interconnect with third-party systems. The FTP protocol does not encrypt data, which brings potential security risks. Therefore, the SFTP protocol is recommended. + +MRS Manager supports the collection of all the monitoring metric data in the managed clusters. The collection period is 30 seconds, 60 seconds, or 300 seconds. The monitoring metric data is stored to different monitoring files on the FTP server by collection period. The monitoring file naming rule is in the "*Cluster name*\ \_\ **metric**\ \_\ *Monitoring metric data collection period*\ \_\ *File saving time*\ **.log**" format. + +Prerequisites +------------- + +The ECS corresponding to the dump server must be in the same VPC as the Master node of the MRS cluster, and the Master node can access the IP address and specified port of the dump server. The FTP service on the dump server is running properly. + +Procedure +--------- + +#. On MRS Manager, click **System**. + +#. In **Configuration**, click **Configure Monitoring Metric Dump** under **Monitoring and Alarm**. + +#. :ref:`Table 1 ` describes dump parameters. + + .. _mrs_01_0235__en-us_topic_0035209602_table50198556114935: + + .. 
table:: **Table 1** Dump parameters + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+==============================================================================================================================================================================================================+ + | Dump Monitoring Metric | Mandatory. This parameter specifies whether to enable the monitoring metric data interconnection function.spe | + | | | + | | - |image1|: Enabled. | + | | - |image2|: Disabled. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FTP IP Address | Mandatory. This parameter specifies the FTP server for storing monitoring files after the monitoring indicator data is interconnected. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FTP Port | Mandatory. This parameter specifies the port connected to the FTP server. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FTP Username | Mandatory. This parameter specifies the username for logging in to the FTP server. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FTP Password | Mandatory. This parameter specifies the password for logging in to the FTP server. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Save Path | Mandatory. This parameter specifies the path for storing monitoring files on the FTP server. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Dump Interval (s) | Mandatory. This parameter specifies the interval at which monitoring files are periodically stored on the FTP server, in seconds. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Dump Mode | Mandatory. This parameter specifies the protocol used for sending monitoring files. This parameter is mandatory. The options are **FTP** and **SFTP**. 
| + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SFTP Public Key | Optional. This parameter specifies the public key of the FTP server and is valid only when **Dump Mode** is set to **SFTP**. You are advised to configure a public key. Otherwise, security risks may arise. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK** to complete the settings. + +.. |image1| image:: /_static/images/en-us_image_0000001349057789.png +.. |image2| image:: /_static/images/en-us_image_0000001349137681.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/dashboard.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/dashboard.rst new file mode 100644 index 0000000..8693e8f --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/dashboard.rst @@ -0,0 +1,184 @@ +:original_name: mrs_01_0107.html + +.. _mrs_01_0107: + +Dashboard +========= + +On MRS Manager, nodes in a cluster can be classified into management nodes, control nodes, and data nodes. The change trends of key host monitoring metrics on each type of node can be calculated and displayed as curve charts in reports based on the customized periods. If a host belongs to multiple node types, the metric statistics will be repeatedly collected. + +This section provides overview of MRS clusters and describes how to view, customize, and export node monitoring metrics on MRS Manager. + +Procedure +--------- + +#. Log in to MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + +#. Choose **Dashboard** on MRS Manager. + +#. In **Period**, you can specify a period to view monitoring data. The options are as follows: + + - Real time + - Last 3 hours + - Last 6 hours + - Last 24 hours + - Last week + - Last month + - Last 3 months + - Last 6 months + - Customize. If you select this option, you can customize the period for viewing monitoring data. + +#. Click **View** to view monitoring data in a period. + + - You can view **Health Status** and **Roles** of each service on the **Service Summary** page of MRS Manager. + - Click |image1| above the curve chart to view details about a metric. + +#. Customize a monitoring report. + + a. Click **Customize** and select monitoring metrics to be displayed on MRS Manager. + + MRS Manager supports a maximum of 14 monitoring metrics, but at most 12 customized monitoring metrics can be displayed on the page. + + - Cluster Host Health Status + - Cluster Network Read Speed Statistics + - Host Network Read Speed Distribution + - Host Network Write Speed Distribution + - Cluster Disk Write Speed Statistics + - Cluster Disk Usage Statistics + - Cluster Disk Information + - Host Disk Usage Statistics + - Cluster Disk Read Speed Statistics + - Cluster Memory Usage Statistics + - Host Memory Usage Distribution + - Cluster Network Write Speed Statistics + - Host CPU Usage Distribution + - Cluster CPU Usage Statistics + + b. Click **OK** to save the selected monitoring metrics for display. + + .. 
note:: + + Click **Clear** to cancel all the selected monitoring metrics in a batch. + +#. Set an automatic refresh interval or click |image2| for an immediate refresh. + + The following refresh interval options are supported: + + - Refresh every 30 seconds + - Refresh every 60 seconds + - Stop refreshing + + .. note:: + + If you select **Full Screen**, the **Dashboard** window will be maximized. + +#. Export a monitoring report. + + a. Select a period. The options are as follows: + + - Real time + - Last 3 hours + - Last 6 hours + - Last 24 hours + - Last week + - Last month + - Last 3 months + - Last 6 months + - Customize. If you select this option, you can customize a time of period to export a report. + + b. Click **Export**. MRS Manager will generate a report about the selected monitoring metrics in a specified time of period. Save the report. + + .. note:: + + To view the curve charts of monitoring metrics in a specified period, click **View**. + +For MRS 1.7.2 or earlier, the real-time monitoring page and historical report page are separated. The procedure is as follows. + +- Real-time monitoring + + #. Log in to MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + + #. On MRS Manager, choose **Dashboard** > **Real time**. + + - You can view **Health Status** and **Roles** of each service on the **Service Summary** page of MRS Manager. + + - The following are some of host monitoring metrics displayed on MRS Manager. + + - Cluster Host Health Status + - Host Network Read Speed Distribution + - Host Network Write Speed Distribution + - Cluster Disk Information + - Host Disk Usage Distribution + - Cluster Memory Usage + - Host Memory Usage Distribution + - Host CPU Usage Distribution + - Average Cluster CPU Usage + + You can click **Customize** to display the specified monitoring metrics. + + #. Set an automatic refresh interval or click |image3| for an immediate refresh. + + The following refresh interval options are supported: + + - Refresh every 30 seconds + - Refresh every 60 seconds + - Stop refreshing + + .. note:: + + If you select **Full Screen**, the **Real-time Monitoring** window will be maximized. + +- Historical reports + + #. View a monitoring report. + + a. Log in to MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + + b. On MRS Manager, click **Dashboard**. + + c. Click **Historical Report** to view a report. + + By default, the report displays the monitoring metric statistics of the previous day. + + .. note:: + + If you select **Full Screen**, the **Historical Report** window will be maximized. + + #. Customize a monitoring report. + + a. Click **Customize** and select monitoring metrics to be displayed on MRS Manager. + + MRS Manager supports a maximum of 8 monitoring metrics, but at most 6 customized monitoring metrics can be displayed on the page. + + - Cluster Network Read Speed Statistics + - Cluster Disk Write Speed Statistics + - Cluster Disk Usage Statistics + - Cluster Disk Information + - Cluster Disk Read Speed Statistics + - Cluster Memory Usage Statistics + - Cluster Network Write Speed Statistics + - Cluster CPU Usage Statistics + + b. Click **OK** to save the selected monitoring metrics for display. + + .. note:: + + Click **Clear** to cancel all the selected monitoring metrics in a batch. + + #. Export a monitoring report. + + a. Select a period. 
+ + The following options are available: **Last day**, **Last week**, **Last month**, **Last quarter**, and **Last half year** + + In **Time Range**, you can also specify exact start and end time. + + b. Click **Export**. MRS Manager will generate a report about the selected monitoring metrics in a specified time of period. Save the report. + + .. note:: + + To view the curve charts of monitoring metrics in a specified period, click **View**. + +.. |image1| image:: /_static/images/en-us_image_0000001296058400.png +.. |image2| image:: /_static/images/en-us_image_0000001348737865.png +.. |image3| image:: /_static/images/en-us_image_0000001348737865.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/index.rst new file mode 100644 index 0000000..21f5861 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_0106.html + +.. _mrs_01_0106: + +Monitoring Management +===================== + +- :ref:`Dashboard ` +- :ref:`Managing Services and Monitoring Hosts ` +- :ref:`Managing Resource Distribution ` +- :ref:`Configuring Monitoring Metric Dumping ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + dashboard + managing_services_and_monitoring_hosts + managing_resource_distribution + configuring_monitoring_metric_dumping diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/managing_resource_distribution.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/managing_resource_distribution.rst new file mode 100644 index 0000000..da609c6 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/managing_resource_distribution.rst @@ -0,0 +1,54 @@ +:original_name: mrs_01_0233.html + +.. _mrs_01_0233: + +Managing Resource Distribution +============================== + +On MRS Manager, you can query the top value curves, bottom value curves, or average data curves of key service and host monitoring metrics, that is, the resource distribution information. MRS Manager allows you to view the monitoring data of the last hour. + +You can also modify the resource distribution on MRS Manager to display both the top and bottom value curves in service and host resource distribution figures. + +Resource distribution of some monitoring metrics is not recorded. + +Procedure +--------- + +- View the resource distribution of service monitoring metrics. + + #. On MRS Manager, click **Services**. + + #. Select the target service from the service list. + + #. Click **Resource Distribution**. + + Select key metrics of the service from **Metric**. MRS Manager displays the resource distribution of the metrics in the last hour. + +- View the resource distribution of host monitoring metrics. + + #. Click **Hosts**. + + #. Click the name of the specified host in the host list. + + #. Click **Resource Distribution**. + + Select key metrics of the host from **Metrics**. MRS Manager displays the resource distribution of the metrics in the last hour. + +- Configure resource distribution. + + #. On MRS Manager, click **System**. + + #. In **Configuration**, click **Configure Resource Contribution Ranking** under **Monitoring and Alarm**. + + #. 
Change the number of resources to be displayed. + + - Set **Number of Top Resources** to the number of top values. + - Set **Number of Bottom Resources** to the number of bottom values. + + .. note:: + + The sum of the maximum value and minimum value of resource distribution cannot be greater than 5. + + #. Click **OK** to save the configurations. + + The message "Number of top and bottom resources saved successfully" is displayed in the upper right corner of the page. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/managing_services_and_monitoring_hosts.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/managing_services_and_monitoring_hosts.rst new file mode 100644 index 0000000..2f97f3b --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/monitoring_management/managing_services_and_monitoring_hosts.rst @@ -0,0 +1,264 @@ +:original_name: mrs_01_0232.html + +.. _mrs_01_0232: + +Managing Services and Monitoring Hosts +====================================== + +You can manage the following status and indicators of all services (including role instances) and hosts on the MRS Manager: + +- Status information: includes operation, health, configuration, and role instance status. +- Metric information: includes key monitoring metrics for services. +- Metric export: allows you to export monitoring reports. + +.. note:: + + Set an automatic refresh interval or click |image1| for an immediate refresh. + + The following refresh interval options are supported: + + - Refresh every 30 seconds + - Refresh every 60 seconds + - Stop refreshing + +.. _mrs_01_0232__en-us_topic_0035209600_section37246995143046: + +Managing Service Monitoring +--------------------------- + +#. On MRS Manager, click **Services**. + + The service list includes **Service**, **Operating Status**, **Health Status**, **Configuration Status**, **Roles**, and **Operation** are displayed in the component list. + + - :ref:`Table 1 ` describes the service operating status. + + .. _mrs_01_0232__en-us_topic_0035209600_table4224219143943: + + .. table:: **Table 1** Service operating status + + +-----------------+------------------------------------------------------------------------+ + | Status | Description | + +=================+========================================================================+ + | Started | The service is started. | + +-----------------+------------------------------------------------------------------------+ + | Stopped | The service is stopped. | + +-----------------+------------------------------------------------------------------------+ + | Failed to start | Failed to start the role instance. | + +-----------------+------------------------------------------------------------------------+ + | Failed to stop | Failed to stop the role instance. | + +-----------------+------------------------------------------------------------------------+ + | Unknown | Indicates initial service status after the background system restarts. | + +-----------------+------------------------------------------------------------------------+ + + - :ref:`Table 2 ` describes the service health status. + + .. _mrs_01_0232__en-us_topic_0035209600_table43931038143943: + + .. 
table:: **Table 2** Service health status + + +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Description | + +===================+====================================================================================================================================================================+ + | Good | Indicates that all role instances in the service are running properly. | + +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Bad | Indicates that the running status of at least one role instance is **Faulty** or the status of the service on which the current service depends is abnormal. | + +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Unknown | Indicates that all role instances in the service are in the **Unknown** state. | + +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Concerning | Indicates that the background system is restarting the service. | + +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Partially Healthy | Indicates that the status of the service on which the service depends is abnormal, and APIs related to the abnormal service cannot be invoked by external systems. | + +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + - :ref:`Table 3 ` describes the service health status. + + .. _mrs_01_0232__en-us_topic_0035209600_table16122213143943: + + .. table:: **Table 3** Service configuration status + + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Description | + +==============+==============================================================================================================================================================+ + | Synchronized | The latest configuration takes effect. | + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Expired | The latest configuration does not take effect after the parameter modification. Related services need to be restarted. | + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Failed | The communication is incorrect or data cannot be read or written during the parameter configuration. Use **Synchronize Configuration** to rectify the fault. | + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuring | Parameters are being configured. 
| + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Unknown | Current configuration status cannot be obtained. | + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + By default, the **Service** column is sorted in ascending order. You can click the icon next to **Service**, **Operating Status**, **Health Status**, or **Configuration Status** to change the sorting mode. + +#. Click a specified service in the list to view its status and metric information. + +#. Customize monitoring metrics and export customized monitoring information. + + a. In the **Charts** area, click **Customize** to customize service monitoring metrics. + b. In **Period** area, select a time of period and click **View** to view the monitoring data within the time period. + c. Click **Export** to export the displayed metrics. + + **For MRS 1.7.2 or earlier:** + + a. In the **Real time** area, click **Customize** to customize service monitoring metrics. + b. Click **History** to go to the historical monitoring query page. + c. Select a time of period and click **View** to view the monitoring data within the time period. + d. Click **Export** to export the displayed metrics. + +.. _mrs_01_0232__en-us_topic_0035209600_section65508505145118: + +Managing Role Instances +----------------------- + +#. On MRS Manager, click **Services** and click the target service name in the service list. + +#. Click **Instance** to view the role status. + + The role instance list contains the **Role**, **Host Name**, **OM IP Address**, **Business IP Address**, **Rack**, **Operation Status**, **Health Status,** and **Configuration Status** of an instance. + + - :ref:`Table 4 ` shows the configuration status of a role instance. + + .. _mrs_01_0232__en-us_topic_0035209600_table37414462155145: + + .. table:: **Table 4** Role instance status + + +-----------------+------------------------------------------------------------------------------+ + | Status | Description | + +=================+==============================================================================+ + | Started | The role instance has been started. | + +-----------------+------------------------------------------------------------------------------+ + | Stopped | The role instance has been stopped. | + +-----------------+------------------------------------------------------------------------------+ + | Failed to start | Failed to start the role instance. | + +-----------------+------------------------------------------------------------------------------+ + | Failed to stop | Failed to stop the role instance. | + +-----------------+------------------------------------------------------------------------------+ + | Decommissioning | The role instance is being decommissioned. | + +-----------------+------------------------------------------------------------------------------+ + | Decommissioned | The role instance has been decommissioned. | + +-----------------+------------------------------------------------------------------------------+ + | Recommissioning | The role instance is being recommissioned. | + +-----------------+------------------------------------------------------------------------------+ + | Unknown | Indicates initial role instance status after the background system restarts. 
| + +-----------------+------------------------------------------------------------------------------+ + + - :ref:`Table 5 ` shows the health status of a role instance. + + .. _mrs_01_0232__en-us_topic_0035209600_table61889899144412: + + .. table:: **Table 5** Role instance health status + + +------------+------------------------------------------------------------------------------------------------+ + | Status | Description | + +============+================================================================================================+ + | Good | The role instance is running properly. | + +------------+------------------------------------------------------------------------------------------------+ + | Bad | The role instance is abnormal. For example, the port cannot be accessed if PID does not exist. | + +------------+------------------------------------------------------------------------------------------------+ + | Unknown | The host where a role instance resides does not connect to the background system. | + +------------+------------------------------------------------------------------------------------------------+ + | Concerning | The background system is restarting a role instance. | + +------------+------------------------------------------------------------------------------------------------+ + + - :ref:`Table 6 ` shows the configuration status of a role instance. + + .. _mrs_01_0232__en-us_topic_0035209600_table20951019144412: + + .. table:: **Table 6** Role instance configuration status + + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Description | + +==============+==============================================================================================================================================================+ + | Synchronized | The latest configuration takes effect. | + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Expired | The latest configuration does not take effect after the parameter modification. Related services need to be restarted. | + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Failed | The communication is incorrect or data cannot be read or written during the parameter configuration. Use **Synchronize Configuration** to rectify the fault. | + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuring | Parameters are being configured. | + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Unknown | Current configuration status cannot be obtained. | + +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + By default, the **Role** column is sorted in ascending order. 
You can click the sorting icon next to **Role**, **Host Name**, **OM IP Address**, **Business IP Address**, **Rack**, **Operating Status**, **Health Status**, or **Configuration Status** to change the sorting mode. + + You can filter out all instances of the same role in the **Role** column. + + You can set search criteria in the role search area by clicking **Advanced Search**, and click **Search** to view specified role information. Click **Reset** to clear the search criteria. Fuzzy search is supported. + +#. Click the target role instance to view its status and metric information. + +#. Customize monitoring metrics and export customized monitoring information. + + a. In the **Charts** area, click **Customize** to customize service monitoring metrics. + b. In **Period** area, select a time of period and click **View** to view the monitoring data within the time period. + c. Click **Export** to export the displayed metrics. + + **For MRS 1.7.2 or earlier:** + + a. In the **Real time** area, click **Customize** to customize service monitoring metrics. + b. Click **History** to go to the historical monitoring query page. + c. Select a time of period and click **View** to view the monitoring data within the time period. + d. Click **Export** to export the displayed metrics. + +.. _mrs_01_0232__en-us_topic_0035209600_section47168733145426: + +Managing Hosts +-------------- + +#. On MRS Manager, click **Hosts** to view the status of all hosts. + + The host list contains the host name, management IP address, service IP address, rack, network speed, operating status, health status, disk usage, memory usage, and CPU usage. + + - :ref:`Table 7 ` shows the host operating status. + + .. _mrs_01_0232__en-us_topic_0035209600_table63059102152614: + + .. table:: **Table 7** Host operating status + + +----------+-----------------------------------------------------------------------+ + | Status | Description | + +==========+=======================================================================+ + | Normal | The host and service roles on the host are running properly. | + +----------+-----------------------------------------------------------------------+ + | Isolated | The host is isolated, and the service roles on the host stop running. | + +----------+-----------------------------------------------------------------------+ + + - :ref:`Table 8 ` describes the host health status. + + .. _mrs_01_0232__en-us_topic_0035209600_table48654081152619: + + .. table:: **Table 8** Host health status + + +---------+---------------------------------------------------------------------------------------+ + | Status | Description | + +=========+=======================================================================================+ + | Good | The host can properly send heartbeats. | + +---------+---------------------------------------------------------------------------------------+ + | Bad | The host fails to send heartbeats due to timeout. | + +---------+---------------------------------------------------------------------------------------+ + | Unknown | The host initial status is unknown during the operation of adding or deleting a host. | + +---------+---------------------------------------------------------------------------------------+ + + By default, the **Host Name** column is sorted by host name in ascending order. 
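
   Similarly, the host list described above can be reviewed offline after export. The sketch below is only an illustration, assuming a hypothetical ``hosts.csv`` export with the columns listed in this section and usage values expressed as percentages; the file name, layout, and the 90% threshold are assumptions for the example, not values defined by MRS Manager.

   .. code-block:: python

      import csv

      CSV_PATH = "hosts.csv"   # hypothetical export of the host list
      USAGE_THRESHOLD = 90.0   # illustrative threshold; tune it to your own alarm policy

      def parse_percent(value):
          """Convert values such as '85%' or '85.2' to a float, or None if empty."""
          value = (value or "").strip().rstrip("%")
          return float(value) if value else None

      def hosts_needing_attention(path):
          flagged = []
          with open(path, newline="", encoding="utf-8") as f:
              for row in csv.DictReader(f):
                  over = {}
                  for column in ("Disk Usage", "Memory Usage", "CPU Usage"):
                      usage = parse_percent(row.get(column))
                      if usage is not None and usage >= USAGE_THRESHOLD:
                          over[column] = usage
                  # Per Table 8, "Bad" means the host fails to send heartbeats in time.
                  if (row.get("Health Status") or "").strip() == "Bad" or over:
                      flagged.append((row.get("Host Name"), row.get("Health Status"), over))
          return flagged

      if __name__ == "__main__":
          for host, health, over in hosts_needing_attention(CSV_PATH):
              print(f"{host}: health={health}, over threshold={over}")
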
You can click the sorting icon next to **Host Name**, **OM IP Address**, **Business IP Address**, **Rack**, **Network Speed**, **Operating Status**, **Health Status**, **Disk Usage**, **Memory Usage**, or **CPU Usage** to change the sorting mode. + + You can set search criteria in the role search area by clicking **Advanced Search**, and click **Search** to view specified role information. Click **Reset** to clear the search criteria. Fuzzy search is supported. + +#. Click the target host in the host list to view its status and metric information. + +#. Customize monitoring metrics and export customized monitoring information. + + a. In the **Charts** area, click **Customize** to customize service monitoring metrics. + b. In **Period** area, select a time of period and click **View** to view the monitoring data within the time period. + c. Click **Export** to export the displayed metrics. + + **For MRS 1.7.2 or earlier:** + + a. In the **Real time** area, click **Customize** to customize service monitoring metrics. + b. Click **History** to go to the historical monitoring query page. + c. Select a time of period and click **View** to view the monitoring data within the time period. + d. Click **Export** to export the displayed metrics. + +.. |image1| image:: /_static/images/en-us_image_0000001348737925.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/changing_the_password_of_an_operation_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/changing_the_password_of_an_operation_user.rst new file mode 100644 index 0000000..207c534 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/changing_the_password_of_an_operation_user.rst @@ -0,0 +1,50 @@ +:original_name: mrs_01_0350.html + +.. _mrs_01_0350: + +Changing the Password of an Operation User +========================================== + +Scenario +-------- + +Passwords of **Human-machine** system users must be regularly changed to ensure MRS cluster security. This section describes how to change passwords on MRS Manager. + +If a new password policy needs to be used for the password modified by the user, follow instructions in :ref:`Modifying a Password Policy ` to modify the password policy and then perform the following operations to modify the password. + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Changing a User Password `. + +Impact on the System +-------------------- + +If you have downloaded a user authentication file, download it again and obtain the keytab file after modifying the password of the MRS cluster user. + +Prerequisites +------------- + +- You have obtained the current password policies from the administrator. +- You have obtained the MRS Manager access address from the administrator. +- You have obtained a cluster with Kerberos authentication enabled or a common cluster with the EIP function enabled. + +Procedure +--------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + +#. On MRS Manager, move the mouse cursor to |image1| in the upper right corner. + + On the menu that is displayed, select **Change Password**. + +#. Fill in the **Old Password**, **New Password**, and **Confirm Password**. Click **OK**. 
+ + For the cluster, the default password complexity requirements are as follows: + + - The password must contain 8 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. + +.. |image1| image:: /_static/images/en-us_image_0000001296217716.jpg diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/configuring_cross-cluster_mutual_trust_relationships.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/configuring_cross-cluster_mutual_trust_relationships.rst new file mode 100644 index 0000000..1908f88 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/configuring_cross-cluster_mutual_trust_relationships.rst @@ -0,0 +1,194 @@ +:original_name: mrs_01_0354.html + +.. _mrs_01_0354: + +Configuring Cross-Cluster Mutual Trust Relationships +==================================================== + +Scenario +-------- + +If cluster A needs to access the resources of cluster B, a mutual trust relationship must be configured between the two clusters. + +If no trust relationship is configured, resources of a cluster are available only to users in that cluster. MRS automatically assigns a unique **domain name** to each cluster to define the scope of resources for its users. + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Configuring Cross-Manager Mutual Trust Between Clusters `. + +Impact on the System +-------------------- + +- After cross-cluster mutual trust is configured, resources of a cluster become available to users in the other cluster. User permissions in the clusters must be checked regularly based on service and security requirements. +- After cross-cluster mutual trust is configured, the KrbServer service needs to be restarted, and the cluster becomes unavailable during the restart. +- After cross-cluster mutual trust is configured, internal users **krbtgt/**\ *Local cluster domain name*\ **@**\ *External cluster domain name* and **krbtgt/**\ *External cluster domain name*\ **@**\ *Local cluster domain name* are added to the two clusters. These internal users cannot be deleted. For versions earlier than MRS 1.9.2, the default password is **Admin@123**; for MRS 1.9.2 or later, it is **Crossrealm@123**. + +Procedure +--------- + +#. .. _mrs_01_0354__l0784aeae88934d19b78de6cb8fb5bbc4: + + On the MRS management console, query all security groups of the two clusters. + + - If the security groups of the two clusters are the same, go to :ref:`3 `. + - If the security groups of the two clusters are different, go to :ref:`2 `. + +#. .. _mrs_01_0354__lebc29e3bd1dc48aea26ab1501ab7b5c6: + + On the VPC management console, add rules for each security group. + + Set **Protocol** to **ANY**, **Transfer Direction** to **Inbound**, and **Source** to **Security Group**. The source is the security group of the peer cluster. + + - For cluster A, add inbound rules to the security group, set **Source** to the security groups of cluster B (the peer cluster of cluster A).
+ - For cluster B, add inbound rules to the security group, set **Source** to the security groups of cluster A (the peer cluster of cluster B). + + .. note:: + + For a common cluster with Kerberos authentication disabled, perform step :ref:`1 ` to :ref:`2 ` to configure cross-cluster mutual trust. For a security cluster with Kerberos authentication enabled, after completing the preceding steps, proceed to the following steps for configuration. + +#. .. _mrs_01_0354__l313690838fdd4efe880c73a525f3a5dc: + + Log in to MRS Manager of the two clusters separately. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. Click **Service** and check whether the **Health Status** of all components is **Good**. + + - If yes, go to :ref:`4 `. + - If no, contact technical support personnel for troubleshooting. + +#. .. _mrs_01_0354__l2b421fc6a59b49f198148465895e2332: + + Query configuration information. + + a. On MRS Manager of the two clusters, choose **Services** > **KrbServer** > **Instance**. Query the **OM IP Address** of the two KerberosServer hosts. + b. Click **Service Configuration**. Set **Type** to **All**. Choose **KerberosServer** > **Port** in the navigation tree on the left. Query the value of **kdc_ports**. The default value is **21732**. + c. Click **Realm** and query the value of **default_realm**. + +#. .. _mrs_01_0354__lf46908028ffa4276b982a3872741b63b: + + On MRS Manager of either cluster, modify the **peer_realms** parameter. + + .. table:: **Table 1** Parameter description + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+================================================================================================================================================================================================================================================+ + | realm_name | Domain name of the mutual-trust cluster, that is, the value of **default_realm** obtained in step :ref:`4 `. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ip_port | KDC address of the peer cluster. Format: *IP address of a KerberosServer node in the peer cluster:kdc_port* | + | | | + | | The addresses of the two KerberosServer nodes are separated by a comma. For example, if the IP addresses of the KerberosServer nodes are 10.0.0.1 and 10.0.0.2 respectively, the value of this parameter is **10.0.0.1:21732,10.0.0.2:21732**. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + - To deploy trust relationships with multiple clusters, click |image1| to add items and specify relevant parameters. To delete an item, click |image2|. + - A cluster can have trust relationships with a maximum of 16 clusters. By default, no trust relationship exists between different clusters that are trusted by a local cluster. + +#. 
Click **Save Configuration**. In the dialog box that is displayed, select **Restart the affected services or instances** and click **OK**. If you do not select **Restart the affected services or instances**, manually restart the affected services or instances. + + After **Operation successful** is displayed, click **Finish**. + +#. .. _mrs_01_0354__l45679c1f701240a1bd1eaebbcc3ab4af: + + Exit MRS Manager and log in to it again. If the login is successful, the configurations are valid. + +#. Log in to MRS Manager of the other cluster and repeat step :ref:`5 ` to :ref:`7 `. + +Follow-up Operations +-------------------- + +After cross-cluster mutual trust is configured, the service configuration parameters are modified on MRS Manager and the service is restarted. Therefore, you need to prepare the client configuration file again and update the client. + +Scenario 1: + +Cluster A and cluster B (peer cluster and mutually trusted cluster) are the same type, for example, analysis cluster or streaming cluster. Follow instructions in :ref:`Updating a Client (Versions Earlier Than 3.x) ` to update the client configuration files of cluster A and B respectively. + +- Update the client configuration file of cluster A. +- Update the client configuration file of cluster B. + +Scenario 2: + +Cluster A and cluster B (peer cluster and mutually trusted cluster) are the different type. Perform the following steps to update the configuration files. + +- Update the client configuration file of cluster A to cluster B. +- Update the client configuration file of cluster B to cluster A. +- Update the client configuration file of cluster A. +- Update the client configuration file of cluster B. + +#. .. _mrs_01_0354__li26199321164818: + + Log in to MRS Manager of cluster A. + +#. .. _mrs_01_0354__li25485761174241: + + Click **Services**, and then **Download Client**. + +#. Set **Client Type** to **Only configuration files**. + +#. Set **Download to** to **Remote host**. + +#. Set **Host IP Address** to the IP address of the active Master node of cluster B, **Host Port** to 22, and **Save Path** to **/tmp**. + + - If the default port **22** for logging in to cluster B using SSH is changed, set **Host Port** to a new port. + - The value of **Save Path** contains a maximum of 256 characters. + +#. Set **Login User** to **root**. + + If another user is used, ensure that the user has permissions to read, write, and execute the save path. + +#. In **SSH Private Key**, select and upload the key file used for creating cluster B. + +#. Click **OK** to generate a client file. + + If the following information is displayed, the client file is saved. Click **Close**. + + .. code-block:: text + + Client files downloaded to the remote host successfully. + + If the following information is displayed, check the username, password, and security group configurations of the remote host. Ensure that the username and password are correct and an inbound rule of the SSH (22) port has been added to the security group of the remote host. And then, go to :ref:`2 ` to download the client again. + + .. code-block:: text + + Failed to connect to the server. Please check the network connection or parameter settings. + +#. Log in to the ECS of cluster B using VNC. For details, see **Instances** > **Logging In to a Windows ECS > Login Using VNC** in the *Elastic Cloud Server User Guide* + + Log in to the ECS. For details, see `Login Using an SSH Key `__. Set the ECS password and log in to the ECS in VNC mode. + +#. 
Run the following command to switch to the client directory, for example, **/opt/Bigdata/client**: + + **cd /opt/Bigdata/client** + +#. .. _mrs_01_0354__li1470645152711: + + Run the following command to update the client configuration of cluster A to cluster B: + + **sh refreshConfig.sh** *Client installation directory* *Full path of the client configuration file package* + + For example, run the following command: + + **sh refreshConfig.sh /opt/Bigdata/client /tmp/MRS_Services_Client.tar** + + If the following information is displayed, the configurations have been updated successfully. + + .. code-block:: + + ReFresh components client config is complete. + Succeed to refresh components client config. + + .. note:: + + You can also refer to method 2 in :ref:`Updating a Client (Versions Earlier Than 3.x) ` to perform operations in :ref:`1 ` to :ref:`11 `. + +#. Repeat step :ref:`1 ` to :ref:`11 ` to update the client configuration file of cluster B to cluster A. + +#. Follow instructions in :ref:`Updating a Client (Versions Earlier Than 3.x) ` to update the client configuration file of the local cluster. + + - Update the client configuration file of cluster A. + - Update the client configuration file of cluster B. + +.. |image1| image:: /_static/images/en-us_image_0000001296058048.jpg +.. |image2| image:: /_static/images/en-us_image_0000001349257345.jpg diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/configuring_fine-grained_permissions_for_mrs_multi-user_access_to_obs.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/configuring_fine-grained_permissions_for_mrs_multi-user_access_to_obs.rst new file mode 100644 index 0000000..a9b6fc2 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/configuring_fine-grained_permissions_for_mrs_multi-user_access_to_obs.rst @@ -0,0 +1,215 @@ +:original_name: mrs_01_0632.html + +.. _mrs_01_0632: + +Configuring Fine-Grained Permissions for MRS Multi-User Access to OBS +===================================================================== + +When fine-grained permission control is enabled, you can configure OBS access permissions to implement access control on directories in OBS file systems. + +.. note:: + + This section applies only to MRS 2.\ *x* or earlier (excluding MRS 1.9.2). + +This function enables you to control MRS users' access to OBS resources. For example, if you allow user group A to only access log files in a specified OBS file system, perform the following operations: + +#. Configure an agency with OBS access permissions for an MRS cluster so that OBS can be accessed using the temporary AK/SK automatically obtained by the ECS. This prevents the AK/SK from being exposed in the configuration file. +#. Create a policy on the IAM console to allow access to log files in a specified OBS file system, and create an agency bound to the policy permission. +#. In the MRS cluster, bind the new agency to user group A so that user group A only has the permission to access log files in the specified OBS file system. + +In the following scenarios, the username used for submitting jobs is an internal username so that MRS multi-user access to OBS is not supported. + +- For spark-beeline, the internal username used for submitting jobs is **spark** in a security cluster and **omm** in a normal cluster. 
+- For the HBase shell, the internal username used for submitting jobs is **hbase** in a security cluster and **omm** in a normal cluster. +- For Presto, the internal username used for submitting jobs in the security cluster is **omm** or **hive**, and that in the normal cluster is **omm**. (Choose **Components** > **Presto** > **Service Configuration**. Change **Basic** to **All** in the parameter type drop-down box.) Then, search for and change the value of **hive.hdfs.impersonation.enabled** to **true** to enable MRS multi-user to access OBS with fine-grained permissions. + +Prerequisites +------------- + +- Fine-grained permission control has been enabled. For details about permissions management, see :ref:`Creating an MRS User `. +- You have a basic knowledge of **IAM Agencies** (see section "Agencies" in the *IAM User Guide*) and OBS fine-grained policies. + +Step 1: Configuring an Agency with OBS Access Permission for a Cluster +---------------------------------------------------------------------- + +#. Follow instructions in :ref:`Configuring a Storage-Compute Decoupled Cluster (Agency) ` to configure an agency with OBS access permissions. + + The agency takes effect for all users (including internal users) and user groups in the cluster. To control the permissions of users and user groups in the cluster to access OBS, perform the following operations. + +Step 2: Creating a Policy and an Agency on IAM +---------------------------------------------- + +Create policies with different access permissions and bind the policies to the agency. For details, see :ref:`Creating a Policy and an Agency on IAM `. + +Step 3: Configuring OBS Permission Control Mappings on the MRS Cluster Details Page +----------------------------------------------------------------------------------- + +#. On the MRS management console, choose **Clusters** > **Active Clusters** and click the cluster name. + +#. In the **Basic Information** area on the **Dashboard** tab page, click **Manage** next to **OBS Permission Control**. + +#. Click **Add Mapping** and set parameters according to :ref:`Table 1 `. + + .. _mrs_01_0632__table175454305220: + + .. table:: **Table 1** OBS permission control parameters + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+====================================================================================================================================================================================================================================================================+ + | IAM Agency | Select the agency created in :ref:`2 `. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Type | - **User**: User-level mapping | + | | - **Group**: User group-level mapping | + | | | + | | .. note:: | + | | | + | | - User-level mapping takes priority over user group-level mapping. If you select **Group**, you are advised to enter the primary group name in **MRS User (User Group)**. 
| + | | - Do not use the same username (user group) for multiple mapping records. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | MRS User (User Group) | Use commas (,) to separate multiple names of users or user groups. | + | | | + | | .. note:: | + | | | + | | - If OBS permission control is not configured for a user and no AK and SK are configured, the OBS Operator permission in **MRS_ECS_DEFAULT_AGENCY** will be used for accessing OBS. You are advised not to bind the internal user of a component to an agency. | + | | - If you need to configure an agency for the internal user of a component when submitting a job in the following scenarios, the requirements are as follows: | + | | | + | | - To control permissions on spark-beeline operations, set the username to **spark** for a security cluster and **omm** for a normal cluster. | + | | - To control permissions on HBase shell operations, set the username to **hbase** for a security cluster and **omm** for a normal cluster. | + | | - To control permissions on Presto, set the username to **omm**, **hive**, and the username used for logging in to the client for a security cluster and **omm** and the username used for logging in to the client for a normal cluster. | + | | - If you want to use Hive to create tables in beeline mode, set the username to the internal user **hive**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**. + +#. Select **I agree to authorize the trust relationships between MRS Users (Groups) and IAM agencies**, and click **OK**. The mapping between the MRS user and OBS permission is added. + + If |image1| appears next to **OBS Permission Control** on the **Dashboard** tab page or the mapping table has been updated for OBS permission control, the mapping takes effect. It takes about 1 minute to for the mapping to take effect. + + In the **Operation** column of the mapping list, you can edit or delete the added mapping. + + .. note:: + + - If OBS permission control is not configured for a user and no AK and SK are configured, the permissions owned by the agency configured for the cluster in the **Object Storage Service (OBS)** project will be used to access OBS. + - Regardless of whether OBS permission control is configured, AK/SK permission is used for accessing OBS once it is configured. + - Security Administrator permission is required to modify, create, or delete a mapping. + - To enable mapping changes to take effect in spark-line, hive beeline and Presto respectively, you need to restart Spark, exit beeline and enter again, and restart Presto respectively. + +Component Access to OBS When OBS Permission Control Is Enabled +-------------------------------------------------------------- + +#. Log in to any node in a cluster as user **root** using the password set during cluster creation. + +#. Set environment variables (In MRS 3.x and later versions, the default installation path of the client is /opt/Bigdata/client. In MRS 3.x and earlier versions, the default installation path is /opt/client. 
For details, see the actual situation.). + + **source /opt/Bigdata/client/bigdata_env** + +#. If the Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If the Kerberos authentication is disabled for the current cluster, skip this step: + + **kinit** **MRS cluster user** + + Example: **kinit admin** + +#. If the Kerberos authentication is disabled for the current cluster, run the following commands to log in. Note that you should create a user that belongs to the **supergroup** group by referring to :ref:`Creating a User ` and replace *XXXX* with the username: + + **mkdir /home/XXXX** + + **chown XXXX /home/XXXX** + + **su - XXXX** + +#. Access OBS. You do not need to configure the AK, SK, and endpoint. The OBS path format is **obs://buck_name/**\ **XXX**. + + Example: **hadoop fs -ls "obs://obs-example/job/hadoop-mapreduce-examples-3.1.2.jar"** + + .. note:: + + - If you want to use **hadoop fs** to delete files on OBS, use **hadoop fs -rm -skipTrash** to delete the files. + - If data import is not involved when a table is created using spark-sql and spark-beeline, OBS will not be accessed. That is, if you create a table in an OBS directory on which you do not have permission, the **CREATE TABLE** operation will still be successful, but the error message "**403 AccessDeniedException**" is displayed when you insert data. + +.. _mrs_01_0632__section163381225399: + +Creating a Policy and an Agency on IAM +-------------------------------------- + +#. .. _mrs_01_0632__li20781191935317: + + Create a policy on IAM. + + a. Log in to the IAM console. + + b. Choose **Permissions**. On the displayed page, click **Create Custom Policy**. + + c. Set parameters according to :ref:`Table 2 `. Obtain the customized OBS policy samples that are frequently used by referring to . + + .. _mrs_01_0632__table4781201918533: + + .. table:: **Table 2** Policy parameters + + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=========================================================================================================================================================================================================================================================================================================================================================================+ + | Policy Name | Only letters, digits, spaces, and special characters (-_.,) are allowed. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scope | Select **Global services**, because OBS is a global service. 
| + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Policy View | Select **Visual editor**. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Policy Content | #. **Allow**: Select **Allow**. | + | | #. **Select service**: Select **Object Storage Service (OBS)**. | + | | #. **Select action**: Select **WriteOnly**, **ReadOnly**, and **ListOnly**. | + | | #. **Specific resources**: | + | | | + | | #. Set **object** to **Specify resource path**, click **Add Resource Path**, and enter *obs_bucket_name/*\ **tmp/** and *obs_bucket_name*\ **/tmp/\***. The **/tmp** directory is used as an example. If you need to add permissions for other directories, perform the following steps to add the directories and resource paths of all objects in the directories. | + | | #. Set **bucket** to **Specify resource path**, click **Add Resource Path**, and enter *obs_bucket_name*. | + | | | + | | #. (Optional) Add request condition, which does not need to be added currently. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | (Optional) Brief description about the policy. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + If the data write operation of each component is implemented in **rename** mode, the permission to delete objects must be configured when data is written. + + d. Click **OK** to save the policy. + +#. .. _mrs_01_0632__li1894924154514: + + Create an agency on IAM. + + a. Log in to the IAM console. + + b. Choose **Agencies**. On the displayed page, click **Create Agency**. + + c. Set parameters according to :ref:`Table 3 `. + + .. _mrs_01_0632__table4901145420452: + + .. 
table:: **Table 3** Agency parameters + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+============================================================================================================================================================================+ + | Agency Name | Only letters, digits, spaces, and special characters (-_.,) are allowed. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Agency Type | Select **Common account**. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Delegated Account | Enter your cloud account, that is, the account you register using your mobile phone number. It cannot be a federated user or an IAM user created using your cloud account. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Validity Period | Set this parameter as required. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | (Optional) Brief description about the agency. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Permissions | #. In the **Project [Region]** column, locate the row where **OBS** is, click **Attach Policy**. | + | | #. Select the policy created in :ref:`1 ` to display it in **Selected Policies**. | + | | #. Click **OK**. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + d. Click **OK** to save the agency. + + .. note:: + + If you modify an agency and policies bound to it after using the agency to access OBS, the modification will take effect within 15 minutes. + +.. |image1| image:: /_static/images/en-us_image_0000001349057773.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/configuring_users_to_access_resources_of_a_trusted_cluster.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/configuring_users_to_access_resources_of_a_trusted_cluster.rst new file mode 100644 index 0000000..8677323 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/configuring_users_to_access_resources_of_a_trusted_cluster.rst @@ -0,0 +1,66 @@ +:original_name: mrs_01_0355.html + +.. 
_mrs_01_0355: + +Configuring Users to Access Resources of a Trusted Cluster +========================================================== + +Scenario +-------- + +After cross-cluster mutual trust is configured, permissions must be configured for users in the local cluster so that these users can access the same resources in the peer cluster as the peer cluster's own users. + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Assigning User Permissions After Cross-Cluster Mutual Trust Is Configured `. + +Prerequisites +------------- + +The mutual trust relationship has been configured between two clusters (clusters A and B). The clients of the clusters have been updated. + +Procedure +--------- + +#. Log in to MRS Manager of cluster A and choose **System** > **Manage User**. Check whether cluster A has accounts that are the same as those of cluster B. + + - If yes, go to :ref:`2 `. + - If no, go to :ref:`3 `. + +#. .. _mrs_01_0355__l3a44878c05474ed09661e8c1b21018df: + + Click |image1| on the left side of the username to unfold the detailed user information. Check whether the user group and role to which the user belongs meet the service requirements. + + For example, user **admin** of cluster A has the permission to access and create files in the **/tmp** directory of cluster A. Then go to :ref:`4 `. + +#. .. _mrs_01_0355__l818dc8bf23ba4e23a260eb70945d47c2: + + Create the accounts in cluster A and bind the accounts to the user group and roles required by the services. Then go to :ref:`4 `. + +#. .. _mrs_01_0355__l9137cfca08bf4e63a5eb91f29a64a9c9: + + Choose **Service** > **HDFS** > **Instance**. Query the **OM IP Address** of **NameNode (Active)**. + +#. Log in to the client of cluster B. + + For example, if you have updated the client on the Master2 node, log in to the Master2 node to use the client. For details, see :ref:`Using an MRS Client `. + +#. Run the following command to access the **/tmp** directory of cluster A: + + **hdfs dfs -ls hdfs://192.168.6.159:9820/tmp** + + In the preceding command, **192.168.6.159** is the IP address of the active NameNode of cluster A, and **9820** is the default port for communication between the client and the NameNode. + + .. note:: + + For MRS 1.6.2 or earlier, the default port number is **25000**. For details, see :ref:`List of Open Source Component Ports `. + +#. Run the following command to create a file in the **/tmp** directory of cluster A: + + **hdfs dfs -touchz hdfs://192.168.6.159:9820/tmp/mrstest.txt** + + If you can query the **mrstest.txt** file in the **/tmp** directory of cluster A, the cross-cluster mutual trust relationship has been configured successfully. + +.. |image1| image:: /_static/images/en-us_image_0000001349137821.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/creating_a_role.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/creating_a_role.rst new file mode 100644 index 0000000..6190399 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/creating_a_role.rst @@ -0,0 +1,217 @@ +:original_name: mrs_01_0343.html + +.. _mrs_01_0343: + +Creating a Role +=============== + +Scenario +-------- + +This section describes how to create a role on Manager and use it to authorize and manage Manager and components.
+ +Up to 1000 roles can be created on Manager. + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Managing Roles `. + +Prerequisites +------------- + +- You have learned service requirements. +- You have obtained a cluster with Kerberos authentication enabled or a common cluster with the EIP function enabled. + +Procedure +--------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + +#. On MRS Manager, choose **System** > **Manage Role**. + +#. Click **Create Role** and fill in **Role Name** and **Description**. + + **Role Name** is mandatory and contains 3 to 30 characters. Only digits, letters, and underscores (_) are allowed. **Description** is optional. + +#. In **Permission**, set role permission. + + a. Click **Service Name** and select a name in **View Name**. + b. Select one or more permissions. + + .. note:: + + - The **Permission** parameter is optional. + - If you select **View Name** to set component permissions, you can enter a resource name in the **Search** box in the upper right corner and click |image1|. The search result is displayed. + - The search scope covers only directories with current permissions. You cannot search subdirectories. Search by keywords supports fuzzy match and is case-insensitive. Results of the next page can be searched. + + .. table:: **Table 1** Manager permission description + + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+========================================================================================================================================+ + | **Alarm** | Authorizes the Manager alarm function. You can select **View** to view alarms and **Management** to manage alarms. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **Audit** | Authorizes the Manager audit log function. You can select **View** to view audit logs and **Management** to manage audit logs. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **Dashboard** | Authorizes the Manager overview function. You can select **View** to view the cluster overview. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **Hosts** | Authorizes the node management function. You can select **View** to view node information and **Management** to manage nodes. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **Services** | Authorizes the service management function. You can select **View** to view service information and **Management** to manage services. 
| + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **System_cluster_management** | Authorizes the MRS cluster management function. You can select **Management** to use the MRS patch management function. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **System_configuration** | Authorizes the MRS cluster configuration function. You can select **Management** to configure MRS clusters on Manager. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **System_task** | Authorizes the MRS cluster task function. You can select **Management** to manage periodic tasks of MRS clusters on Manager. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **Tenant** | Authorizes the Manager multi-tenant management function. You can select **Management** to manage multi-tenants. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + + .. table:: **Table 2** HBase permission description + + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+===================================================================================================================+ + | **SUPER_USER_GROUP** | Grants you HBase administrator permissions. | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | **Global** | HBase resource type, indicating the whole HBase. | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | **Namespace** | HBase resource type, indicating namespace, which is used to store HBase tables. It has the following permissions: | + | | | + | | - **Admin** permission to manage the namespace | + | | - **Create**: permission to create HBase tables in the namespace | + | | - **Read**: permission to access the namespace | + | | - **Write**: permission to write data to the namespace | + | | - **Execute**: permission to execute the coprocessor (Endpoint) | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | **Table** | HBase resource type, indicating a data table, which is used to store data. 
It has the following permissions: | + | | | + | | - **Admin**: permission to manage a data table | + | | - **Create**: permission to create column families and columns in a data table | + | | - **Read**: permission to read a data table | + | | - **Write**: permission to write data to a data table | + | | - **Execute**: permission to execute the coprocessor (Endpoint) | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | **ColumnFamily** | HBase resource type, indicating a column family, which is used to store data. It has the following permissions: | + | | | + | | - **Create**: permission to create columns in a column family | + | | - **Read**: permission to read a column family | + | | - **Write**: permission to write data to a column family | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | **Qualifier** | HBase resource type, indicating a column, which is used to store data. It has the following permissions: | + | | | + | | - **Read**: permission to read a column | + | | - **Write**: permission to write data to a column | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + + By default, permissions of an HBase resource type of each level are shared by resource types of sub-levels. However, the **Recursive** option is not selected by default. For example, if **Read** and **Write** permissions are added to the **default** namespace, they are automatically added to the tables, column families, and columns in the namespace. If a child resource is set after the parent resource, the permission of the child resource is the union of the permissions of the parent resource and the current child resource. + + .. table:: **Table 3** HDFS permission description + + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+=====================================================================================================================================+ + | **Folder** | HDFS resource type, indicating an HDFS directory, which is used to store files or subdirectories. It has the following permissions: | + | | | + | | - **Read**: permission to access the HDFS directory | + | | - **Write**: permission to write data to the HDFS directory | + | | - **Execute**: permission to perform an operation. It must be selected when you add access or write permission. | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | **Files** | HDFS resource type, indicating a file in HDFS. It has the following permissions: | + | | | + | | - **Read**: permission to access the file | + | | - **Write**: permission to write data to the file | + | | - **Execute**: permission to perform an operation. It must be selected when you add access or write permission. 
| + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + + Permissions of an HDFS directory of each level are not shared by directory types of sub-levels by default. For example, if **Read** and **Execute** permissions are added to the **tmp** directory, you must select **Recursive** for permissions to be added to subdirectories. + + .. table:: **Table 4** Hive permission description + + +-------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+=======================================================================================================================+ + | **Hive Admin Privilege** | Grants you Hive administrator permissions. | + +-------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | **Database** | Hive resource type, indicating a Hive database, which is used to store Hive tables. It has the following permissions: | + | | | + | | - **Select**: permission to query the Hive database | + | | - **Delete**: permission to perform the deletion operation in the Hive database | + | | - **Insert**: permission to perform the insertion operation in the Hive database | + | | - **Create**: permission to perform the creation operation in the Hive database | + +-------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | **Table** | Hive resource type, indicating a Hive table, which is used to store data. It has the following permissions: | + | | | + | | - **Select**: permission to query the Hive table | + | | - **Delete**: permission to perform the deletion operation in the Hive table | + | | - **Update**: permission to perform the update operation in the Hive table | + | | - **Insert**: permission to perform the insertion operation in the Hive table | + | | - **Grant of Select**: permission to grant the **Select** permission to other users using Hive statements | + | | - **Grant of Delete**: permission to grant the **Delete** permission to other users using Hive statements | + | | - **Grant of Update**: permission to grant the **Update** permission to other users using Hive statements | + | | - **Grant of Insert**: permission to grant the **Insert** permission to other users using Hive statements | + +-------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + + By default, permissions of a Hive resource type of each level are shared by resource types of sub-levels. However, the **Recursive** option is not selected by default. For example, if **Select** and **Insert** permissions are added to the **default** database, they are automatically added to the tables and columns in the database. If a child resource is set after the parent resource, the permission of the child resource is the union of the permissions of the parent resource and the current child resource. + + .. 
table:: **Table 5** Yarn permission description + + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+==================================================================================================================================================+ + | **Cluster Admin Operations** | Grants you Yarn administrator permissions. | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + | **root** | Root queue of Yarn. It has the following permissions: | + | | | + | | - **Submit**: permission to submit jobs in the queue | + | | - **Admin**: permission to manage permissions of the current queue | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Parent Queue** | Yarn resource type, indicating a parent queue containing sub-queues. A root queue is a type of a parent queue. It has the following permissions: | + | | | + | | - **Submit**: permission to submit jobs in the queue | + | | - **Admin**: permission to manage permissions of the current queue | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Leaf Queue** | Yarn resource type, indicating a leaf queue. It has the following permissions: | + | | | + | | - **Submit**: permission to submit jobs in the queue | + | | - **Admin**: permission to manage permissions of the current queue | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + + By default, permissions of a Yarn resource type of each level are shared by resource types of sub-levels. However, the **Recursive** option is not selected by default. For example, if the **Submit** permission is added to the **root** queue, it is automatically added to the sub-queue. Permissions inherited by sub-queues will not be displayed as selected in the **Permission** table. If a child resource is set after the parent resource, the permission of the child resource is the union of the permissions of the parent resource and the current child resource. + + .. table:: **Table 6** Hue permission description + + +-------------------------------------------+------------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+======================================================+ + | **Storage Policy Admin** | Grants you storage policy administrator permissions. | + +-------------------------------------------+------------------------------------------------------+ + +#. Click **OK**. Return to **Manage Role**. + +Related Tasks +------------- + +**Modifying a role** + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage Role**. +#. In the row of the role to be modified, click **Modify** to modify role information. + + .. 
note:: + + If you modify permissions assigned by the role, it takes about 3 minutes for the new configurations to take effect. + +#. Click **OK**. The modification is complete. + +**Deleting a role** + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage Role**. +#. In the row of the role to be deleted, click **Delete**. +#. Click **OK**. The role is deleted. + +.. |image1| image:: /_static/images/en-us_image_0000001349057965.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/creating_a_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/creating_a_user.rst new file mode 100644 index 0000000..c147c7a --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/creating_a_user.rst @@ -0,0 +1,79 @@ +:original_name: mrs_01_0345.html + +.. _mrs_01_0345: + +Creating a User +=============== + +Scenario +-------- + +This section describes how to create users on Manager based on site requirements and specify their operation permissions to meet service requirements. + +Up to 1000 users can be created on Manager. + +If a new user's password must comply with a new password policy, follow the instructions in :ref:`Modifying a Password Policy ` to modify the password policy first, and then perform the following operations to create the user. + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Creating a User `. + +Prerequisites +------------- + +- Administrators have learned service requirements and created roles and role groups required by service scenarios. +- You have obtained a cluster with Kerberos authentication enabled or a common cluster with the EIP function enabled. + +Procedure +--------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager (MRS 2.1.0 or Earlier) `. + +#. On MRS Manager, click **System**. + +#. In the **Permission** area, click **Manage User**. + +#. Above the user list, click **Create User**. + +#. Configure parameters as prompted and enter a username in **Username**. + + .. note:: + + - A username that differs only in alphabetic case from an existing username is not allowed. For example, if **User1** has been created, you cannot create **user1**. + - When you use the created user, enter the username exactly as it was created; usernames are case-sensitive. + - **Username** is mandatory and contains 3 to 20 characters. Only digits, letters, and underscores (_) are allowed. + - **root**, **omm**, and **ommdba** are reserved system usernames. Select another username. + +#. Set **User Type** to either **Human-machine** or **Machine-machine**. + + - **Human-machine** user: used for MRS Manager O&M scenarios and component client operation scenarios. If you select this user type, you need to enter a password and confirm the password in **Password** and **Confirm Password** accordingly. + - **Machine-machine** user: used for MRS application development scenarios. If you select this user type, you do not need to enter a password, because the password is randomly generated. + +#. In **User Group**, click **Select and Join User Group** to select user groups and add users to them. + + .. note:: + + - If roles have been added to user groups, the users are granted the permissions of those roles.
+ - If you want to grant new users with Hive permissions, add the users to the Hive group. + - If a user needs to manage tenant resources, the user group must be assigned the **Manager_tenant** role and the role corresponding to the tenant. + - Users created on Manager cannot be added to the user group synchronized using the IAM user synchronization function. + +#. In **Primary Group**, select a group as the primary group for users to create directories and files. The drop-down list contains all groups selected in **User Group**. + +#. In **Assign Rights by Role**, click **Select and Add Role** to add roles for users based on onsite service requirements. + + .. note:: + + - When you create a user, if permissions of a user group that is granted to the user cannot meet service requirements, you can assign other created roles to the user. It takes 3 minutes to make role permissions granted to the new user take effect. + - Adding a role when you create a user can specify the user rights. + - A new user can access web UIs of HDFS, HBase, Yarn, Spark, and Hue even when roles are not assigned to the user. + +#. In **Description**, provide description based on onsite service requirements. + + **Description** is optional. + +#. Click **OK**. + + If a new user is used in the MRS cluster for the first time, for example, used for logging in to MRS Manager or using the cluster client, the password must be changed. For details, see :ref:`Changing the Password of an Operation User `. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/creating_a_user_group.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/creating_a_user_group.rst new file mode 100644 index 0000000..ce9e41d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/creating_a_user_group.rst @@ -0,0 +1,70 @@ +:original_name: mrs_01_0344.html + +.. _mrs_01_0344: + +Creating a User Group +===================== + +Scenario +-------- + +This section describes how to create user groups and specify their operation permissions on Manager. Management of single or multiple users can be unified in the user groups. After being added to a user group, users can obtain operation permissions owned by the user group. + +Manager supports a maximum of 100 user groups. + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Managing User Groups `. + +Prerequisites +------------- + +- Administrators have learned service requirements and created roles required by service scenarios. +- You have obtained a cluster with Kerberos authentication enabled or a common cluster with the EIP function enabled. + +Procedure +--------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + +#. On MRS Manager, click **System**. + +#. In the **Permission** area, click **Manage User Group**. + +#. Above the user group list, click **Create User Group**. + +#. Input **Group Name** and **Description**. + + **Group Name** is mandatory and contains 3 to 20 characters. Only digits, letters, and underscores (_) are allowed. **Description** is optional. + +#. In **Role**, click **Select and Add Role** to select and add specified roles. 
+ + If you do not add the roles, the user group you are creating now does not have the permission to use MRS clusters. + +#. Click **OK**. + +.. _mrs_01_0344__s855da92cb75446818be082dff6e197f1: + +Related Tasks +------------- + +**Modifying a user group** + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User Group**. +#. In the row of a user group to be modified, click **Modify**. + + .. note:: + + If you change role permissions assigned to the user group, it takes 3 minutes to make new configurations take effect. + +#. Click **OK**. The modification is complete. + +**Deleting a user group** + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User Group**. +#. In the row of the user group to be deleted, click **Delete**. +#. Click **OK**. The user group is deleted. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/default_users_of_clusters_with_kerberos_authentication_enabled.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/default_users_of_clusters_with_kerberos_authentication_enabled.rst new file mode 100644 index 0000000..3d8a497 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/default_users_of_clusters_with_kerberos_authentication_enabled.rst @@ -0,0 +1,186 @@ +:original_name: mrs_01_0342.html + +.. _mrs_01_0342: + +Default Users of Clusters with Kerberos Authentication Enabled +============================================================== + +User Classification +------------------- + +The MRS cluster provides the following three types of users. Users are advised to periodically change the passwords. It is not recommended to use the default passwords. + ++-----------------------------------+-------------------------------------------------------------------------------------------------------------------+ +| User Type | Description | ++===================================+===================================================================================================================+ +| System user | - User created on Manager for MRS cluster O&M and service scenarios. There are two types of users: | +| | | +| | - **Human-machine** user: used for Manager O&M scenarios and component client operation scenarios. | +| | - **Machine-machine** user: used for MRS cluster application development scenarios. | +| | | +| | - User who runs OMS processes. | ++-----------------------------------+-------------------------------------------------------------------------------------------------------------------+ +| Internal system user | Internal user who performs process communications, saves user group information, and associates user permissions. | ++-----------------------------------+-------------------------------------------------------------------------------------------------------------------+ +| Database user | - User who manages OMS database and accesses data. | +| | - User who runs the database of service components (Hive, Hue, Loader, and DBService) | ++-----------------------------------+-------------------------------------------------------------------------------------------------------------------+ + +System User +----------- + +.. note:: + + - User **Idap** of the OS is required in the MRS cluster. Do not delete this account. Otherwise, the cluster may not work properly. 
Password management policies are maintained by the operation users. + - Reset the passwords when you change the passwords of user **ommdba** and user **omm** for the first time. Change the passwords periodically after retrieving them. + ++-----------------------------------------+-----------------+----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+ +| Type | Username | Initial Password | Description | ++=========================================+=================+====================================================+==========================================================================================================================================+ +| System administrator of the MRS cluster | admin | Specified by the user during the cluster creation. | Manager administrator | +| | | | | +| | | | with the following permissions: | +| | | | | +| | | | - Common HDFS and ZooKeeper user permissions. | +| | | | - Permissions to submit and query MapReduce and Yarn tasks, manage Yarn queues, and access the Yarn web UI. | +| | | | - Permissions to submit, query, activate, deactivate, reassign, delete topologies, and operate all topologies of the Storm service. | +| | | | - Permissions to create, delete, authorize, reassign, consume, write, and query topics of the Kafka service. | ++-----------------------------------------+-----------------+----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+ +| MRS cluster node OS user | omm | Randomly generated by the system. | Internal running user of the MRS cluster system. This user is an OS user generated on all nodes and does not require a unified password. | ++-----------------------------------------+-----------------+----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+ +| MRS cluster node OS user | root | Set by the user. | User for logging in to the node in the MRS cluster. This user is an OS user generated on all nodes. | ++-----------------------------------------+-----------------+----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+ + +Internal System Users +--------------------- + +.. note:: + + Do not delete the following internal system users. Otherwise, the cluster or components may not work properly. + ++------------------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------+ +| Type | Default User | Initial Password | Description | ++========================+=================+==================+===============================================================================================================================+ +| Component running user | hdfs | Hdfs@123 | This user is the HDFS system administrator and has the following permissions: | +| | | | | +| | | | #. File system operation permissions: | +| | | | | +| | | | - Views, modifies, and creates files. | +| | | | - Views and creates directories. | +| | | | - Views and modifies the groups where files belong. 
| +| | | | - Views and sets disk quotas for users. | +| | | | | +| | | | #. HDFS management operation permissions: | +| | | | | +| | | | - Views the web UI status. | +| | | | - Views and sets the active and standby HDFS status. | +| | | | - Enters and exits the HDFS in security mode. | +| | | | - Checks the HDFS file system. | ++------------------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------+ +| | hbase | Hbase@123 | This user is the HBase system administrator and has the following permissions: | +| | | | | +| | | | - Cluster management permission: **Enable** and **Disable** operations on tables to trigger MajorCompact and ACL operations. | +| | | | - Grants and revokes permissions, and shuts down the cluster. | +| | | | - Table management permission: Creates, modifies, and deletes tables. | +| | | | - Data management permission: Reads and writes data in tables, column families, and columns. | +| | | | - Accesses the HBase web UI. | ++------------------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------+ +| | mapred | Mapred@123 | This user is the MapReduce system administrator and has the following permissions: | +| | | | | +| | | | - Submits, stops, and views the MapReduce tasks. | +| | | | - Modifies the Yarn configuration parameters. | +| | | | - Accesses the Yarn and MapReduce web UI. | ++------------------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------+ +| | spark | Spark@123 | This user is the Spark system administrator and has the following permissions: | +| | | | | +| | | | - Accesses the Spark web UI. | +| | | | - Submits Spark tasks. | ++------------------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------+ + +User Group Information +---------------------- + ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Default User Group | Description | ++=======================+================================================================================================================================================================================================================================+ +| hadoop | Users added to this user group have the permission to submit tasks to all Yarn queues. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| hbase | Common user group. Users added to this user group will not have any additional permission. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| hive | Users added to this user group can use Hive. 
| ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| spark | Common user group. Users added to this user group will not have any additional permission. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| supergroup | Users added to this user group can have the administrator permission of HBase, HDFS, and Yarn and can use Hive. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| check_sec_ldap | Used to test whether the active LDAP works properly. This user group is generated randomly in a test and automatically deleted after the test is complete. This is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_tenant | Tenant system user group, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| System_administrator | MRS cluster system administrator group, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_viewer | MRS Manager system viewer group, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_operator | MRS Manager system operator group, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_auditor | MRS Manager system auditor group, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_administrator | MRS Manager system administrator group, which is an internal system user group used only between components. 
| ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| compcommon | Internal system group for accessing public resources in a cluster. All system users and system running users are added to this user group by default. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| default_1000 | User group created for tenants, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| kafka | Kafka common user group. Users added to this group need to be granted with read and write permission by users in the **kafkaadmin** group before accessing the desired topics. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| kafkasuperuser | Users added to this group have permissions to read data from and write data to all topics. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| kafkaadmin | Kafka administrator group. Users added to this group have the permissions to create, delete, authorize, as well as read from and write data to all topics. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| storm | Storm common user group. Users added to this group have the permissions to submit topologies and manage their own topologies. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| stormadmin | Storm administrator user group. Users added to this group have the permissions to submit topologies and manage their own topologies. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| opentsdb | Common user group. Users added to this user group will not have any additional permission. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| presto | Common user group. Users added to this user group will not have any additional permission. 
| ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| flume | Common user group. Users added to this user group will not have any additional permission. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| launcher-job | MRS internal group, which is used to submit jobs using V2 APIs. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + ++---------------+----------------------------------------------------------------------------------------------------------------------------------+ +| OS User Group | Description | ++===============+==================================================================================================================================+ +| wheel | Primary group of MRS internal running user **omm**. | ++---------------+----------------------------------------------------------------------------------------------------------------------------------+ +| ficommon | MRS cluster common group that corresponds to **compcommon** for accessing public resource files stored in the OS of the cluster. | ++---------------+----------------------------------------------------------------------------------------------------------------------------------+ + +Database User +------------- + +MRS cluster system database users include OMS database users and DBService database users. + +.. note:: + + Do not delete database users. Otherwise, the cluster or components may not work properly. + ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| Type | Default User | Initial Password | Description | ++====================+==============+===================+========================================================================================================================+ +| OMS database | ommdba | dbChangeMe@123456 | OMS database administrator who performs maintenance operations, such as creating, starting, and stopping applications. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| | omm | ChangeMe@123456 | User for accessing OMS database data. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| DBService database | omm | dbserverAdmin@123 | Administrator of the GaussDB database in the DBService component. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| | hive | HiveUser@ | User for Hive to connect to the DBService database. 
| ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| | hue | HueUser@123 | User for Hue to connect to the DBService database. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| | sqoop | SqoopUser@ | User for Loader to connect to the DBService database. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| | ranger | RangerUser@ | User for Ranger to connect to the DBService database. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/deleting_a_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/deleting_a_user.rst new file mode 100644 index 0000000..36cae31 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/deleting_a_user.rst @@ -0,0 +1,31 @@ +:original_name: mrs_01_0349.html + +.. _mrs_01_0349: + +**Deleting a User** +=================== + +The administrator can delete an MRS cluster user that is not required on MRS Manager. Deleting a user is allowed only in clusters with Kerberos authentication enabled or normal clusters with the EIP function enabled. + +.. note:: + + If you want to create a new user with the same name as user A after deleting user A who has submitted a job on the client or MRS console, you need to delete user A's residual folders when deleting user A. Otherwise, the newly created user A may fail to submit a job. + + To delete residual folders, log in to each Core node in the MRS cluster and run the following commands. In the following commands, **$user** indicates the folder named after the username. + + **cd /srv/BigData/hadoop/data1/nm/localdir/usercache/** + + **rm -rf $user** + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Deleting a User `. + +Procedure +--------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User**. +#. In the row that contains the user to be deleted, choose **More** > **Delete**. +#. Click **OK**. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/downloading_a_user_authentication_file.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/downloading_a_user_authentication_file.rst new file mode 100644 index 0000000..59d817b --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/downloading_a_user_authentication_file.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_0352.html + +.. 
_mrs_01_0352: + +Downloading a User Authentication File +====================================== + +Scenario +-------- + +When a user develops big data applications and runs them in an MRS cluster that supports Kerberos authentication, the user needs to prepare a **Machine-machine** user authentication file for accessing the MRS cluster. The keytab file in the authentication file can be used for user authentication. + +This section describes how to download a **Machine-machine** user authentication file and export the keytab file on Manager. This operation is supported only in clusters with Kerberos authentication enabled or common clusters with the EIP function enabled. + +.. note:: + + Before downloading a **Human-machine** user authentication file, change the password for the user on MRS Manager to make the initial password set by the administrator invalid. Otherwise, the exported keytab file cannot be used. For details, see :ref:`Changing the Password of an Operation User `. + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Exporting an Authentication Credential File `. + +Procedure +--------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User**. +#. In the row of a user for whom you want to export the keytab file, choose **More** > **Download authentication credential** to download the authentication file. After the file is automatically generated, save it to a specified path and keep it secure. +#. Open the authentication file with a decompression program. + + - **user.keytab** indicates a user keytab file used for user authentication. + - **krb5.conf** indicates the configuration file of the authentication server. The application connects to the authentication server according to this configuration file information when authenticating users. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/index.rst new file mode 100644 index 0000000..6175e5c --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/index.rst @@ -0,0 +1,44 @@ +:original_name: mrs_01_0340.html + +.. _mrs_01_0340: + +MRS Multi-User Permission Management +==================================== + +- :ref:`Users and Permissions of MRS Clusters ` +- :ref:`Default Users of Clusters with Kerberos Authentication Enabled ` +- :ref:`Creating a Role ` +- :ref:`Creating a User Group ` +- :ref:`Creating a User ` +- :ref:`Modifying User Information ` +- :ref:`Locking a User ` +- :ref:`Unlocking a User ` +- :ref:`Deleting a User ` +- :ref:`Changing the Password of an Operation User ` +- :ref:`Initializing the Password of a System User ` +- :ref:`Downloading a User Authentication File ` +- :ref:`Modifying a Password Policy ` +- :ref:`Configuring Cross-Cluster Mutual Trust Relationships ` +- :ref:`Configuring Users to Access Resources of a Trusted Cluster ` +- :ref:`Configuring Fine-Grained Permissions for MRS Multi-User Access to OBS ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + users_and_permissions_of_mrs_clusters + default_users_of_clusters_with_kerberos_authentication_enabled + creating_a_role + creating_a_user_group + creating_a_user + modifying_user_information + locking_a_user + unlocking_a_user + deleting_a_user + changing_the_password_of_an_operation_user + initializing_the_password_of_a_system_user + downloading_a_user_authentication_file + modifying_a_password_policy + configuring_cross-cluster_mutual_trust_relationships + configuring_users_to_access_resources_of_a_trusted_cluster + configuring_fine-grained_permissions_for_mrs_multi-user_access_to_obs diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/initializing_the_password_of_a_system_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/initializing_the_password_of_a_system_user.rst new file mode 100644 index 0000000..d2cb0ff --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/initializing_the_password_of_a_system_user.rst @@ -0,0 +1,78 @@ +:original_name: mrs_01_0351.html + +.. _mrs_01_0351: + +Initializing the Password of a System User +========================================== + +Scenario +-------- + +This section describes how to initialize a password on Manager if a user forgets the password or the password of a public account needs to be changed regularly. After password initialization, the user must change the password upon the first login. This operation is supported only in clusters with Kerberos authentication enabled or common clusters with the EIP function enabled. + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Initializing a Password `. + +Impact on the System +-------------------- + +If you have downloaded a user authentication file, download it again and obtain the keytab file after initializing the password of the MRS cluster user. + +Initializing the Password of a Human-Machine User +------------------------------------------------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + +#. On MRS Manager, click **System**. + +#. In the **Permission** area, click **Manage User**. + +#. Locate the row that contains the user whose password is to be initialized, choose **More** > **Initialize password**, and change the password as prompted. + + In the window that is displayed, enter the password of the current administrator account and click **OK**. Then in **Initialize password**, click **OK**. + + For the cluster, the default password complexity requirements are as follows: + + - The password must contain 8 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. + +Initializing the Password of a Machine-Machine User +--------------------------------------------------- + +#. Prepare a client based on service conditions and log in to the node with the client installed. + +#. Run the following command to switch the user: + + **sudo su - omm** + +#. 
Run the following command to switch to the client directory, for example, **/opt/Bigdata/client**: + + **cd /opt/Bigdata/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Run the following command to log in to the console as user **kadmin/admin**: + + .. note:: + + The default password of user **kadmin/admin** is **KAdmin@123**, which will expire upon your first login. Change the password as prompted and keep the new password secure. + + **kadmin -p kadmin/admin** + +#. Run the following command to reset the password of a component running user. This operation takes effect on all servers: + + **cpw** *Component running user name* + + For example, **cpw oms/manager**. + + For the cluster, the default password complexity requirements are as follows: + + - The password must contain 8 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/locking_a_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/locking_a_user.rst new file mode 100644 index 0000000..6e65b72 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/locking_a_user.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_0347.html + +.. _mrs_01_0347: + +Locking a User +============== + +This section describes how to lock users in MRS clusters. A locked user cannot log in to Manager or perform security authentication in the cluster. This operation is supported only in clusters with Kerberos authentication enabled or common clusters with the EIP function enabled. + +A locked user can be unlocked manually by an administrator or automatically after the lock duration expires. You can lock a user by using either of the following methods: + +- Automatic lock: Set **Number of Password Retries** in **Configure Password Policy**. If the number of consecutive incorrect password attempts exceeds the parameter value, the user is automatically locked. For details, see :ref:`Modifying a Password Policy `. +- Manual lock: The administrator manually locks a user. + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Locking a User `. + +The following describes how to manually lock a user. **Machine-machine** users cannot be locked. + +Procedure +--------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager (MRS 2.1.0 or Earlier) `. +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User**. +#. In the row of a user you want to lock, click **Lock User**. +#. In the window that is displayed, click **OK** to lock the user.
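+
+.. note::
+
+   The following check is not part of the official procedure; it is a minimal sketch that assumes a cluster client is installed in **/opt/Bigdata/client** (the example client path used elsewhere in this guide) and uses **testuser** as an example username. Because a locked user cannot perform security authentication in the cluster, a Kerberos authentication attempt for that user is expected to fail until the user is unlocked:
+
+   .. code-block:: bash
+
+      # Load the cluster client environment variables (example client path).
+      source /opt/Bigdata/client/bigdata_env
+
+      # Attempt to obtain a Kerberos ticket for the example user.
+      # While the user is locked, this is expected to fail with an
+      # authentication error; after the user is unlocked, it succeeds
+      # once the correct password is entered.
+      kinit testuser
+
+      # List the tickets currently held to confirm the result.
+      klist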
diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/modifying_a_password_policy.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/modifying_a_password_policy.rst new file mode 100644 index 0000000..e494387 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/modifying_a_password_policy.rst @@ -0,0 +1,56 @@ +:original_name: mrs_01_0353.html + +.. _mrs_01_0353: + +Modifying a Password Policy +=========================== + +Scenario +-------- + +.. important:: + + Because password policies are critical to the user management security, modify them based on service security requirements. Otherwise, security risks may be incurred. + +This section describes how to set password and user login security rules as well as user lock rules. Password policies set on MRS Manager take effect for **Human-machine** users only, because the passwords of **Machine-machine** users are randomly generated. This operation is supported only in clusters with Kerberos authentication enabled or common clusters with the EIP function enabled. + +If a new password policy needs to be used for a new user's password or the password modified by the user, perform the following operations to modify the password policy first, and then follow instructions in :ref:`Creating a User ` or :ref:`Changing the Password of an Operation User `. + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Configuring Password Policies `. + +Procedure +--------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + +#. On MRS Manager, click **System**. + +#. Click **Configure Password Policy**. + +#. Modify password policies as prompted. For parameter details, see :ref:`Table 1 `. + + .. _mrs_01_0353__te7724a101fe349699fb86ccf4559e6a8: + + .. 
table:: **Table 1** Password policy parameter description + + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +==============================================================+================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | **Minimum Password Length** | Indicates the minimum number of characters a password contains. The value ranges from 8 to 32. The default value is **8**. | + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Number of Character Types** | Indicates the minimum number of character types a password contains. The character types include uppercase letters, lowercase letters, digits, spaces, and special characters (:literal:`~`!?,.:;-_'(){}[]/<>@#$%^&*+|\\=`). The value can be **3** or **4**. The default value **3** indicates that the password must contain at least three types of the following characters: uppercase letters, lowercase letters, digits, special characters, and spaces. | + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Password Validity Period (days)** | Indicates the validity period (days) of a password. 
The value ranges from 0 to 90. Value **0** means that the password is permanently valid. The default value is **90**. | + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Password Expiration Notification Days** | Indicates the number of days to notify password expiration in advance. After the value is set, if the difference between the cluster time and the password expiration time is smaller than this value, the user receives password expiration notifications. When a user logs in to MRS Manager, a message is displayed, indicating that the password is about to expire and asking the user whether to change the password. The value ranges from **0** to *X* (*X* must be set to the half of the password validity period and rounded down). Value **0** indicates that no notification is sent. The default value is **5**. | + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Interval of Resetting Authentication Failure Count (min)** | Indicates the interval (minutes) of retaining incorrect password attempts. The value ranges from 0 to 1440. Value **0** indicates that the number of incorrect password attempts are permanently retained and value **1440** indicates that the number of incorrect password attempts are retained for one day. The default value is **5**. | + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Number of Password Retries** | Indicates the number of consecutive wrong passwords allowed before the system locks the user. The value ranges from 3 to 30. The default value is **5**. 
| + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Account Lock Duration (min)** | Indicates the time period for which a user is locked when the user lockout conditions are met. The value ranges from 5 to 120. The default value is **5**. | + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/modifying_user_information.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/modifying_user_information.rst new file mode 100644 index 0000000..81e45bb --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/modifying_user_information.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_0346.html + +.. _mrs_01_0346: + +Modifying User Information +========================== + +Scenario +-------- + +This section describes how to modify user information on Manager, including information about the user group, primary group, role, and description. + +This operation is supported only in clusters with Kerberos authentication enabled or common clusters with the EIP function enabled. + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Modifying User Information `. + +Procedure +--------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User**. +#. In the row of a user to be modified, click **Modify**. + + .. note:: + + If you change user groups for a user or assign role permissions to a user, it takes 3 minutes to make new configurations take effect. + +#. Click **OK**. The modification is complete. 
diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/unlocking_a_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/unlocking_a_user.rst new file mode 100644 index 0000000..df4fdeb --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/unlocking_a_user.rst @@ -0,0 +1,23 @@ +:original_name: mrs_01_0348.html + +.. _mrs_01_0348: + +Unlocking a User +================ + +If a user is locked because the number of login attempts exceeds the value of **Number of Password Retries**, or the user is manually locked by the administrator, the administrator can unlock the user on Manager. This operation is supported only in clusters with Kerberos authentication enabled or common clusters with the EIP function enabled. + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + + For clusters of **MRS 3.\ x** or later, see :ref:`Unlocking a User `. + +Procedure +--------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User**. +#. In the row of a user to be unlocked, click **Unlock User**. +#. In the window that is displayed, click **OK** to unlock the user. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/users_and_permissions_of_mrs_clusters.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/users_and_permissions_of_mrs_clusters.rst new file mode 100644 index 0000000..c6e0864 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/mrs_multi-user_permission_management/users_and_permissions_of_mrs_clusters.rst @@ -0,0 +1,159 @@ +:original_name: mrs_01_0341.html + +.. _mrs_01_0341: + +Users and Permissions of MRS Clusters +===================================== + +Overview +-------- + +- **MRS Cluster Users** + + Indicate the security accounts of Manager, including usernames and passwords. These accounts are used to access resources in MRS clusters. Each MRS cluster in which Kerberos authentication is enabled can have multiple users. + +- **MRS Cluster Roles** + + Before using resources in an MRS cluster, users must obtain the access permission which is defined by MRS cluster objects. A cluster role is a set of one or more permissions. For example, the permission to access a directory in HDFS needs to be configured in the specified directory and saved in a role. + +Manager provides the user permission management function for MRS clusters, facilitating permission and user management. + +- Permission management: adopts the role-based access control (RBAC) mode. In this mode, permissions are granted by role to form a permission set. After one or more roles are allocated to a user, the user can obtain the permissions of the roles. +- User management: uses MRS Manager to uniformly manage users, adopts the Kerberos protocol for user identity verification, and employs Lightweight Directory Access Protocol (LDAP) to store user information. 
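+
+For example, in a cluster with Kerberos authentication enabled, a user created on Manager typically authenticates with Kerberos on a node where the cluster client is installed before running component commands. The following minimal sketch only illustrates this flow; it assumes the client is installed in **/opt/Bigdata/client** (an example path) and uses **testuser** as an example username:
+
+.. code-block:: bash
+
+   # Load the cluster client environment variables (example client path).
+   source /opt/Bigdata/client/bigdata_env
+
+   # Kerberos verifies the user's identity; the user and group information
+   # that determines the allowed operations is stored in LDAP.
+   kinit testuser
+
+   # After authentication, component commands run with the permissions of
+   # the roles and user groups bound to the user, for example:
+   hdfs dfs -ls /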
+ +Permission Management +--------------------- + +Permissions provided by MRS clusters include the O&M permissions of Manager and components (such as HDFS, HBase, Hive, and Yarn). In actual application, permissions must be assigned to each user based on service scenarios. To facilitate permission management, Manager introduces the role function to allow administrators to select and assign specified permissions. Permissions are centrally viewed and managed in permission sets, enhancing user experience. + +A role is a logical entity that contains one or more permissions. Permissions are assigned to roles, and users can be granted the permissions by obtaining the roles. + +A role can have multiple permissions, and a user can be bound to multiple roles. + +- Role 1: is assigned operation permissions A and B. After role 1 is allocated to users a and b, users a and b can obtain operation permissions A and B. +- Role 2: is assigned operation permission C. After role 2 is allocated to users c and d, users c and d can obtain operation permission C. +- Role 3: is assigned operation permissions D and F. After role 3 is allocated to user a, user a can obtain operation permissions D and F. + +For example, if an MRS user is bound to the administrator role, the user becomes an administrator of the MRS cluster. + +:ref:`Table 1 ` lists the roles that are created by default on Manager. + +.. _mrs_01_0341__te7b79730a8de49dc9a52be8196677697: + +.. table:: **Table 1** Default roles and description + + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------+ + | Default Role | Description | + +=======================+=================================================================================================================================+ + | default | Tenant role | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------+ + | Manager_administrator | Manager administrator: This role has the permission to manage MRS Manager. | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------+ + | Manager_auditor | Manager auditor: This role has the permission to view and manage auditing information. | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------+ + | Manager_operator | Manager operator: This role has all permissions except tenant, configuration, and cluster management permissions. | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------+ + | Manager_viewer | Manager viewer: This role has the permission to view the information about systems, services, hosts, alarms, and auditing logs. | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------+ + | System_administrator | System administrator: This role has the permissions of Manager administrators and all service administrators. 
| + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------+ + | Manager_tenant | Manager tenant viewer: This role has the permission to view information on the **Tenant** page on MRS Manager. | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------+ + +When creating a role on Manager, you can perform rights management for Manager and components, as shown in :ref:`Table 2 `. + +.. _mrs_01_0341__t1eebab7f372e49fbb70f1802069f1001: + +.. table:: **Table 2** Manager and component permission management + + +-----------------------------------+-----------------------------------------------------------------------------------------------+ + | Permission | Description | + +===================================+===============================================================================================+ + | Manager | Manager access and login permission. | + +-----------------------------------+-----------------------------------------------------------------------------------------------+ + | HBase | HBase administrator permission and permission for accessing HBase tables and column families. | + +-----------------------------------+-----------------------------------------------------------------------------------------------+ + | HDFS | HDFS directory and file permission. | + +-----------------------------------+-----------------------------------------------------------------------------------------------+ + | Hive | - Hive Admin Privilege | + | | | + | | Hive administrator permission. | + | | | + | | - Hive Read Write Privileges | + | | | + | | Hive data table management permission to set and manage the data of created tables. | + +-----------------------------------+-----------------------------------------------------------------------------------------------+ + | Hue | Storage policy administrator permissions. | + +-----------------------------------+-----------------------------------------------------------------------------------------------+ + | Yarn | - Cluster Admin Operations | + | | | + | | Yarn administrator permission. | + | | | + | | - Scheduler Queue | + | | | + | | Queue resource management permission. | + +-----------------------------------+-----------------------------------------------------------------------------------------------+ + +User Management +--------------- + +MRS clusters that support Kerberos authentication use the Kerberos protocol and LDAP for user management. + +- Kerberos verifies the identity of the user when a user logs in to Manager or uses a component client. Identity verification is not required for clusters with Kerberos authentication disabled. +- LDAP is used to store user information, including user records, user group information, and permission information. + +MRS clusters can automatically update Kerberos and LDAP user data when users are created or modified on Manager. They can also automatically perform user identity verification and authentication and obtain user information when a user logs in to Manager or uses a component client. This ensures the security of user management and simplifies the user management tasks. Manager also provides the user group function for managing one or multiple users by type: + +- A user group is a set of users, which can be used to manage users by type. 
Users in the system can exist independently or in a user group. +- After a user is added to a user group to which roles are allocated, the role permission of the user group is assigned to the user. + +:ref:`Table 3 ` lists the user groups that are created by default on MRS Manager in MRS 3.x or earlier. + +For details about the default user groups displayed on FusionInsight Manager of MRS 3.\ *x* or later, see :ref:`User group `. + +.. _mrs_01_0341__td676ae12a3a64c008ec055b498a52d78: + +.. table:: **Table 3** Default user groups and description + + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | User Group | Description | + +================+================================================================================================================================================================================+ + | hadoop | Users added to this user group have the permission to submit tasks to all Yarn queues. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase | Common user group. Users added to this user group will not have any additional permission. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hive | Users added to this user group can use Hive. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark | Common user group. Users added to this user group will not have any additional permission. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | supergroup | Users added to this user group can have the administrator permission of HBase, HDFS, and Yarn and can use Hive. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | flume | Common user group. Users added to this user group will not have any additional permission. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka | Kafka common user group. Users added to this group need to be granted with read and write permission by users in the **kafkaadmin** group before accessing the desired topics. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafkasuperuser | Users added to this group have permissions to read data from and write data to all topics. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafkaadmin | Kafka administrator group. 
Users added to this group have the permissions to create, delete, authorize, as well as read from and write data to all topics. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | storm | Storm common user group. Users added to this group have the permissions to submit topologies and manage their own topologies. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | stormadmin | Storm administrator user group. Users added to this group have the permissions to submit topologies and manage their own topologies. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +User **admin** is created by default for MRS clusters with Kerberos authentication enabled and is used for administrators to maintain the clusters. + +Process Overview +---------------- + +In practice, MRS cluster users must understand the service scenarios of big data and plan user permissions. Then, create roles and assign permissions to the roles on MRS Manager to meet service requirements. Manager provides the user group function for administrators to create user groups for managing users of one or multiple service scenarios of the same type. + +.. note:: + + If a role has the permission of HDFS, HBase, Hive, or Yarn respectively, the role can only use the corresponding functions of the component. To use Manager, the corresponding Manager permission must be added to the role. + + +.. figure:: /_static/images/en-us_image_0000001349057881.png + :alt: **Figure 1** Process of creating a user + + **Figure 1** Process of creating a user diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/canceling_host_isolation.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/canceling_host_isolation.rst new file mode 100644 index 0000000..0c2a053 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/canceling_host_isolation.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_0256.html + +.. _mrs_01_0256: + +Canceling Host Isolation +======================== + +Scenario +-------- + +After the exception or fault of a host is handled, you must cancel the isolation of the host for proper usage. + +Users can cancel the isolation of a host on MRS Manager. + +Prerequisites +------------- + +- The host is in the **Isolated** state. +- The exception or fault of the host has been rectified. + +Procedure +--------- + +#. On MRS Manager, click **Hosts**. + +#. Select the check box of the host to be de-isolated. + +#. Choose **More** > **Cancel Host Isolation**, + +#. and click **OK** in the displayed dialog box. + + After **Operation successful.** is displayed, click **Finish**. The host is de-isolated successfully, and the value of **Operating Status** becomes **Normal**. + +#. Click the name of the de-isolated host to show its status, and click **Start All Roles**. 
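+
+.. note::
+
+   The following check is optional and is not part of the procedure above. If the de-isolated host runs an HDFS DataNode instance, you can verify from a node with the cluster client installed that the DataNode has rejoined the cluster after you start all roles. The client installation path (**/opt/client** in this sketch) and the **kinit** step (needed only in clusters with Kerberos authentication enabled) are assumptions that depend on your environment.
+
+   .. code-block:: bash
+
+      # Load the client environment variables; adjust the path to your client installation.
+      source /opt/client/bigdata_env
+      # In a cluster with Kerberos authentication enabled, authenticate first, for example:
+      # kinit <username>
+      # List the DataNodes known to the NameNode; the de-isolated host should be
+      # reported with a normal status in the output.
+      hdfs dfsadmin -report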
diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/configuring_customized_service_parameters.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/configuring_customized_service_parameters.rst new file mode 100644 index 0000000..5c8ed58 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/configuring_customized_service_parameters.rst @@ -0,0 +1,69 @@ +:original_name: mrs_01_0247.html + +.. _mrs_01_0247: + +Configuring Customized Service Parameters +========================================= + +Each MRS component supports all open-source parameters, and MRS Manager allows you to modify some of them for key application scenarios. Some component clients may not include every open-source parameter. For component parameters that cannot be modified directly on Manager, users can add them by using the configuration customization function on Manager. Newly added parameters are saved in component configuration files and take effect after the service is restarted. + +Impact on the System +-------------------- + +- After the service attributes are configured, the service needs to be restarted and cannot be accessed during the restart. +- You need to download and update the client configuration files after configuring HBase, HDFS, Hive, Spark, Yarn, and MapReduce service properties. + +Prerequisites +------------- + +You have understood the meanings of the parameters to be added, the configuration files in which they take effect, and their impact on components. + +Procedure +--------- + +#. On MRS Manager, click **Services**. + +#. Select the target service from the service list. + +#. Click **Service Configuration**. + +#. Set **Type** to **All**. + +#. In the navigation tree, select **Customization**. The customized parameters of the current component are displayed on Manager. + + The configuration files that save the newly added customized parameters are displayed in the **Parameter File** column. Different configuration files may contain the same open-source parameter. If a parameter is set to different values in different files, the value that takes effect depends on the sequence in which the component loads the configuration files. You can customize parameters for services and roles as required. Adding customized parameters for a single role instance is not supported. + +#. Based on the configuration files and parameter functions, locate the row where a specified parameter resides, enter the parameter name supported by the component in the **Name** column, and enter the parameter value in the **Value** column. + + - You can click |image1| or |image2| to add or delete a customized parameter. A customized parameter can be deleted only after it has been added by clicking |image3|. + - If you want to cancel the modification of a parameter value, click |image4| to restore it. + +#. Click **Save Configuration** and select **Restart the affected services or instances**. Click **OK** to restart the services. + + After **Operation successful.** is displayed, click **Finish**. The service is started successfully. + +.. _mrs_01_0247__en-us_topic_0035251703_section32890065192053: + +Task Example +------------ + +**Configuring Customized Hive Parameters** + +Hive depends on HDFS. By default, Hive accesses HDFS through the HDFS client, so the configuration parameters that take effect are controlled by HDFS in a unified manner.
For example, the HDFS parameter **ipc.client.rpc.timeout** affects the RPC timeout period for all clients to connect to the HDFS server. If you need to modify the timeout period for Hive to connect to HDFS, you can use the configuration customization function. After this parameter is added to the **core-site.xml** file of Hive, this parameter can be identified by the Hive service and its configuration overwrites the parameter configuration in HDFS. + +#. On MRS Manager, choose **Services** > **Hive** > **Service Configuration**. + +#. Set **Type** to **All**. + +#. In the navigation tree on the left, select **Customization** for the Hive service. The system displays the customized service parameters supported by Hive. + +#. In **core-site.xml**, locate the row that contains the **core.site.customized.configs** parameter, enter **ipc.client.rpc.timeout** in the **Name** column, and enter a new value in the **Value** column, for example, **150000**. The unit is millisecond. + +#. Click **Save Configuration** and select **Restart the affected services or instances**. Click **OK** to restart the service. + + After **Operation successful.** is displayed, click **Finish**. The service is started successfully. + +.. |image1| image:: /_static/images/en-us_image_0000001349057937.jpg +.. |image2| image:: /_static/images/en-us_image_0000001295898276.jpg +.. |image3| image:: /_static/images/en-us_image_0000001349057937.jpg +.. |image4| image:: /_static/images/en-us_image_0000001295738324.jpg diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/configuring_role_instance_parameters.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/configuring_role_instance_parameters.rst new file mode 100644 index 0000000..5b1c13b --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/configuring_role_instance_parameters.rst @@ -0,0 +1,70 @@ +:original_name: mrs_01_0250.html + +.. _mrs_01_0250: + +Configuring Role Instance Parameters +==================================== + +Scenario +-------- + +You can view and modify default role instance configurations on MRS Manager based on site requirements. The configurations can be imported and exported. + +Impact on the System +-------------------- + +You need to download and update the client configuration files after configuring HBase, HDFS, Hive, Spark, Yarn, and MapReduce service properties. + +Procedure +--------- + +- Modifying role instance configurations + + #. Click **Services**. + + #. Select the target service from the service list. + + #. Click the **Instances** tab. + + #. Click the target role instance from the role instance list. + + #. Click **Instance Configuration**. + + #. Set **Type** to **All**. The navigation tree of all configuration parameters of the role instance is displayed. + + #. In the navigation tree, select a specified parameter and change its value. You can also enter the parameter name in the **Search** box to search for the parameter and view the result. + + If you want to cancel the modification of a parameter value, click |image1| to restore it. + + #. Click **Save Configuration**, select **Restart the role instance**, and click **OK** to restart the role instance. + + After **Operation successful.** is displayed, click **Finish**. The role instance is started successfully. + +- Exporting Configuration Parameters of a Role Instance + + #. Click **Services**. + #. 
Select a service. + #. Select a role instance or click the **Instances** tab. + #. Select a role instance on a specified host. + #. Click **Instance Configuration**. + #. Click **Export Instance Configuration** to export the configuration data of a specified role instance, and choose a path for saving the configuration file. + +- Importing Configuration Data of a Role Instance + + #. Click **Services**. + + #. Select a service. + + #. Select a role instance or click the **Instances** tab. + + #. Select a role instance on a specified host. + + #. Click **Instance Configuration**. + + #. Click **Import Instance Configuration** to import the configuration data of the specified role instance. + + #. Click **Save Configuration** and select **Restart the role instance**. Click **OK**. + + After **Operation successful.** is displayed, click **Finish**. The role instance is started successfully. + +.. |image1| image:: /_static/images/en-us_image_0000001295738324.jpg diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/configuring_service_parameters.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/configuring_service_parameters.rst new file mode 100644 index 0000000..61a2558 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/configuring_service_parameters.rst @@ -0,0 +1,68 @@ +:original_name: mrs_01_0246.html + +.. _mrs_01_0246: + +Configuring Service Parameters +============================== + +On MRS Manager, you can view and modify the default service configurations based on site requirements and export or import the configurations. + +Impact on the System +-------------------- + +- You need to download and update the client configuration files after configuring HBase, HDFS, Hive, Spark, Yarn, and MapReduce service properties. +- The parameters of DBService cannot be modified when only one DBService role instance exists in the cluster. + +Procedure +--------- + +- Modify service configuration parameters. + + #. Click **Services**. + + #. Select the target service from the service list. + + #. Click **Service Configuration**. + + #. Set **Type** to **All**. All configuration parameters of the service are displayed in the navigation tree. The root nodes in the navigation tree, from top to bottom, represent the service names and role names. + + #. In the navigation tree, select a specified parameter and change its value. You can also enter the parameter name in the **Search** box to search for the parameter and view the result. + + If you want to cancel the modification of a parameter value, click |image1| to restore it. + + .. note:: + + You can also use host groups to change role instance configurations in batches. Select a role name from the **Role** drop-down list and choose **< Select Host >** in the **Host** drop-down list. Enter a name in the **Host Group Name** text box, select the hosts to be modified from the **Host** list, add them to the **Selected hosts** area, and click **OK**. The added host group can be selected from **Host** but is valid only on the current page; it is not saved and is discarded when the page is refreshed. + + #. Click **Save Configuration** and select **Restart the affected services or instances**. Click **OK** to restart the services. + + After **Operation successful.** is displayed, click **Finish**. The service is started successfully. + + ..
note:: + + To update the queue configuration of the Yarn service without restarting service, choose **More** > **Refresh Queue** to update the queue for the configuration to take effect. + +- Export service configuration parameters. + + #. Click **Services**. + #. Select a service. + #. Click **Service Configuration**. + #. Click **Export Service Configuration**. Select a path for saving the configuration files. + +- Import service configuration parameters. + + #. Click **Services**. + + #. Select a service. + + #. Click **Service Configuration**. + + #. Click **Import Service Configuration**. + + #. Select the target configuration file. + + #. Click **Save Configuration** and select **Restart the affected services or instances**. Click **OK**. + + After **Operation successful.** is displayed, click **Finish**. The service is started successfully. + +.. |image1| image:: /_static/images/en-us_image_0000001295738324.jpg diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/decommissioning_and_recommissioning_a_role_instance.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/decommissioning_and_recommissioning_a_role_instance.rst new file mode 100644 index 0000000..1739192 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/decommissioning_and_recommissioning_a_role_instance.rst @@ -0,0 +1,42 @@ +:original_name: mrs_01_0252.html + +.. _mrs_01_0252: + +Decommissioning and Recommissioning a Role Instance +=================================================== + +Scenario +-------- + +If a Core or Task node is faulty, the cluster status may be displayed as **Abnormal**. In an MRS cluster, data can be stored on different Core nodes. Users can decommission the specified role instance on MRS Manager to stop the role instance from providing services. After fault rectification, you can recommission the role instance. + +For versions earlier than MRS 1.6.0, the following role instances can be decommissioned and recommissioned. + +- DataNode role instance on HDFS +- NodeManager role instance on Yarn + +The following role instances can be decommissioned and recommissioned for MRS 1.6.0 or later. + +- DataNode role instance on HDFS +- NodeManager role instance on Yarn +- RegionServer role instance on HBase +- Broker role instance on Kafka + +Restrictions: + +- If the number of the DataNodes is less than or equal to that of HDFS copies, decommissioning cannot be performed. If the number of HDFS copies is three and the number of DataNodes is less than four in the system, decommissioning cannot be performed. In this case, an error will be reported and the decommissioning will be stopped 30 minutes after the decommissioning attempt is performed on Manager. +- If the number of Kafka Broker instances is less than or equal to that of copies, decommissioning cannot be performed. For example, if the number of Kafka copies is two and the number of nodes is less than three in the system, decommissioning cannot be performed. Instance decommissioning will fail on Manager and exit. +- If a role instance is out of service, you must recommission the instance to start it before using it again. + +Procedure +--------- + +#. On MRS Manager page, click **Services**. +#. Click a service in the service list. +#. Click the **Instances** tab. +#. Select an instance. +#. Choose **More** > **Decommission** or **Recommission** to perform the corresponding operation. 
+ + .. note:: + + During the instance decommissioning, if the service corresponding to the instance is restarted in the cluster using another browser, MRS Manager displays a message indicating that the instance decommissioning is stopped, but the **Operating Status** of the instance is displayed as **Started**. In this case, the instance has been decommissioned on the background. You need to decommission the instance again to synchronize the operating status. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/exporting_configuration_data_of_a_cluster.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/exporting_configuration_data_of_a_cluster.rst new file mode 100644 index 0000000..ae4a86a --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/exporting_configuration_data_of_a_cluster.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_0260.html + +.. _mrs_01_0260: + +Exporting Configuration Data of a Cluster +========================================= + +Scenario +-------- + +You can export all configuration data of a cluster on MRS Manager to meet site requirements. The exported configuration data is used to rapidly update service configuration. + +Procedure +--------- + +#. On MRS Manager page, click **Services**. + +#. Choose **More** > **Export Cluster Configuration**. + + The exported file is used to update service configurations. For details, see **Import service configuration parameters** in :ref:`Configuring Service Parameters `. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/index.rst new file mode 100644 index 0000000..3bb3300 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/index.rst @@ -0,0 +1,44 @@ +:original_name: mrs_01_0242.html + +.. _mrs_01_0242: + +Object Management +================= + +- :ref:`Managing Objects ` +- :ref:`Viewing Configurations ` +- :ref:`Managing Services ` +- :ref:`Configuring Service Parameters ` +- :ref:`Configuring Customized Service Parameters ` +- :ref:`Synchronizing Service Configurations ` +- :ref:`Managing Role Instances ` +- :ref:`Configuring Role Instance Parameters ` +- :ref:`Synchronizing Role Instance Configuration ` +- :ref:`Decommissioning and Recommissioning a Role Instance ` +- :ref:`Managing a Host ` +- :ref:`Isolating a Host ` +- :ref:`Canceling Host Isolation ` +- :ref:`Starting or Stopping a Cluster ` +- :ref:`Synchronizing Cluster Configurations ` +- :ref:`Exporting Configuration Data of a Cluster ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + managing_objects + viewing_configurations + managing_services + configuring_service_parameters + configuring_customized_service_parameters + synchronizing_service_configurations + managing_role_instances + configuring_role_instance_parameters + synchronizing_role_instance_configuration + decommissioning_and_recommissioning_a_role_instance + managing_a_host + isolating_a_host + canceling_host_isolation + starting_or_stopping_a_cluster + synchronizing_cluster_configurations + exporting_configuration_data_of_a_cluster diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/isolating_a_host.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/isolating_a_host.rst new file mode 100644 index 0000000..f4f9f6a --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/isolating_a_host.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_0255.html + +.. _mrs_01_0255: + +Isolating a Host +================ + +Scenario +-------- + +If a host is found to be abnormal or faulty, affecting cluster performance or preventing services from being provided, you can temporarily exclude that host from the available nodes in the cluster. In this way, the client can access other available nodes. In scenarios where patches are to be installed in a cluster, you can also exclude a specified node from patch installation. + +Users can manually isolate a host on MRS Manager based on actual service requirements or the O&M plan. Only non-management nodes can be isolated. + +Impact on the System +-------------------- + +- After a host is isolated, all role instances on the host will be stopped. You cannot start, stop, or configure the host or any instances on it. + +- After a host is isolated, statistics about the monitoring status and indicator data of the host hardware and instances on the host cannot be collected or displayed. + +Procedure +--------- + +#. On MRS Manager, click **Hosts**. + +#. Select the check box of the host to be isolated. + +#. Choose **More** > **Isolate Host**. + +#. Click **OK** in the displayed dialog box. + + After **Operation successful.** is displayed, click **Finish**. The host is isolated successfully, and the value of **Operating Status** becomes **Isolated**. + + .. note:: + + For isolated hosts, you can cancel the isolation and add them to the cluster again. For details, see :ref:`Canceling Host Isolation `. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_a_host.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_a_host.rst new file mode 100644 index 0000000..5b80e3c --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_a_host.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0254.html + +.. _mrs_01_0254: + +Managing a Host +=============== + +Scenario +-------- + +When a host is abnormal or faulty, you need to stop all roles of the host on MRS Manager to check the host. After the host fault is rectified, start all roles running on the host to recover host services. + +Procedure +--------- + +#. Click **Hosts**. +#. Select the check box of the target host. +#. Choose **More** > **Start All Roles** or **Stop All Roles** accordingly.
diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_objects.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_objects.rst new file mode 100644 index 0000000..c9503dc --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_objects.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_0243.html + +.. _mrs_01_0243: + +Managing Objects +================ + +MRS contains different types of basic objects as described in :ref:`Table 1 `. + +.. _mrs_01_0243__en-us_topic_0035251699_table23400575171145: + +.. table:: **Table 1** MRS basic object overview + + +------------------+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ + | Object | Description | Example | + +==================+===============================================================================+============================================================================================================================+ + | Service | Function set that can complete specific business. | KrbServer service and LdapServer service | + +------------------+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ + | Service instance | Specific instance of a service, usually called service. | KrbServer service | + +------------------+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ + | Service role | Function entity that forms a complete service, usually called role. | KrbServer is composed of the KerberosAdmin role and KerberosServer role. | + +------------------+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ + | Role instance | Specific instance of a service role running on a host. | KerberosAdmin that is running on Host2 and KerberosServer that is running on Host3 | + +------------------+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ + | Host | An ECS running Linux OS. | Host1 to Host5 | + +------------------+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ + | Rack | Physical entity that contains multiple hosts connecting to the same switch. | Rack1 contains Host1 to Host5. | + +------------------+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ + | Cluster | Logical entity that consists of multiple hosts and provides various services. | Cluster names **Cluster1** consists of five hosts (Host1 to Host5) and provides services such as KrbServer and LdapServer. 
| + +------------------+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_role_instances.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_role_instances.rst new file mode 100644 index 0000000..21952ff --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_role_instances.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_0249.html + +.. _mrs_01_0249: + +Managing Role Instances +======================= + +Scenario +-------- + +You can start a role instance that is in the **Stopped**, **Failed to stop** or **Failed to start** status, stop an unused or abnormal role instance or restart an abnormal role instance to recover its functions. + +Procedure +--------- + +#. On MRS Manager page, click **Services**. +#. Select the target service from the service list. +#. Click the **Instances** tab. +#. Select the check box on the left of the target role instance. +#. Choose **More** > **Start Instance**, **Stop Instance**, or **Restart Instance** accordingly. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_services.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_services.rst new file mode 100644 index 0000000..2f8a993 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/managing_services.rst @@ -0,0 +1,27 @@ +:original_name: mrs_01_0245.html + +.. _mrs_01_0245: + +Managing Services +================= + +You can perform the following operations on MRS Manager: + +- Start the service in the **Stopped**, **Stop Failed**, or **Start Failed** state to use the service. +- Stop the services or stop abnormal services. +- Restart abnormal services or configure expired services to restore or enable the services. + +Procedure +--------- + +#. On MRS Manager page, click **Services**. + +#. Locate the row that contains the target service, **Start**, **Stop**, or **Restart** to start, stop, or restart the service. + + Services are interrelated. If a service is started, stopped, and restarted, services dependent on it will be affected. + + The services will be affected in the following ways: + + - If a service is to be started, the lower-layer services dependent on it must be started first. + - If a service is stopped, the upper-layer services dependent on it are unavailable. + - If a service is restarted, the running upper-layer services dependent on it must be restarted. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/starting_or_stopping_a_cluster.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/starting_or_stopping_a_cluster.rst new file mode 100644 index 0000000..91df809 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/starting_or_stopping_a_cluster.rst @@ -0,0 +1,17 @@ +:original_name: mrs_01_0258.html + +.. 
_mrs_01_0258: + +Starting or Stopping a Cluster +============================== + +Scenario +-------- + +A cluster is a collection of service components. You can start or stop all services in a cluster. + +Procedure +--------- + +#. On MRS Manager page, click **Services**. +#. In the upper part of the service list, choose **More** > **Start Cluster** or **Stop Cluster** accordingly. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/synchronizing_cluster_configurations.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/synchronizing_cluster_configurations.rst new file mode 100644 index 0000000..57cc403 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/synchronizing_cluster_configurations.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_0259.html + +.. _mrs_01_0259: + +Synchronizing Cluster Configurations +==================================== + +Scenario +-------- + +If **Configuration Status** of all services or some services is **Expired** or **Failed**, synchronize the configuration for the cluster or the service to restore its configuration status. + +- If all services in the cluster are in the **Failed** state, synchronize the cluster configuration with the background configuration. +- If some services in the cluster are in the **Failed** state, synchronize the configuration of the affected services with the background configuration. + +Impact on the System +-------------------- + +After synchronizing cluster configurations, you need to restart the services whose configurations have expired. These services are unavailable during restart. + +Procedure +--------- + +#. On MRS Manager page, click **Services**. + +#. In the upper part of the service list, choose **More** > **Synchronize Configuration**. + +#. In the displayed dialog box, select **Restart services and instances whose configuration have expired**, and click **OK** to restart the service whose configuration has expired. + + When **Operation successful.** is displayed, click **Finish**. The service is started successfully. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/synchronizing_role_instance_configuration.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/synchronizing_role_instance_configuration.rst new file mode 100644 index 0000000..2852295 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/synchronizing_role_instance_configuration.rst @@ -0,0 +1,31 @@ +:original_name: mrs_01_0251.html + +.. _mrs_01_0251: + +Synchronizing Role Instance Configuration +========================================= + +Scenario +-------- + +When **Configuration Status** of a role instance is **Expired** or **Failed**, you can synchronize the configuration data of the role instance with the background configuration. + +Impact on the System +-------------------- + +After synchronizing a role instance configuration, you need to restart the role instance whose configuration has expired. The role instance is unavailable during restart. + +Procedure +--------- + +#. On MRS Manager, click **Services** and select a service name. + +#. Click the **Instances** tab. + +#. Click the target role instance from the role instance list. + +#.
Choose **More** > **Synchronize Configuration** above the role instance status and indicator information. + +#. In the displayed dialog box, select **Restart services and instances whose configuration have expired**, and click **OK** to restart the role instance. + + After **Operation successful** is displayed, click **Finish**. The role instance is started successfully. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/synchronizing_service_configurations.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/synchronizing_service_configurations.rst new file mode 100644 index 0000000..83247ea --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/synchronizing_service_configurations.rst @@ -0,0 +1,29 @@ +:original_name: mrs_01_0248.html + +.. _mrs_01_0248: + +Synchronizing Service Configurations +==================================== + +Scenario +-------- + +If **Configuration Status** of a service is **Expired** or **Failed**, synchronize configurations for the cluster or service to restore its configuration status. If all services in the cluster are in the **Failed** state, synchronize the cluster configuration with the background configuration. + +Impact on the System +-------------------- + +After synchronizing service configurations, you need to restart the services whose configurations have expired. These services are unavailable during restart. + +Procedure +--------- + +#. On MRS Manager page, click **Services**. + +#. Select the target service from the service list. + +#. In the upper part of the service status and metric information, choose **More** > **Synchronize Configuration**. + +#. In the displayed dialog box, select **Restart services and instances whose configuration have expired.** and click **OK** to restart the service whose configuration has expired. + + When **Operation successful** is displayed, click **Finish**. The service is started successfully. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/viewing_configurations.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/viewing_configurations.rst new file mode 100644 index 0000000..ba4771e --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/object_management/viewing_configurations.rst @@ -0,0 +1,38 @@ +:original_name: mrs_01_0244.html + +.. _mrs_01_0244: + +Viewing Configurations +====================== + +On MRS Manager, users can view the configurations of services (including roles) and role instances. + +Procedure +--------- + +- Query service configurations. + + #. On MRS Manager page, click **Services**. + + #. Select the target service from the service list. + + #. Click **Service Configuration**. + + #. Set **Type** to **All**. All configuration parameters of the service are displayed in the navigation tree. The root nodes from top down in the navigation tree represent the service names and role names. + + #. In the navigation tree, select a specified parameter and change its value. You can also enter the parameter name in the **Search** box to search for the parameter and view the result. + + The parameters under the service nodes and role nodes are service configuration parameters and role configuration parameters respectively. + + #. 
In the **Non-default** parameter, select **Non-default**. The parameters whose values are not default values will be displayed. + +- Query role instance configurations. + + #. On MRS Manager page, click **Services**. + #. Select the target service from the service list. + #. Click the **Instances** tab. + #. Click the target role instance from the role instance list. + #. Click **Instance Configuration**. + #. Set **Type** to **All**. The navigation tree of all configuration parameters of the role instance is displayed. + #. In the navigation tree, select a specified parameter and change its value. You can also enter the parameter name in the **Search** box to search for the parameter and view the result. + #. In the **Non-default** parameter, select **Non-default**. The parameters whose values are not default values will be displayed. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/index.rst new file mode 100644 index 0000000..3174457 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0574.html + +.. _mrs_01_0574: + +Patch Operation Guide +===================== + +- :ref:`Patch Operation Guide for Versions Earlier than MRS 1.7.0 ` +- :ref:`Patch Operation Guide for Versions from MRS 1.7.0 to MRS 2.1.0 ` +- :ref:`Supporting Rolling Patches ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + patch_operation_guide_for_versions_earlier_than_mrs_1.7.0 + patch_operation_guide_for_versions_from_mrs_1.7.0_to_mrs_2.1.0 + supporting_rolling_patches diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/patch_operation_guide_for_versions_earlier_than_mrs_1.7.0.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/patch_operation_guide_for_versions_earlier_than_mrs_1.7.0.rst new file mode 100644 index 0000000..a218d94 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/patch_operation_guide_for_versions_earlier_than_mrs_1.7.0.rst @@ -0,0 +1,63 @@ +:original_name: mrs_01_0575.html + +.. _mrs_01_0575: + +Patch Operation Guide for Versions Earlier than MRS 1.7.0 +========================================================= + +If you obtain patch information from the following sources, upgrade the patch according to actual requirements. + +- You obtain information about the patch released by MRS from a message pushed by the message center service. +- You obtain information about the patch by accessing the cluster and viewing patch information. + +Preparing for Patch Installation +-------------------------------- + +- Follow instructions in :ref:`Performing a Health Check ` to check cluster status. If the cluster health status is normal, install a patch. +- The administrator has uploaded the cluster patch package to the server. For details, see :ref:`Uploading the Patch Package `. +- You need to confirm the target patch to be installed according to the patch information in the patch content. + +.. _mrs_01_0575__en-us_topic_0109317365_section63677183610: + +Uploading the Patch Package +--------------------------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. +#. 
Choose **System** > **Manage Patch**. The **Manage Patch** page is displayed. +#. Click **Upload Patch** and set the following parameters. + + - **Patch File Path**: Folder created in the OBS file system where the patch package is stored, for example, **MRS_1.6.2/MRS_1_6_2_11.tar.gz** + - **Bucket**: Name of the OBS file system where the patch package is stored, for example, **mrs_patch** + + .. note:: + + You can obtain the file system name and patch file path on the **Patch Information** tab page. The value of the **Patch Path** is in the following format: [File system name]/[Patch file path]. + + - **AK**: For details, see **My Credential** > **Access Keys**. + - **SK**: For details, see **My Credential** > **Access Keys**. + +#. Click **OK** to upload the patch. + +Installing a Patch +------------------ + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. +#. Choose **System** > **Manage Patch**. The **Manage Patch** page is displayed. +#. In the **Operation** column, click **Install**. +#. In the displayed dialog box, click **OK** to install the patch. +#. After the patch is installed, you can view the installation status in the progress bar. If the installation fails, contact the administrator. + + .. note:: + + For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. + +Uninstalling a Patch +-------------------- + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. +#. Choose **System** > **Manage Patch**. The **Manage Patch** page is displayed. +#. In the **Operation** column, click **Uninstall**. + + .. note:: + + For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/patch_operation_guide_for_versions_from_mrs_1.7.0_to_mrs_2.1.0.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/patch_operation_guide_for_versions_from_mrs_1.7.0_to_mrs_2.1.0.rst new file mode 100644 index 0000000..31cc0c1 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/patch_operation_guide_for_versions_from_mrs_1.7.0_to_mrs_2.1.0.rst @@ -0,0 +1,41 @@ +:original_name: mrs_01_0576.html + +.. _mrs_01_0576: + +Patch Operation Guide for Versions from MRS 1.7.0 to MRS 2.1.0 +============================================================== + +If you obtain patch information from the following sources, upgrade the patch according to actual requirements. + +- You obtain information about the patch released by MRS from a message pushed by the message center service. +- You obtain information about the patch by accessing the cluster and viewing patch information. + +Preparing for Patch Installation +-------------------------------- + +- Follow instructions in :ref:`Performing a Health Check ` to check cluster status. If the cluster health status is normal, install a patch. +- You need to confirm the target patch to be installed according to the patch information in the patch content. + +Installing a Patch +------------------ + +#. Log in to the MRS management console. +#. Choose **Clusters > Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's basic information. +#. 
On the **Patch Information** page, click **Install** in the **Operation** column to install the target patch. + + .. note:: + + - For details about rolling patch operations, see :ref:`Supporting Rolling Patches `. + - For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. + +Uninstalling a Patch +-------------------- + +#. Log in to the MRS management console. +#. Choose **Clusters > Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's basic information. +#. On the **Patch Information** page, click **Uninstall** in the **Operation** column to uninstall the target patch. + + .. note:: + + - For details about rolling patch operations, see :ref:`Supporting Rolling Patches `. + - For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/supporting_rolling_patches.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/supporting_rolling_patches.rst new file mode 100644 index 0000000..5efa6d4 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/patch_operation_guide/supporting_rolling_patches.rst @@ -0,0 +1,101 @@ +:original_name: mrs_01_0577.html + +.. _mrs_01_0577: + +Supporting Rolling Patches +========================== + +The rolling patch function indicates that patches are installed or uninstalled for one or more services in a cluster by performing a rolling service restart (restarting services or instances in batches), without interrupting the services or within a minimized service interruption interval. Services in a cluster are divided into the following three types based on whether they support rolling patch: + +- Services supporting rolling patch installation or uninstallation: All businesses or part of them (varying depending on different services) of the services are not interrupted during patch installation or uninstallation. +- Services not supporting rolling patch installation or uninstallation: Businesses of the services are interrupted during patch installation or uninstallation. +- Services with some roles supporting rolling patch installation or uninstallation: Some businesses of the services are not interrupted during patch installation or uninstallation. + +:ref:`Table 1 ` provides services and instances that support or do not support rolling restart in the MRS cluster. + +.. _mrs_01_0577__en-us_topic_0143479582_table054720341161: + +.. 
table:: **Table 1** Services and instances that support or do not support rolling restart + + ========= ================ ================================== + Service Instance Whether to Support Rolling Restart + ========= ================ ================================== + HDFS NameNode Yes + \ ZKFC + \ JournalNode + \ HttpFS + \ DataNode + Yarn ResourceManager Yes + \ NodeManager + Hive MetaStore Yes + \ WebHCat + \ HiveServer + MapReduce JobHistoryServer Yes + HBase HMaster Yes + \ RegionServer + \ ThriftServer + \ RESTServer + Spark JobHistory Yes + \ JDBCServer + \ SparkResource No + Hue Hue No + Tez TezUI No + Loader Sqoop No + ZooKeeper QuorumPeer Yes + Kafka Broker Yes + \ MirrorMaker No + Flume Flume Yes + \ MonitorServer + Storm Nimbus Yes + \ UI + \ Supervisor + \ LogViewer + ========= ================ ================================== + +Installing a Patch +------------------ + +#. Log in to the MRS management console. +#. Choose **Clusters** > **Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's basic information. +#. On the **Patch Information** page, click **Install** in the **Operation** column. +#. On the **Warning** page, enable or disable **Rolling Patch**. + + .. note:: + + - Enabling the rolling patch installation function: Services are not stopped before patch installation, and rolling service restart is performed after the patch installation. This minimizes the impact on cluster services but takes more time than common patch installation. + - Disabling the rolling patch uninstallation function: All services are stopped before patch uninstallation, and all services are restarted after the patch uninstallation. This temporarily interrupts the cluster and the services but takes less time than rolling patch uninstallation. + - The rolling patch installation function is not available in clusters with less than two Master nodes and three Core nodes. + +#. Click **OK** to install the target patch. +#. View the patch installation progress. + + a. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + b. Choose **System** > **Manage Patch**. On the **Manage Patch** page, you can view the patch installation progress. + + .. note:: + + For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. + +Uninstalling a Patch +-------------------- + +#. Log in to the MRS management console. +#. Choose **Clusters** > **Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's basic information. +#. On the **Patch Information** page, click **Uninstall** in the **Operation** column. +#. On the **Warning** page, enable or disable **Rolling Patch**. + + .. note:: + + - Enabling the rolling patch uninstallation function: Services are not stopped before patch uninstallation, and rolling service restart is performed after the patch uninstallation. This minimizes the impact on cluster services but takes more time than common patch uninstallation. + - Disabling the rolling patch uninstallation function: All services are stopped before patch uninstallation, and all services are restarted after the patch uninstallation. This temporarily interrupts the cluster and the services but takes less time than rolling patch uninstallation. + - The rolling patch uninstallation function is not available in clusters with less than two Master nodes and three Core nodes. + +#. 
Click **OK** to uninstall the target patch. +#. View the patch uninstallation progress. + + a. Access MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. + b. Choose **System** > **Manage Patch**. On the **Manage Patch** page, you can view the patch uninstallation progress. + + .. note:: + + For the isolated host nodes in the cluster, follow instructions in :ref:`Restoring Patches for the Isolated Hosts ` to restore the patch. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/changing_the_password_of_an_operation_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/changing_the_password_of_an_operation_user.rst new file mode 100644 index 0000000..52e6778 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/changing_the_password_of_an_operation_user.rst @@ -0,0 +1,41 @@ +:original_name: mrs_01_0427.html + +.. _mrs_01_0427: + +Changing the Password of an Operation User +========================================== + +Scenario +-------- + +Passwords of **Human-Machine** system users must be regularly changed to ensure MRS cluster security. This section describes how to change your passwords on MRS Manager. + +If a new password policy needs to be used for the password modified by the user, follow instructions in :ref:`Modifying a Password Policy ` to modify the password policy and then perform the following operations to modify the password. + +Impact on the System +-------------------- + +If you have downloaded a user authentication file, download it again and obtain the keytab file after changing the password of the MRS cluster user. + +Prerequisites +------------- + +- You have obtained the current password policies from the administrator. +- You have obtained the MRS Manager access address from the administrator. + +Procedure +--------- + +#. On MRS Manager, move the mouse cursor to |image1| in the upper right corner. + + On the menu that is displayed, select **Change Password**. + +#. Fill in the **Old Password**, **New Password**, and **Confirm Password**. Click **OK**. + + For the cluster, the default password complexity requirements are as follows: + + - The password must contain 8 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. + +.. |image1| image:: /_static/images/en-us_image_0000001295898372.jpg diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/creating_a_role.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/creating_a_role.rst new file mode 100644 index 0000000..992ace4 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/creating_a_role.rst @@ -0,0 +1,208 @@ +:original_name: mrs_01_0420.html + +.. _mrs_01_0420: + +Creating a Role +=============== + +Scenario +-------- + +This section describes how to create a role on MRS Manager and authorize and manage Manager and components. + +Up to 1,000 roles can be created on MRS Manager. + +Prerequisites +------------- + +You have learned service requirements. + +Procedure +--------- + +#. 
On MRS Manager, choose **System** > **Manage Role**. + +#. Click **Create Role** and fill in **Role Name** and **Description**. + + **Role Name** is mandatory and contains 3 to 30 digits, letters, and underscores (_). **Description** is optional. + +#. In **Permission**, set role permission. + + a. Click **Service Name** and select a name in **View Name**. + b. Select one or more permissions. + + .. note:: + + - The **Permission** parameter is optional. + - If you select **View Name** to set component permissions, you can enter a resource name in the **Search** box in the upper right corner and click |image1|. The search result is displayed. + - The search scope covers only directories with current permissions. You cannot search subdirectories. Search by keywords supports fuzzy match and is case-insensitive. Results of the next page can be searched. + + .. table:: **Table 1** Manager permission description + + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+========================================================================================================================================+ + | **Alarm** | Authorizes the Manager alarm function. You can select **View** to view alarms and **Management** to manage alarms. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **Audit** | Authorizes the Manager audit log function. You can select **View** to view audit logs and **Management** to manage audit logs. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **Dashboard** | Authorizes the Manager overview function. You can select **View** to view the cluster overview. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **Hosts** | Authorizes the node management function. You can select **View** to view node information and **Management** to manage nodes. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **Services** | Authorizes the service management function. You can select **View** to view service information and **Management** to manage services. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **System_cluster_management** | Authorizes the MRS cluster management function. You can select **Management** to use the MRS patch management function. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **System_configuration** | Authorizes the MRS cluster configuration function. You can select **Management** to configure MRS clusters on Manager. 
| + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **System_task** | Authorizes the MRS cluster task function. You can select **Management** to manage periodic tasks of MRS clusters on Manager. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + | **Tenant** | Authorizes the Manager multi-tenant management function. You can select **Management** to manage multi-tenants. | + +-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+ + + .. table:: **Table 2** HBase permission description + + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+===================================================================================================================+ + | **SUPER_USER_GROUP** | Grants you HBase administrator rights. | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | **Global** | HBase resource type, indicating the whole HBase. | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | **Namespace** | HBase resource type, indicating namespace, which is used to store HBase tables. It has the following permissions: | + | | | + | | - **Admin** permission to manage the namespace | + | | - **Create**: permission to create HBase tables in the namespace | + | | - **Read**: permission to access the namespace | + | | - **Write**: permission to write data to the namespace | + | | - **Execute**: permission to execute the coprocessor (Endpoint) | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | **Table** | HBase resource type, indicating a data table, which is used to store data. It has the following permissions: | + | | | + | | - **Admin**: permission to manage a data table | + | | - **Create**: permission to create column families and columns in a data table | + | | - **Read**: permission to read a data table | + | | - **Write**: permission to write data to a data table | + | | - **Execute**: permission to execute the coprocessor (Endpoint) | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | **ColumnFamily** | HBase resource type, indicating a column family, which is used to store data. 
It has the following permissions: | + | | | + | | - **Create**: permission to create columns in a column family | + | | - **Read**: permission to read a column family | + | | - **Write**: permission to write data to a column family | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | **Qualifier** | HBase resource type, indicating a column, which is used to store data. It has the following permissions: | + | | | + | | - **Read**: permission to read a column | + | | - **Write**: permission to write data to a column | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + + By default, permissions of an HBase resource type of each level are shared by resource types of sub-levels. However, the **Recursive** option is not selected by default. For example, if **Read** and **Write** permissions are added to the **default** namespace, they are automatically added to the tables, column families, and columns in the namespace. If a child resource is set after the parent resource, the permission of the child resource is the union of the permissions of the parent resource and the current child resource. + + .. table:: **Table 3** HDFS permission description + + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+=====================================================================================================================================+ + | **Folder** | HDFS resource type, indicating an HDFS directory, which is used to store files or subdirectories. It has the following permissions: | + | | | + | | - **Read**: permission to access the HDFS directory | + | | - **Write**: permission to write data to the HDFS directory | + | | - **Execute**: permission to perform an operation. It must be selected when you add access or write permission. | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + | **Files** | HDFS resource type, indicating a file in HDFS. It has the following permissions: | + | | | + | | - **Read**: permission to access the file | + | | - **Write**: permission to write data to the file | + | | - **Execute**: permission to perform an operation. It must be selected when you add access or write permission. | + +-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ + + Permissions of an HDFS directory of each level are not shared by directory types of sub-levels by default. For example, if **Read** and **Execute** permissions are added to the **tmp** directory, you must select **Recursive** at the same time to add permissions to subdirectories. + + .. 
table:: **Table 4** Hive permission description + + +-------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+=======================================================================================================================+ + | **Hive Admin Privilege** | Grants you Hive administrator rights. | + +-------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | **Database** | Hive resource type, indicating a Hive database, which is used to store Hive tables. It has the following permissions: | + | | | + | | - **Select**: permission to query the Hive database | + | | - **Delete**: permission to perform the deletion operation in the Hive database | + | | - **Insert**: permission to perform the insertion operation in the Hive database | + | | - **Create**: permission to perform the creation operation in the Hive database | + +-------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + | **Table** | Hive resource type, indicating a Hive table, which is used to store data. It has the following permissions: | + | | | + | | - **Select**: permission to query the Hive table | + | | - **Delete**: permission to perform the deletion operation in the Hive table | + | | - **Update**: grants users the **Update** permission of the Hive table | + | | - **Insert**: permission to perform the insertion operation in the Hive table | + | | - **Grant of Select**: permission to grant the **Select** permission to other users using Hive statements | + | | - **Grant of Delete**: permission to grant the **Delete** permission to other users using Hive statements | + | | - **Grant of Update**: permission to grant the **Update** permission to other users using Hive statements | + | | - **Grant of Insert**: permission to grant the **Insert** permission to other users using Hive statements | + +-------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+ + + By default, permissions of a Hive resource type of each level are shared by resource types of sub-levels. However, the **Recursive** option is not selected by default. For example, if **Select** and **Insert** permissions are added to the **default** database, they are automatically added to the tables and columns in the database. If a child resource is set after the parent resource, the permission of the child resource is the union of the permissions of the parent resource and the current child resource. + + .. table:: **Table 5** Yarn permission description + + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+==================================================================================================================================================+ + | **Cluster Admin Operations** | Grants you Yarn administrator rights. 
| + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + | **root** | Root queue of Yarn. It has the following permissions: | + | | | + | | - **Submit**: permission to submit jobs in the queue | + | | - **Admin**: permission to manage permissions of the current queue | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Parent Queue** | Yarn resource type, indicating a parent queue containing sub-queues. A root queue is a type of a parent queue. It has the following permissions: | + | | | + | | - **Submit**: permission to submit jobs in the queue | + | | - **Admin**: permission to manage permissions of the current queue | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Leaf Queue** | Yarn resource type, indicating a leaf queue. It has the following permissions: | + | | | + | | - **Submit**: permission to submit jobs in the queue | + | | - **Admin**: permission to manage permissions of the current queue | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + + By default, permissions of a Yarn resource type of each level are shared by resource types of sub-levels. However, the **Recursive** option is not selected by default. For example, if the **Submit** permission is added to the **root** queue, it is automatically added to the sub-queue. Permissions inherited by sub-queues will not be displayed as selected in the **Permission** table. If a child resource is set after the parent resource, the permission of the child resource is the union of the permissions of the parent resource and the current child resource. + + .. table:: **Table 6** Hue permission description + + +-------------------------------------------+-------------------------------------------------+ + | Resource Supporting Permission Management | Permission Setting | + +===========================================+=================================================+ + | **Storage Policy Admin** | Grants you storage policy administrator rights. | + +-------------------------------------------+-------------------------------------------------+ + +#. Click **OK**. Return to **Manage Role**. + +Related Tasks +------------- + +**Modifying a role** + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage Role**. +#. In the row of the role to be modified, click **Modify** to modify role information. + + .. note:: + + If you change permissions assigned by the role, it takes 3 minutes to make new configurations take effect. + +#. Click **OK**. The modification is complete. + +**Deleting a role** + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage Role**. +#. In the row of the role to be deleted, click **Delete**. +#. Click **OK**. The role is deleted. + +.. 
|image1| image:: /_static/images/en-us_image_0000001296217676.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/creating_a_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/creating_a_user.rst new file mode 100644 index 0000000..9ebeda3 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/creating_a_user.rst @@ -0,0 +1,69 @@ +:original_name: mrs_01_0422.html + +.. _mrs_01_0422: + +Creating a User +=============== + +Scenario +-------- + +This section describes how to create users on MRS Manager based on site requirements and specify their operation permissions to meet service requirements. + +Up to 1,000 users can be created on MRS Manager. + +If the password of a new user must comply with a new password policy, follow the instructions in :ref:`Modifying a Password Policy ` to modify the password policy first, and then perform the following operations to create the user. + +Prerequisites +------------- + +Administrators have learned service requirements and created roles and role groups required by service scenarios. + +Procedure +--------- + +#. On MRS Manager, click **System**. + +#. In the **Permission** area, click **Manage User**. + +#. Above the user list, click **Create User**. + +#. Configure parameters as prompted and enter a username in **User Name**. + + .. note:: + + - If a username exists, you cannot create another username that only differs from the existing username in case. For example, if **User1** has been created, you cannot create **user1**. + - When you use the user you created, enter the correct username, which is case-sensitive. + - **User Name** is mandatory and contains 3 to 20 digits, letters, and underscores (_). + - **root**, **omm**, and **ommdba** are reserved system users. Select another username. + +#. Set **User Type** to either **Human-Machine** or **Machine-Machine**. + + - **Human-Machine** users: used for O&M on MRS Manager and operations on component clients. If you select this user type, you need to enter a password and confirm the password in **Password** and **Confirm Password** respectively. + - **Machine-Machine** users: used for MRS application development. If you select this user type, you do not need to enter a password, because the password is randomly generated. + +#. In **User Group**, click **Select and Join User Group** to select user groups and add users to them. + + .. note:: + + - If roles have been added to a user group, users in the group are granted the permissions of those roles. + - To grant Hive permissions to new users, add the users to the Hive group. + - If a user needs to manage tenant resources, the user group must be assigned the **Manager_tenant** role and the role corresponding to the tenant. + +#. In **Primary Group**, select a group as the primary group for users to create directories and files. The drop-down list contains all groups selected in **User Group**. + +#. In **Assign Rights by Role**, click **Select and Add Role** to add roles for users based on service requirements. + + .. note:: + + - When you create a user, if the permissions of the user groups granted to the user cannot meet service requirements, you can assign other created roles to the user. It takes about 3 minutes for the role permissions granted to the new user to take effect. + - Adding roles when creating a user specifies the user's permissions. 
+ - A new user can access WebUIs of HDFS, HBase, Yarn, Spark, and Hue even when roles are not assigned to the user. + +#. In **Description**, provide description based on onsite service requirements. + + **Description** is optional. + +#. Click **OK**. The user is created. + + If a new user is used in the MRS cluster for the first time, for example, used for logging in to MRS Manager or using the cluster client, the password must be changed. For details, see section **Changing the Password of an Operation User**. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/creating_a_user_group.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/creating_a_user_group.rst new file mode 100644 index 0000000..1ef00f2 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/creating_a_user_group.rst @@ -0,0 +1,59 @@ +:original_name: mrs_01_0421.html + +.. _mrs_01_0421: + +Creating a User Group +===================== + +Scenario +-------- + +This section describes how to create user groups and specify their operation permissions on MRS Manager. Management of single or multiple users can be unified in the user groups. After being added to a user group, users can obtain operation permissions owned by the user group. + +Up to 100 user groups can be created on MRS Manager. + +Prerequisites +------------- + +Administrators have learned service requirements and created roles required by service scenarios. + +Procedure +--------- + +#. On MRS Manager, click **System**. + +#. In the **Permission** area, click **Manage User Group**. + +#. Above the user group list, click **Create User Group**. + +#. Input **Group Name** and **Description**. + + **Group Name** is mandatory and contains 3 to 20 digits, letters, and underscores (_). **Description** is optional. + +#. In **Role**, click **Select and Add Role** to select and add specified roles. + + If you do not add the roles, the user group you are creating now does not have the permission to use MRS clusters. + +#. Click **OK**. The user group is created. + +Related Tasks +------------- + +**Modifying a user group** + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User Group**. +#. In the row of the user group to be modified, click **Modify**. + + .. note:: + + If you change role permissions assigned to the user group, it takes 3 minutes to make new configurations take effect. + +#. Click **OK**. The modification is complete. + +**Deleting a user group** + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User Group**. +#. In the row of the user group to be deleted, click **Delete**. +#. Click **OK**. The user group is deleted. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/deleting_a_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/deleting_a_user.rst new file mode 100644 index 0000000..b60aead --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/deleting_a_user.rst @@ -0,0 +1,19 @@ +:original_name: mrs_01_0426.html + +.. _mrs_01_0426: + +Deleting a User +=============== + +Scenario +-------- + +If an MRS cluster user is not required, the administrator can delete the user on MRS Manager. + +Procedure +--------- + +#. 
On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User**. +#. In the row of the user to be deleted, choose **More** > **Delete**. +#. Click **OK**. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/downloading_a_user_authentication_file.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/downloading_a_user_authentication_file.rst new file mode 100644 index 0000000..97e854a --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/downloading_a_user_authentication_file.rst @@ -0,0 +1,29 @@ +:original_name: mrs_01_0429.html + +.. _mrs_01_0429: + +Downloading a User Authentication File +====================================== + +Scenario +-------- + +When a user develops big data applications and runs them in an MRS cluster that supports Kerberos authentication, the user needs to prepare a user authentication file for accessing the MRS cluster. The keytab file in the authentication file can be used for user authentication. + +This section describes how to download a user authentication file and export the keytab file on MRS Manager. + +.. note:: + + - Before downloading a **Human-machine** user authentication file, change the password for the user on MRS Manager to make the initial password set by the administrator invalid. Otherwise, the exported keytab file cannot be used. For details, see :ref:`Changing the Password of an Operation User `. + - After a user password is changed, the exported keytab file becomes invalid, and you need to export a keytab file again. + +Procedure +--------- + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User**. +#. In the row of the user for whom you want to export the keytab file, choose **More** > **Download authentication credential** to download the authentication file. After the file is automatically generated, save it to a specified path and keep it properly. +#. Open the authentication file with a decompression program. + + - **user.keytab** indicates a user keytab file used for user authentication. + - **krb5.conf** indicates the configuration file of the authentication server. The application connects to the authentication server according to the configuration file information when authenticating users. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/index.rst new file mode 100644 index 0000000..ddb922b --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/index.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_0573.html + +.. _mrs_01_0573: + +Permissions Management +====================== + +- :ref:`Creating a Role ` +- :ref:`Creating a User Group ` +- :ref:`Creating a User ` +- :ref:`Modifying User Information ` +- :ref:`Locking a User ` +- :ref:`Unlocking a User ` +- :ref:`Deleting a User ` +- :ref:`Changing the Password of an Operation User ` +- :ref:`Initializing the Password of a System User ` +- :ref:`Downloading a User Authentication File ` +- :ref:`Modifying a Password Policy ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + creating_a_role + creating_a_user_group + creating_a_user + modifying_user_information + locking_a_user + unlocking_a_user + deleting_a_user + changing_the_password_of_an_operation_user + initializing_the_password_of_a_system_user + downloading_a_user_authentication_file + modifying_a_password_policy diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/initializing_the_password_of_a_system_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/initializing_the_password_of_a_system_user.rst new file mode 100644 index 0000000..ce67812 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/initializing_the_password_of_a_system_user.rst @@ -0,0 +1,70 @@ +:original_name: mrs_01_0428.html + +.. _mrs_01_0428: + +Initializing the Password of a System User +========================================== + +Scenario +-------- + +This section describes how to initialize a password on MRS Manager if a user forgets the password or the password of a public account needs to be changed regularly. After password initialization, the user must change the password upon the first login. + +Impact on the System +-------------------- + +If you have downloaded a user authentication file, download it again and obtain the keytab file after initializing the password of the MRS cluster user. + +Initializing the Password of a Human-Machine User +------------------------------------------------- + +#. On MRS Manager, click **System**. + +#. In the **Permission** area, click **Manage User**. + +#. Locate the row that contains the user whose password is to be initialized, choose **More** > **Initialize password**, and change the password as prompted. + + In the window that is displayed, enter the password of the current administrator account and click **OK**. Then in **Initialize password**, click **OK**. + + For the cluster, the default password complexity requirements are as follows: + + - The password must contain 8 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. + +Initializing the Password of a Machine-Machine User +--------------------------------------------------- + +#. Prepare a client based on service conditions and log in to the node where the client is installed. + +#. Run the following command to switch the user: + + **sudo su - omm** + +#. Run the following command to switch to the client directory, for example, **/opt/client**: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Run the following command to log in to the console as user **kadmin/admin**: + + **kadmin -p kadmin/admin** + + .. note:: + + The default password of user **kadmin/admin** is **KAdmin@123**, which will expire upon your first login. Change the password as prompted and keep the new password secure. + +#. Run the following command to reset the password of a component running user. This operation takes effect for all servers. + + **cpw** *Component running user name* + + For example, **cpw oms/manager**. 
+ + For the cluster, the default password complexity requirements are as follows: + + - The password must contain 8 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/locking_a_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/locking_a_user.rst new file mode 100644 index 0000000..838d3b5 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/locking_a_user.rst @@ -0,0 +1,23 @@ +:original_name: mrs_01_0424.html + +.. _mrs_01_0424: + +Locking a User +============== + +This section describes how to lock users in MRS clusters. A locked user cannot log in to MRS Manager or perform security authentication in the cluster. + +A locked user can be unlocked by an administrator manually or until the lock duration expires. You can lock a user by using either of the following methods: + +- Automatic lock: Set **Number of Password Retries** in **Configure Password Policy**. If user login attempts exceed the parameter value, the user is automatically locked. For details, see :ref:`Modifying a Password Policy `. +- Manual lock: The administrator manually locks a user. + +The following describes how to manually lock a user. **Machine-Machine** users cannot be locked. + +Procedure +--------- + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User**. +#. In the row of the user to be locked, click **Lock User**. +#. In the window that is displayed, click **Yes** to lock the user. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/modifying_a_password_policy.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/modifying_a_password_policy.rst new file mode 100644 index 0000000..bc23940 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/modifying_a_password_policy.rst @@ -0,0 +1,44 @@ +:original_name: mrs_01_0430.html + +.. _mrs_01_0430: + +Modifying a Password Policy +=========================== + +Scenario +-------- + +This section describes how to set password and user login security rules as well as user lock rules. Password policies set on MRS Manager take effect for **Human-machine** users only, because the passwords of **Machine-machine** users are randomly generated. + +If a new password policy needs to be used for a new user's password or the password modified by the user, perform the following operations to modify the password policy first, and then create a user or change the password by following instructions in :ref:`Creating a User ` or :ref:`Changing the Password of an Operation User `. + +.. important:: + + Modify password policies based on service security requirements, because they involve user management security. Otherwise, security risks may be caused. + +Procedure +--------- + +#. On MRS Manager, click **System**. +#. Click **Configure Password Policy**. +#. Modify password policies as prompted. For parameter details, see the following table: + + .. 
table:: **Table 1** Password policy parameter description + + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +==============================================================+==============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | **Minimum Password Length** | Indicates the minimum number of characters a password contains. The value ranges from 8 to 32. The default value is **8**. | + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Number of Character Types** | Indicates the minimum number of character types a password contains. The character types are uppercase letters, lowercase letters, digits, spaces, and special characters (:literal:`~`!?,.:;-_'(){}[]/<>@#$%^&*+|\\=`). The value can be **3** or **4**. The default value **3** indicates that the password must contain at least three types of the following characters: uppercase letters, lowercase letters, digits, special characters, and spaces. 
| + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Password Validity Period (days)** | Indicates the validity period (days) of a password. The value ranges from 0 to 90. 0 means that the password is permanently valid. The default value is **90**. | + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Password Expiration Notification Days** | Indicates the number of days in advance users are notified that their passwords are about to expire. After the value is set, if the difference between the cluster time and the password expiration time is smaller than this value, the user receives password expiration notifications. When a user logs in to MRS Manager, a message is displayed, indicating that the password is about to expire and asking the user whether to change the password. The value ranges from **0** to *X* (*X* must be set to the half of the password validity period and rounded down). Value **0** indicates that no notification is sent. The default value is **5**. | + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Interval of Resetting Authentication Failure Count (min)** | Indicates the interval of retaining incorrect password attempts, in minutes. The value ranges from 0 to 1440. 0 indicates that incorrect password attempts are permanently retained and 1440 indicates that incorrect password attempts are retained for one day. The default value is **5**. 
| + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Number of Password Retries** | Indicates the number of consecutive wrong passwords allowed before the system locks the user. The value ranges from 3 to 30. The default value is **5**. | + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Account Lock Duration (min)** | Indicates the time period for which a user is locked when the user lockout conditions are met. The value ranges from 5 to 120. The default value is **5**. | + +--------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/modifying_user_information.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/modifying_user_information.rst new file mode 100644 index 0000000..bd8da6b --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/modifying_user_information.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0423.html + +.. _mrs_01_0423: + +Modifying User Information +========================== + +Scenario +-------- + +This section describes how to modify user information on MRS Manager, including information about the user group, primary group, role, and description. + +Procedure +--------- + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User**. +#. In the row of the user to be modified, click **Modify**. + + .. 
note:: + + If you change the user group of a user or assign new role permissions to a user, it takes about 3 minutes for the new configurations to take effect. + +#. Click **OK**. The modification is complete. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/unlocking_a_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/unlocking_a_user.rst new file mode 100644 index 0000000..f4caa85 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/permissions_management/unlocking_a_user.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_0425.html + +.. _mrs_01_0425: + +Unlocking a User +================ + +If a user is locked because the number of login attempts exceeds the value of **Number of Password Retries**, or because the user is manually locked by the administrator, the administrator can unlock the user on MRS Manager. + +Procedure +--------- + +#. On MRS Manager, click **System**. +#. In the **Permission** area, click **Manage User**. +#. In the row of the user to be unlocked, click **Unlock User**. +#. In the window that is displayed, click **Yes** to unlock the user. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/restoring_patches_for_the_isolated_hosts.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/restoring_patches_for_the_isolated_hosts.rst new file mode 100644 index 0000000..dc4b226 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/restoring_patches_for_the_isolated_hosts.rst @@ -0,0 +1,14 @@ +:original_name: mrs_01_0578.html + +.. _mrs_01_0578: + +Restoring Patches for the Isolated Hosts +======================================== + +If some hosts in a cluster are isolated, perform the following operations to restore the patches on these isolated hosts after the patch has been installed on the other hosts in the cluster. After patch restoration, the patch versions of the isolated host nodes are consistent with those of the nodes that are not isolated. + +#. Access MRS Manager. For details, see :ref:`Accessing MRS Manager (MRS 2.1.0 or Earlier) `. +#. Choose **System** > **Manage Patch**. The **Manage Patch** page is displayed. +#. In the **Operation** column, click **View Details**. +#. On the patch details page, select host nodes whose **Status** is **Isolated**. +#. Click **Select and Restore** to restore the isolated host nodes. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/rolling_restart.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/rolling_restart.rst new file mode 100644 index 0000000..aa5dc18 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/rolling_restart.rst @@ -0,0 +1,132 @@ +:original_name: mrs_01_0579.html + +.. _mrs_01_0579: + +Rolling Restart +=============== + +After modifying the configuration items of a big data component, you need to restart the corresponding service to make the new configurations take effect. If you use the normal restart mode, all services or instances are restarted concurrently, which may cause service interruption. To ensure that services are not affected during a service restart, you can restart services or instances in batches by performing a rolling restart. For instances in active/standby mode, the standby instance is restarted first and then the active instance is restarted. 
Rolling restart takes longer than normal restart. + +:ref:`Table 1 ` provides services and instances that support or do not support rolling restart in the MRS cluster. + +.. _mrs_01_0579__en-us_topic_0138965094_en-us_topic_0143479582_table054720341161: + +.. table:: **Table 1** Services and instances that support or do not support rolling restart + + ========= ================ ================================== + Service Instance Whether to Support Rolling Restart + ========= ================ ================================== + HDFS NameNode Yes + \ ZKFC + \ JournalNode + \ HttpFS + \ DataNode + Yarn ResourceManager Yes + \ NodeManager + Hive MetaStore Yes + \ WebHCat + \ HiveServer + MapReduce JobHistoryServer Yes + HBase HMaster Yes + \ RegionServer + \ ThriftServer + \ RESTServer + Spark JobHistory Yes + \ JDBCServer + \ SparkResource No + Hue Hue No + Tez TezUI No + Loader Sqoop No + ZooKeeper Quorumpeer Yes + Kafka Broker Yes + \ MirrorMaker No + Flume Flume Yes + \ MonitorServer + Storm Nimbus Yes + \ UI + \ Supervisor + \ Logviewer + ========= ================ ================================== + +Restrictions +------------ + +- Perform a rolling restart during off-peak hours. + + - Otherwise, a rolling restart failure may occur. For example, if the throughput of Kafka is high (over 100 MB/s) during the Kafka rolling restart, the Kafka rolling restart may fail. + - For example, if the requests per second of each RegionServer on the native interface exceed 10,000 during the HBase rolling restart, you need to increase the number of handles to prevent a RegionServer restart failure caused by heavy loads during the restart. + +- Before the restart, check the number of current requests of HBase. If requests of each RegionServer on the native interface exceed 10,000, increase the number of handles to prevent a failure. +- If the number of Core nodes in a cluster is less than six, services may be affected for a short period of time. +- Preferentially perform a rolling instance or service restart and select **Only restart instances whose configurations have expired**. + +.. _mrs_01_0579__en-us_topic_0138965094_section1115494813176: + +Performing a Rolling Service Restart +------------------------------------ + +#. On MRS Manager, click **Services** and select a service for which you want to perform a rolling restart. +#. On the **Service Status** tab page, click **More** and select **Perform Rolling Service Restart**. +#. After you enter the administrator password, the **Perform Rolling Service Restart** page is displayed. Select **Only restart instances whose configurations have expired** and click **OK** to perform rolling restart for the service. +#. After the rolling restart task is complete, click **Finish**. + +.. _mrs_01_0579__en-us_topic_0138965094_section938837152120: + +Performing a Rolling Instance Restart +------------------------------------- + +#. On MRS Manager, click **Services** and select a service for which you want to perform a rolling restart. +#. On the **Instance** tab page, select the instance to be restarted. Click **More** and select **Perform Rolling Instance Restart**. +#. After you enter the administrator password, the **Perform Rolling Instance Restart** page is displayed. Select **Only restart instances whose configurations have expired** and click **OK** to perform rolling restart for the instance. +#. After the rolling restart task is complete, click **Finish**. + +.. 
_mrs_01_0579__en-us_topic_0138965094_section1787148152416: + +Perform a Rolling Cluster Restart +--------------------------------- + +#. On MRS Manager, click **Services**. The **Services** page is displayed. +#. Click **More** and select **Perform Rolling Cluster Restart**. +#. After you enter the administrator password, the **Perform Rolling Cluster Restart** page is displayed. Select **Only restart instances whose configurations have expired** and click **OK** to perform rolling restart for the cluster. +#. After the rolling restart task is complete, click **Finish**. + +Rolling Restart Parameter Description +------------------------------------- + +:ref:`Table 2 ` describes rolling restart parameters. + +.. _mrs_01_0579__en-us_topic_0138965094_table817615121520: + +.. table:: **Table 2** Rolling restart parameter description + + +----------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +==========================================================+===============================================================================================================================================================================================================================================================================+ + | Only restart instances whose configurations have expired | Specifies whether to restart only the modified instances in a cluster. | + +----------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Node Instances to Be Batch Restarted | Specifies the number of instances that are restarted in each batch when the batch rolling restart strategy is used. The default value is **1**. The value ranges from 1 to 20. This parameter is valid only for data nodes. | + +----------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Batch Interval | Specifies the interval between two batches of instances for rolling restart. The default value is **0**. The value ranges from 0 to 2147483647. The unit is second. | + | | | + | | Note: Setting the batch interval parameter can increase the stability of the big data component process during the rolling restart. You are advised to set this parameter to a non-default value, for example, 10. | + +----------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Batch Fault Tolerance Threshold | Specifies the tolerance times when the rolling restart of instances fails to be executed in batches. 
The default value is **0**, which indicates that the rolling restart task ends after any batch of instances fails to be restarted. The value ranges from 0 to 214748364. | + +----------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _mrs_01_0579__en-us_topic_0138965094_section830817219322: + +Procedure in a Typical Scenario +------------------------------- + +#. On MRS Manager, click **Services** and select HBase. The HBase service page is displayed. +#. Click the **Service Configuration** tab, and modify an HBase parameter. After the following dialog box is displayed, click **OK** to save the configurations. + + .. note:: + + Do not select **Restart the affected services or instances**. This option indicates a normal restart. If you select this option, all services or instances will be restarted, which may cause service interruption. + +#. After saving the configurations, click **Finish**. +#. Click the **Service Status** tab. +#. On the **Service Status** tab page, click **More** and select **Perform Rolling Service Restart**. +#. After you enter the administrator password, the **Perform Rolling Service Restart** page is displayed. Select **Only restart instances whose configurations have expired** and click **OK** to perform rolling restart for the service. +#. After the rolling restart task is complete, click **Finish**. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_a_component_database_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_a_component_database_user.rst new file mode 100644 index 0000000..548eb17 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_a_component_database_user.rst @@ -0,0 +1,43 @@ +:original_name: mrs_01_0569.html + +.. _mrs_01_0569: + +Changing the Password of a Component Database User +================================================== + +Scenario +-------- + +This section describes how to periodically change the password of the component database user to improve the system O&M security. + +Impact on the System +-------------------- + +The services need to be restarted for the new password to take effect. The services are unavailable during the restart. + +Procedure +--------- + +#. On MRS Manager, click **Services** and click the name of the database user service to be modified. + +#. Determine the component database user whose password is to be changed. + + - To change the password of the DBService database user, go to :ref:`3 `. + - To change the password of the Loader, Hive, or Hue database user, stop the service first and then execute :ref:`3 `. + + Click **Stop Service**. + +#. .. _mrs_01_0569__en-us_topic_0042008033_li30220842102536: + + Choose **More** > **Change Password**. + +#. Enter the old and new passwords as prompted. + + The password complexity requirements are as follows: + + - The password of the DBService database user contains 16 to 32 characters. The password of the Loader, Hive, or Hue database user contains 8 to 32 characters. 
+ - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. + - The password cannot be the same as the last 20 historical passwords. + +#. Click **OK**. The system automatically restarts the corresponding service. When **Operation successful** is displayed, click **Finish**. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_a_component_running_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_a_component_running_user.rst new file mode 100644 index 0000000..ecb59e4 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_a_component_running_user.rst @@ -0,0 +1,57 @@ +:original_name: mrs_01_0566.html + +.. _mrs_01_0566: + +Changing the Password of a Component Running User +================================================= + +Scenario +-------- + +This section describes how to periodically change the password of the component running user of the MRS cluster to improve the system O&M security. + +If the initial password is randomly generated by the system, reset the password. + +If the password is changed, the downloaded user credential will be unavailable. Download the authentication credential again, and replace the old one. + +Prerequisites +------------- + +A client has been prepared on the **Master1** node. + +Procedure +--------- + +#. Log in to the **Master1** node. + +#. (Optional) To change the password as user **omm**, run the following command to switch the user: + + **sudo su - omm** + +#. Run the following command to switch to the client directory, for example, **/opt/client**: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Run the following command to log in to the console as user **kadmin/admin**: + + **kadmin -p kadmin/admin** + + .. note:: + + The default password of user **kadmin/admin** is **KAdmin@123**, which will expire upon your first login. Change the password as prompted and keep the new password secure. + +#. Run the following command to reset the password of a component running user. This operation takes effect for all servers. + + **cpw** *Component running user name* + + For example, to reset the password of user admin, run the **cpw admin** command. + + For the cluster, the default password complexity requirements are as follows: + + - The password must contain 8 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. 
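+
+   The following is a minimal example of the complete command sequence, assuming the client is installed in **/opt/client** and the password of user **admin** is to be reset (adjust the client path and the username to the actual environment):
+
+   .. code-block::
+
+      # Switch to user omm and load the client environment variables.
+      sudo su - omm
+      cd /opt/client
+      source bigdata_env
+      # Log in to the Kerberos console as kadmin/admin. Enter its password when prompted.
+      kadmin -p kadmin/admin
+      # In the kadmin console, reset the password of user admin. Enter the new password twice as prompted.
+      cpw admin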
diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_an_os_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_an_os_user.rst new file mode 100644 index 0000000..20c040d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_an_os_user.rst @@ -0,0 +1,52 @@ +:original_name: mrs_01_0562.html + +.. _mrs_01_0562: + +Changing the Password of an OS User +=================================== + +Scenario +-------- + +This section describes how to periodically change the login passwords of the OS users **omm**, **ommdba**, and **root** on MRS cluster nodes to improve the system O&M security. + +Passwords of users **omm**, **ommdba**, and **root** on each node can be different. + +Procedure +--------- + +#. Log in to the **Master1** node and then log in to other nodes whose OS user passwords need to be changed. + +#. Run the following command to switch to user **root**: + + **sudo su - root** + +3. Run the following command to change the passwords of users **omm**, **ommdba**, or **root**: + + **passwd omm** + + **passwd ommdba** + + **passwd root** + + For example, if you run the **passwd omm** command, the system displays the following information: + + .. code-block:: + + Changing password for user omm. + New password: + + Enter a new password. The password change policies for an OS vary according to the OS that is used. + + .. code-block:: + + Retype new password: + passwd: all authentication tokens updated successfully. + + .. note:: + + The default password complexity requirements of the MRS cluster are as follows: + + - The password must contain at least eight characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The new password cannot be the same as the last five historical passwords. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_the_data_access_user_of_the_oms_database.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_the_data_access_user_of_the_oms_database.rst new file mode 100644 index 0000000..b24f6ca --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_the_data_access_user_of_the_oms_database.rst @@ -0,0 +1,42 @@ +:original_name: mrs_01_0568.html + +.. _mrs_01_0568: + +Changing the Password of the Data Access User of the OMS Database +================================================================= + +Scenario +-------- + +This section describes how to periodically change the password of the data access user of the OMS database to improve the system O&M security. + +Impact on the System +-------------------- + +The OMS service needs to be restarted for the new password to take effect. The service is unavailable during the restart. + +Procedure +--------- + +#. On MRS Manager, click **System**. + +#. In the **Permission** area, click **Change OMS Database Password**. + +#. Locate the row that contains user **omm**, and click **Change password** in the **Operation** column.
+ + The password complexity requirements are as follows: + + - The password must contain 8 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. + - The password cannot be the same as the last 20 historical passwords. + +#. Click **OK**. When **Operation successful** is displayed, click **Finish**. + +#. Locate the row that contains user **omm**, and click **Restart the OMS service** in the **Operation** column to restart the OMS database. + + .. note:: + + If the password is changed but the OMS database is not restarted, the status of user **omm** changes to **Waiting to restart** and the password cannot be changed until the OMS database is restarted. + +#. In the displayed dialog box, select **I have read the information and understand the impact**. Click **OK**, and restart the OMS service. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_the_kerberos_administrator.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_the_kerberos_administrator.rst new file mode 100644 index 0000000..c709ce3 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_the_kerberos_administrator.rst @@ -0,0 +1,45 @@ +:original_name: mrs_01_0564.html + +.. _mrs_01_0564: + +Changing the Password of the Kerberos Administrator +=================================================== + +Scenario +-------- + +This section describes how to periodically change the password of the Kerberos administrator **kadmin** of the MRS cluster to improve the system O&M security. + +If the password is changed, the downloaded user credential will be unavailable. Download the authentication credential again, and replace the old one. + +Prerequisites +------------- + +A client has been prepared on the **Master1** node. + +Procedure +--------- + +#. Log in to the **Master1** node. + +#. (Optional) To change the password as user **omm**, run the following command to switch the user: + + **sudo su - omm** + +#. Run the following command to switch to the client directory, for example, **/opt/client**. + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Run the following command to change the password of **kadmin/admin**. This operation takes effect for all servers. + + **kpasswd kadmin/admin** + + For the cluster, the default password complexity requirements are as follows: + + - The password must contain at least eight characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. 
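+
+The preceding steps can be summarized as the following minimal sketch, assuming the client is installed in **/opt/client**; adjust the path to match your environment.
+
+.. code-block::
+
+   sudo su - omm          # optional: switch to user omm
+   cd /opt/client         # go to the client installation directory
+   source bigdata_env     # configure environment variables
+   kpasswd kadmin/admin   # enter the old password, then the new password twice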
diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_the_oms_database_administrator.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_the_oms_database_administrator.rst new file mode 100644 index 0000000..2953f5d --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_the_oms_database_administrator.rst @@ -0,0 +1,47 @@ +:original_name: mrs_01_0567.html + +.. _mrs_01_0567: + +Changing the Password of the OMS Database Administrator +======================================================= + +Scenario +-------- + +This section describes how to periodically change the password of the OMS database administrator to improve the system O&M security. + +Procedure +--------- + +#. Log in to the active management node. + + .. note:: + + The password of user **ommdba** cannot be changed on the standby management node. Otherwise, the cluster may not work properly. Change the password on the active management node only. + +#. Run the following command to switch the user: + + **sudo su - omm** + +#. Run the following command to switch the directory: + + **cd $OMS_RUN_PATH/tools** + +#. Run the following command to change the password of user **ommdba**: + + **mod_db_passwd ommdba** + +#. Enter the old password of user **ommdba** and enter a new password twice. + + The password complexity requirements are as follows: + + - The password contains 16 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. + - The password cannot be the same as the last 20 historical passwords. + + If the following information is displayed, the password is changed successfully. + + .. code-block:: + + Congratulations, update [ommdba] password successfully. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_user_admin.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_user_admin.rst new file mode 100644 index 0000000..59d44f6 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_password_of_user_admin.rst @@ -0,0 +1,97 @@ +:original_name: mrs_01_0563.html + +.. _mrs_01_0563: + +Changing the password of user **admin** +======================================= + +This section describes how to periodically change the password of cluster user **admin** to improve the system O&M security. + +If the password is changed, the downloaded user credential will be unavailable. Download the authentication credential again, and replace the old one. + +Changing the Password of User admin on the Cluster Node +------------------------------------------------------- + +#. Update the client of the active management node. For details, see :ref:`Updating a Client (Versions Earlier Than 3.x) `. + +#. Log in to the active management node. + +#. (Optional) To change the password as user **omm**, run the following command to switch the user: + + **sudo su - omm** + +#. Run the following command to switch to the client directory, for example, **/opt/client**. 
+ + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Run the following command to change the password of user **admin**: This operation takes effect in the whole cluster. + + **kpasswd admin** + + Enter the old password and then enter a new password twice. + + For the MRS 1.6.3 or later cluster, the default password complexity requirements are as follows: + + - The password must contain at least eight characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. + +Changing the Password of User admin on MRS Manager +-------------------------------------------------- + +You can change the password of user **admin** on MRS Manager only for clusters with Kerberos authentication enabled and clusters with Kerberos authentication disabled but the EIP function enabled. + +#. Log in to MRS Manager as user **admin**. +#. Click the username in the upper right corner of the page and choose **Change Password**. +#. On the **Change Password** page, set **Old Password**, **New Password**, and **Confirm Password**. + + .. note:: + + The default password complexity requirements are as follows: + + - The password must contain 8 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. + +#. Click **OK**. Log in to MRS Manager with the new password. + +Resetting the Password for User **admin** +----------------------------------------- + +#. Log in to the **Master1** node. + +#. (Optional) To change the password as user **omm**, run the following command to switch the user: + + **sudo su - omm** + +#. Run the following command to switch to the client directory, for example, **/opt/client**: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Run the following command to log in to the console as user **kadmin/admin**: + + **kadmin -p kadmin/admin** + + .. note:: + + The default password of user **kadmin/admin** is **KAdmin@123**, which will expire upon your first login. Change the password as prompted and keep the new password secure. + +#. Run the following command to reset the password of a component running user. This operation takes effect for all servers. + + **cpw** *Component running user name* + + For example, to reset the password of user admin, run the **cpw admin** command. + + For the cluster, the default password complexity requirements are as follows: + + - The password must contain 8 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, spaces, and special characters (``'~!@#$%^&*()-_=+\|[{}];:'",<.>/?``). + - The password cannot be the username or the reverse username. 
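+
+The two command-line methods above can be summarized as the following minimal sketch, assuming the client is installed in **/opt/client**; adjust the path to match your environment.
+
+.. code-block::
+
+   # Change the password when the current password of user admin is known:
+   cd /opt/client
+   source bigdata_env
+   kpasswd admin            # enter the old password, then the new password twice
+
+   # Reset the password when the current password is unknown:
+   kadmin -p kadmin/admin   # log in to the console as user kadmin/admin
+   cpw admin                # at the console prompt, reset the password of user admin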
diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_passwords_of_the_ldap_administrator_and_the_ldap_user.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_passwords_of_the_ldap_administrator_and_the_ldap_user.rst new file mode 100644 index 0000000..b2b0c1f --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/changing_the_passwords_of_the_ldap_administrator_and_the_ldap_user.rst @@ -0,0 +1,40 @@ +:original_name: mrs_01_0565.html + +.. _mrs_01_0565: + +Changing the Passwords of the LDAP Administrator and the LDAP User +================================================================== + +Scenario +-------- + +This section describes how to periodically change the passwords of the LDAP administrator **rootdn:cn=root,dc=hadoop,dc=com** and the LDAP user **pg_search_dn:cn=pg_search_dn,ou=Users,dc=hadoop,dc=com** to improve the system O&M security. + +Impact on the System +-------------------- + +All services need to be restarted for the new password to take effect. The services are unavailable during the restart. + +Procedure +--------- + +#. On MRS Manager, choose **Services > LdapServer > More**. + +#. Click **Change Password**. + +#. In the **Change Password** dialog box, select the user whose password needs to be modified in the **User Information** drop-down box. + +#. Enter the old password in the **Old Password** text box, and enter the new password in the **New Password** and **Confirm Password** text boxes. + + The default password complexity requirements are as follows: + + - The password contains 16 to 32 characters. + - The password must contain at least three types of the following: uppercase letters, lowercase letters, digits, and special characters (:literal:`\`~!@#$%^&*()-_=+\\|[{}];:",<.>/?`). + - The password cannot be the username or the reverse username. + - The new password cannot be the same as the current password. + + .. note:: + + The default password of the LDAP administrator **rootdn:cn=root,dc=hadoop,dc=com** is **LdapChangeMe@123**, and that of the LDAP user **pg_search_dn:cn=pg_search_dn,ou=Users,dc=hadoop,dc=com** is **pg_search_dn@123**. Periodically change the passwords and keep them secure. + +#. Select **I have read the information and understand the impact**, and click **OK** to confirm the modification and restart the service. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/default_users_of_clusters_with_kerberos_authentication_disabled.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/default_users_of_clusters_with_kerberos_authentication_disabled.rst new file mode 100644 index 0000000..d6cbdcd --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/default_users_of_clusters_with_kerberos_authentication_disabled.rst @@ -0,0 +1,113 @@ +:original_name: mrs_01_0561.html + +.. _mrs_01_0561: + +Default Users of Clusters with Kerberos Authentication Disabled +=============================================================== + +User Classification +------------------- + +The MRS cluster provides the following two types of users. Users are advised to periodically change the passwords. It is not recommended to use the default passwords. 
+ ++-----------------------------------+-----------------------------------------------------------------------------------+ +| User Type | Description | ++===================================+===================================================================================+ +| System users | User who runs OMS processes | ++-----------------------------------+-----------------------------------------------------------------------------------+ +| Database users | - User who manages OMS database and accesses data | +| | - User who runs the database of service components (Hive, Loader, and DBService) | ++-----------------------------------+-----------------------------------------------------------------------------------+ + +System users +------------ + +.. note:: + + - User **Idap** of the OS is required in the MRS cluster. Do not delete this account. Otherwise, the cluster may not work properly. Password management policies are maintained by the operation users. + - Reset the passwords when you change the passwords of user **ommdba** and user **omm** for the first time. Change the passwords periodically after retrieving them. + ++-----------------------------------------+-----------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| Operation | Username | Initial Password | Description | ++=========================================+=================+===================================================+=========================================================================================================================================+ +| System administrator of the MRS cluster | admin | Specified by the user during the cluster creation | - Default user of MRS Manager to record cluster audit logs for versions earlier than MRS 1.8.0. | +| | | | | +| | | | - For clusters of 1.8.0 and later versions, the administrator password of MRS Manager is specified by users during cluster creation. | +| | | | | +| | | | This user also has the following permissions: | +| | | | | +| | | | - Common HDFS and ZooKeeper user permissions. | +| | | | - Permissions to submit and query MapReduce and Yarn tasks, manage Yarn queues, and access the Yarn web UI. | +| | | | - Permissions to submit, query, activate, deactivate, reassign, delete topologies, and operate all topologies of the Storm service. | +| | | | - Permissions to create, delete, authorize, reassign, consume, write, and query topics of the Kafka service. | ++-----------------------------------------+-----------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| MRS cluster node OS user | omm | Randomly generated by the system | Internal running user of the MRS cluster system. This user is an OS user generated on all node and does not require a unified password. | ++-----------------------------------------+-----------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| MRS cluster node OS user | root | Set by the user | User for logging in to the node in the MRS cluster. This user is an OS user generated on all nodes. | +| | | | | +| | | .. 
note:: | | +| | | | | +| | | Applicable to clusters of MRS 1.6.2 and later. | | ++-----------------------------------------+-----------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + +User Group Information +---------------------- + ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Default User Group | Description | ++=======================+=================================================================================================================================================================================================================================+ +| supergroup | Primary group of user **admin**, which has no additional permissions in the cluster with Kerberos authentication disabled. | ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| check_sec_ldap | Used to test whether the active LDAP works properly. This user group is generated randomly in a test and automatically deleted after the test is complete. which is an internal system user group used only between components. | ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_tenant | Tenant system user group, which is an internal system user group used only between components. It is used only in clusters with Kerberos authentication enabled. | ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| System_administrator | MRS cluster system administrator group, which is an internal system user group used only between components. It is used only in clusters with Kerberos authentication enabled. | ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_viewer | MRS Manager system viewer group, which is an internal system user group used only between components. It is used only in clusters with Kerberos authentication enabled. | ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_operator | MRS Manager system operator group, which is an internal system user group used only between components. It is used only in clusters with Kerberos authentication enabled. 
| ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_auditor | MRS Manager system auditor group, which is an internal system user group used only between components. It is used only in clusters with Kerberos authentication enabled. | ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_administrator | MRS Manager system administrator group, which is an internal system user group used only between components. It is used only in clusters with Kerberos authentication enabled. | ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| compcommon | MRS cluster internal group, used to access public resources in the cluster. All system users and system running users are added to this user group by default. | ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| default_1000 | User group created for tenants, which is an internal system user group used only between components. | ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| launcher-job | MRS internal group, which is used to submit jobs using V2 APIs. | ++-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + ++---------------+----------------------------------------------------------------------------------------------------------------------------------+ +| OS User Group | Description | ++===============+==================================================================================================================================+ +| wheel | Primary group of MRS internal running user **omm**. | ++---------------+----------------------------------------------------------------------------------------------------------------------------------+ +| ficommon | MRS cluster common group that corresponds to **compcommon** for accessing public resource files stored in the OS of the cluster. | ++---------------+----------------------------------------------------------------------------------------------------------------------------------+ + +Database users +-------------- + +MRS cluster system database users include OMS database users and DBService database users. + +.. note:: + + Do not delete database users. Otherwise, the cluster or components may not work properly. 
+ ++--------------------+--------------+-------------------+-----------------------------------------------------------------------------------------------------------+ +| Operation | Default User | Initial Password | Description | ++====================+==============+===================+===========================================================================================================+ +| OMS database | ommdba | dbChangeMe@123456 | OMS database administrator who performs maintenance operations, such as creating, starting, and stopping. | ++--------------------+--------------+-------------------+-----------------------------------------------------------------------------------------------------------+ +| | omm | ChangeMe@123456 | User for accessing OMS database data | ++--------------------+--------------+-------------------+-----------------------------------------------------------------------------------------------------------+ +| DBService database | omm | dbserverAdmin@123 | Administrator of the GaussDB database in the DBService component | ++--------------------+--------------+-------------------+-----------------------------------------------------------------------------------------------------------+ +| | hive | HiveUser@ | User for Hive to connect to the DBService database | ++--------------------+--------------+-------------------+-----------------------------------------------------------------------------------------------------------+ +| | hue | HueUser@123 | User for Hue to connect to the DBService database | ++--------------------+--------------+-------------------+-----------------------------------------------------------------------------------------------------------+ +| | sqoop | SqoopUser@ | User for Loader to connect to the DBService database. | ++--------------------+--------------+-------------------+-----------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/default_users_of_clusters_with_kerberos_authentication_enabled.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/default_users_of_clusters_with_kerberos_authentication_enabled.rst new file mode 100644 index 0000000..a4cc5fb --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/default_users_of_clusters_with_kerberos_authentication_enabled.rst @@ -0,0 +1,186 @@ +:original_name: mrs_01_24044.html + +.. _mrs_01_24044: + +Default Users of Clusters with Kerberos Authentication Enabled +============================================================== + +User Classification +------------------- + +The MRS cluster provides the following three types of users. Users are advised to periodically change the passwords. It is not recommended to use the default passwords. + ++-----------------------------------+-------------------------------------------------------------------------------------------------------------------+ +| User Type | Description | ++===================================+===================================================================================================================+ +| System user | - User created on Manager for MRS cluster O&M and service scenarios. There are two types of users: | +| | | +| | - **Human-machine** user: used for Manager O&M scenarios and component client operation scenarios. 
| +| | - **Machine-machine** user: used for MRS cluster application development scenarios. | +| | | +| | - User who runs OMS processes. | ++-----------------------------------+-------------------------------------------------------------------------------------------------------------------+ +| Internal system user | Internal user who performs process communications, saves user group information, and associates user permissions. | ++-----------------------------------+-------------------------------------------------------------------------------------------------------------------+ +| Database user | - User who manages OMS database and accesses data. | +| | - User who runs the database of service components (Hive, Hue, Loader, and DBService) | ++-----------------------------------+-------------------------------------------------------------------------------------------------------------------+ + +System User +----------- + +.. note:: + + - User **Idap** of the OS is required in the MRS cluster. Do not delete this account. Otherwise, the cluster may not work properly. Password management policies are maintained by the operation users. + - Reset the passwords when you change the passwords of user **ommdba** and user **omm** for the first time. Change the passwords periodically after retrieving them. + ++-----------------------------------------+-----------------+----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+ +| Type | Username | Initial Password | Description | ++=========================================+=================+====================================================+==========================================================================================================================================+ +| System administrator of the MRS cluster | admin | Specified by the user during the cluster creation. | Manager administrator | +| | | | | +| | | | with the following permissions: | +| | | | | +| | | | - Common HDFS and ZooKeeper user permissions. | +| | | | - Permissions to submit and query MapReduce and Yarn tasks, manage Yarn queues, and access the Yarn web UI. | +| | | | - Permissions to submit, query, activate, deactivate, reassign, delete topologies, and operate all topologies of the Storm service. | +| | | | - Permissions to create, delete, authorize, reassign, consume, write, and query topics of the Kafka service. | ++-----------------------------------------+-----------------+----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+ +| MRS cluster node OS user | omm | Randomly generated by the system. | Internal running user of the MRS cluster system. This user is an OS user generated on all nodes and does not require a unified password. | ++-----------------------------------------+-----------------+----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+ +| MRS cluster node OS user | root | Set by the user. | User for logging in to the node in the MRS cluster. This user is an OS user generated on all nodes. 
| ++-----------------------------------------+-----------------+----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+ + +Internal System Users +--------------------- + +.. note:: + + Do not delete the following internal system users. Otherwise, the cluster or components may not work properly. + ++------------------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------+ +| Type | Default User | Initial Password | Description | ++========================+=================+==================+===============================================================================================================================+ +| Component running user | hdfs | Hdfs@123 | This user is the HDFS system administrator and has the following permissions: | +| | | | | +| | | | #. File system operation permissions: | +| | | | | +| | | | - Views, modifies, and creates files. | +| | | | - Views and creates directories. | +| | | | - Views and modifies the groups where files belong. | +| | | | - Views and sets disk quotas for users. | +| | | | | +| | | | #. HDFS management operation permissions: | +| | | | | +| | | | - Views the web UI status. | +| | | | - Views and sets the active and standby HDFS status. | +| | | | - Enters and exits the HDFS in security mode. | +| | | | - Checks the HDFS file system. | ++------------------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------+ +| | hbase | Hbase@123 | This user is the HBase system administrator and has the following permissions: | +| | | | | +| | | | - Cluster management permission: **Enable** and **Disable** operations on tables to trigger MajorCompact and ACL operations. | +| | | | - Grants and revokes permissions, and shuts down the cluster. | +| | | | - Table management permission: Creates, modifies, and deletes tables. | +| | | | - Data management permission: Reads and writes data in tables, column families, and columns. | +| | | | - Accesses the HBase web UI. | ++------------------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------+ +| | mapred | Mapred@123 | This user is the MapReduce system administrator and has the following permissions: | +| | | | | +| | | | - Submits, stops, and views the MapReduce tasks. | +| | | | - Modifies the Yarn configuration parameters. | +| | | | - Accesses the Yarn and MapReduce web UI. | ++------------------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------+ +| | spark | Spark@123 | This user is the Spark system administrator and has the following permissions: | +| | | | | +| | | | - Accesses the Spark web UI. | +| | | | - Submits Spark tasks. 
| ++------------------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------+ + +User Group Information +---------------------- + ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Default User Group | Description | ++=======================+================================================================================================================================================================================================================================+ +| hadoop | Users added to this user group have the permission to submit tasks to all Yarn queues. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| hbase | Common user group. Users added to this user group will not have any additional permission. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| hive | Users added to this user group can use Hive. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| spark | Common user group. Users added to this user group will not have any additional permission. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| supergroup | Users added to this user group can have the administrator permission of HBase, HDFS, and Yarn and can use Hive. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| check_sec_ldap | Used to test whether the active LDAP works properly. This user group is generated randomly in a test and automatically deleted after the test is complete. This is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_tenant | Tenant system user group, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| System_administrator | MRS cluster system administrator group, which is an internal system user group used only between components. 
| ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_viewer | MRS Manager system viewer group, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_operator | MRS Manager system operator group, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_auditor | MRS Manager system auditor group, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Manager_administrator | MRS Manager system administrator group, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| compcommon | Internal system group for accessing public resources in a cluster. All system users and system running users are added to this user group by default. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| default_1000 | User group created for tenants, which is an internal system user group used only between components. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| kafka | Kafka common user group. Users added to this group need to be granted with read and write permission by users in the **kafkaadmin** group before accessing the desired topics. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| kafkasuperuser | Users added to this group have permissions to read data from and write data to all topics. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| kafkaadmin | Kafka administrator group. Users added to this group have the permissions to create, delete, authorize, as well as read from and write data to all topics. 
| ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| storm | Storm common user group. Users added to this group have the permissions to submit topologies and manage their own topologies. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| stormadmin | Storm administrator user group. Users added to this group have the permissions to submit topologies and manage their own topologies. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| opentsdb | Common user group. Users added to this user group will not have any additional permission. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| presto | Common user group. Users added to this user group will not have any additional permission. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| flume | Common user group. Users added to this user group will not have any additional permission. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| launcher-job | MRS internal group, which is used to submit jobs using V2 APIs. | ++-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + ++---------------+----------------------------------------------------------------------------------------------------------------------------------+ +| OS User Group | Description | ++===============+==================================================================================================================================+ +| wheel | Primary group of MRS internal running user **omm**. | ++---------------+----------------------------------------------------------------------------------------------------------------------------------+ +| ficommon | MRS cluster common group that corresponds to **compcommon** for accessing public resource files stored in the OS of the cluster. | ++---------------+----------------------------------------------------------------------------------------------------------------------------------+ + +Database User +------------- + +MRS cluster system database users include OMS database users and DBService database users. + +.. note:: + + Do not delete database users. Otherwise, the cluster or components may not work properly. 
+ ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| Type | Default User | Initial Password | Description | ++====================+==============+===================+========================================================================================================================+ +| OMS database | ommdba | dbChangeMe@123456 | OMS database administrator who performs maintenance operations, such as creating, starting, and stopping applications. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| | omm | ChangeMe@123456 | User for accessing OMS database data. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| DBService database | omm | dbserverAdmin@123 | Administrator of the GaussDB database in the DBService component. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| | hive | HiveUser@ | User for Hive to connect to the DBService database. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| | hue | HueUser@123 | User for Hue to connect to the DBService database. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| | sqoop | SqoopUser@ | User for Loader to connect to the DBService database. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ +| | ranger | RangerUser@ | User for Ranger to connect to the DBService database. | ++--------------------+--------------+-------------------+------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/index.rst new file mode 100644 index 0000000..b89c4af --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/index.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_0560.html + +.. 
_mrs_01_0560: + +Security Management +=================== + +- :ref:`Default Users of Clusters with Kerberos Authentication Disabled ` +- :ref:`Default Users of Clusters with Kerberos Authentication Enabled ` +- :ref:`Changing the Password of an OS User ` +- :ref:`Changing the password of user admin ` +- :ref:`Changing the Password of the Kerberos Administrator ` +- :ref:`Changing the Passwords of the LDAP Administrator and the LDAP User ` +- :ref:`Changing the Password of a Component Running User ` +- :ref:`Changing the Password of the OMS Database Administrator ` +- :ref:`Changing the Password of the Data Access User of the OMS Database ` +- :ref:`Changing the Password of a Component Database User ` +- :ref:`Replacing the HA Certificate ` +- :ref:`Updating Cluster Keys ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + default_users_of_clusters_with_kerberos_authentication_disabled + default_users_of_clusters_with_kerberos_authentication_enabled + changing_the_password_of_an_os_user + changing_the_password_of_user_admin + changing_the_password_of_the_kerberos_administrator + changing_the_passwords_of_the_ldap_administrator_and_the_ldap_user + changing_the_password_of_a_component_running_user + changing_the_password_of_the_oms_database_administrator + changing_the_password_of_the_data_access_user_of_the_oms_database + changing_the_password_of_a_component_database_user + replacing_the_ha_certificate + updating_cluster_keys diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/replacing_the_ha_certificate.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/replacing_the_ha_certificate.rst new file mode 100644 index 0000000..a5f73fa --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/replacing_the_ha_certificate.rst @@ -0,0 +1,93 @@ +:original_name: mrs_01_0571.html + +.. _mrs_01_0571: + +Replacing the HA Certificate +============================ + +Scenario +-------- + +HA certificates are used to encrypt the communication between active/standby processes and HA processes to ensure the communication security. This section describes how to replace the HA certificates on the active and standby management nodes on MRS Manager to ensure the product security. + +The certificate file and key file can be generated by the user. + +Impact on the System +-------------------- + +MRS Manager needs to be restarted during the replacement and cannot be accessed or provide services at that time. + +Prerequisites +------------- + +- You have obtained the **root-ca.crt** HA root certificate file and the **root-ca.pem** key file to be replaced. + +- You have prepared a password, such as **Userpwd@123**, for accessing the key file. + + To avoid potential security risks, the password must meet the following complexity requirements: + + - The password must contain at least eight characters. + - The password must contain at least four types of the following characters: uppercase letters, lowercase letters, digits, and special characters (:literal:`~`!?,.:;-_'(){}[]/<>@#$%^&*+|\\=`). + +Procedure +--------- + +#. Log in to the active management node. + +#. Run the following commands to switch the user: + + **sudo su - root** + + **su - omm** + +#. 
Run the following commands to generate **root-ca.crt** and **root-ca.pem** in the **${OMS_RUN_PATH}/workspace0/ha/local/cert** directory on the active management node: + + **sh ${OMS_RUN_PATH}/workspace/ha/module/hacom/script/gen-cert.sh --root-ca --country=**\ *country* **--state=**\ *state* **--city=**\ *city* **--company=**\ *company* **--organize=**\ *organize* **--common-name=**\ *commonname* **--email=**\ *Administrator email address* **--password=**\ *password* + + For example, run the following command: **sh ${OMS_RUN_PATH}/workspace/ha/module/hacom/script/gen-cert.sh --root-ca --country=DE --state=eur --city=ber --company=dt --organize=IT --common-name=HADOOP.COM --email=abc@dt.com --password=Userpwd@123** + + The command has been executed successfully if the following information is displayed: + + .. code-block:: + + Generate root-ca pair success. + +#. On the active management node, run the following command as user **omm** to copy **root-ca.crt** and **root-ca.pem** to the **${BIGDATA_HOME}/om-0.0.1/security/certHA** directory: + + **cp -arp ${OMS_RUN_PATH}/workspace0/ha/local/cert/root-ca.\* ${BIGDATA_HOME}/om-0.0.1/security/certHA** + +#. Copy **root-ca.crt** and **root-ca.pem** generated on the active management node to the **${BIGDATA_HOME}/om-0.0.1/security/certHA** directory on the standby management node as user **omm**. + +#. .. _mrs_01_0571__en-us_topic_0042008035_li61539631113353: + + Run the following command to generate an HA certificate and perform the automatic replacement: + + **sh ${BIGDATA_HOME}/om-0.0.1/sbin/replacehaSSLCert.sh** + + Enter the password as prompted, and press **Enter**. + + .. code-block:: + + Please input ha ssl cert password: + + The HA certificate is replaced successfully if the following information is displayed: + + .. code-block:: + + [INFO] Succeed to replace ha ssl cert. + +#. .. _mrs_01_0571__en-us_topic_0042008035_li61839614113353: + + Run the following command to restart OMS: + + **sh ${BIGDATA_HOME}/om-0.0.1/sbin/restart-oms.sh** + + The following information is displayed: + + .. code-block:: + + start HA successfully. + +#. Log in to the standby management node and switch to user **omm**. Repeat step :ref:`6 ` to step :ref:`7 `. + + Run the **sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh** command to check whether **HAAllResOK** of the management node is **Normal**. Access MRS Manager again. If MRS Manager can be accessed, the operation is successful. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/updating_cluster_keys.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/updating_cluster_keys.rst new file mode 100644 index 0000000..45ca0af --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/security_management/updating_cluster_keys.rst @@ -0,0 +1,67 @@ +:original_name: mrs_01_0572.html + +.. _mrs_01_0572: + +Updating Cluster Keys +===================== + +Scenario +-------- + +When a cluster is installed, an encryption key is generated automatically to store the security information in the cluster (such as all database user passwords and key file access passwords) in encryption mode. After the cluster is successfully installed, you are advised to periodically update the encryption key based on the following procedure. + +Impact on the System +-------------------- + +- After a cluster key is updated, a new key is generated randomly in the cluster. 
This key is used to encrypt and decrypt the newly stored data. The old key is not deleted, and it is used to decrypt data encrypted using the old key. After security information is modified, for example, a database user password is changed, the new password is encrypted using the new key. +- When the key is updated, the cluster is stopped and cannot be accessed. + +Prerequisites +------------- + +The upper-layer applications depending on the cluster are stopped. + +Procedure +--------- + +#. Log in to MRS Manager and choose **Services** > **More** > **Stop Cluster**. + + In the displayed dialog box, select **I have read the information and understand the impact.** Click **OK**. Wait until the system displays a message indicating that the operation is successful. Click **Finish**. The cluster is stopped successfully. + +#. Log in to the active management node. + +#. Run the following commands to switch the user: + + **sudo su - omm** + +#. Run the following command to disable logout upon timeout: + + **TMOUT=0** + +#. Run the following command to switch the directory: + + **cd ${BIGDATA_HOME}/om-0.0.1/tools** + +#. Run the following command to update the cluster key: + + **sh updateRootKey.sh** + + Enter **y** as prompted. + + .. code-block:: + + The root key update is a critical operation. + Do you want to continue?(y/n): + + The key is updated successfully if the following information is displayed: + + .. code-block:: + + ... + Step 4-1: The key save path is obtained successfully. + ... + Step 4-4: The root key is sent successfully. + +#. On MRS Manager, choose **Services > More > Start Cluster**. + + In the displayed dialog box, click **OK**. After **Operation successful** is displayed, click **Finish**. The cluster is started. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/static_service_pool_management/configuring_a_static_service_pool.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/static_service_pool_management/configuring_a_static_service_pool.rst new file mode 100644 index 0000000..35f9690 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/static_service_pool_management/configuring_a_static_service_pool.rst @@ -0,0 +1,123 @@ +:original_name: mrs_01_0536.html + +.. _mrs_01_0536: + +Configuring a Static Service Pool +================================= + +Scenario +-------- + +If you need to control the node resources that can be used by the cluster service or the CPU usage of the node used by the cluster in different time periods, you can adjust the resource base on MRS Manager and customize the resource configuration groups. + +Prerequisites +------------- + +- After the static service pool is configured, the HDFS and YARN services need to be restarted. During the restart, the services are unavailable. +- After a static service pool is configured, the maximum number of resources used by each service and role instance cannot exceed the upper limit. + +Procedure +--------- + +#. Modify the system resource adjustment base. + + a. On MRS Manager, click **System**. In the **Resource** area, click **Configure Static Service Pool**. + + b. Click **Configuration**. The service pool configuration group management page is displayed. + + c. In the **System Resource Adjustment Base** area, change the values of **CPU(%)** and **Memory(%)** . 
+ + Modifying **System Resource Adjustment Base** limits the maximum physical CPU and memory resource percentage of nodes that can be used by the Flume, HBase, HDFS, Impala and YARN services. If multiple services are deployed on the same node, the maximum physical resource usage of all services cannot exceed the adjusted CPU or memory usage. + + d. Click **Next**. + + If you need to modify the parameters again, click **Previous** in the lower part of the page. + +#. Modify the **default** configuration group of the service pool. + + a. Click **default**. In the **Service Pool Configuration** table, set **CPU LIMIT(%)**, **CPU SHARE(%)**, **I/O(%)**, and **Memory(%)** for the Flume, HBase, HDFS, Impala and YARN services. + + .. note:: + + - The sum of **CPU LIMIT(%)** used by all services can exceed 100%. + - The sum of **CPU SHARE(%)** and **I/O(%)** used by all services must be 100%. For example, if CPU resources are allocated to the HDFS and Yarn services, the total CPU resources allocated to the two services are 100%. + - The sum of **Memory(%)** used by all services can be greater than, smaller than, or equal to 100%. + - **Memory(%)** cannot take effect dynamically and can only be modified in the default configuration group. + + b. Click in the blank area of the page to complete the editing. MRS Manager generates the correct values of service pool parameters in the **Detailed Configuration** area based on the cluster hardware resources and allocation information. + + c. You can click |image1| on the right of **Detailed Configuration** to modify the parameter values of the service pool based on service requirements. + + In the **Service Pool Configuration** area, click the specified service name. The **Detailed Configuration** area displays only the parameters of the service. Manual changing of parameter values does not refresh the service resource usage. In added configuration groups, the configuration group numbers of the parameters that take effect dynamically will be displayed. For example, **HBase: RegionServer: dynamic-config1.RES_CPUSET_PERCENTAGE**. The parameter functions do not change. + + .. table:: **Table 1** Parameters of the static service pool + + +------------------------------------------+---------------------------------------------------------------------------------+ + | Parameter | Description | + +==========================================+=================================================================================+ + | - RES_CPUSET_PERCENTAGE | Configures the service CPU percentage. | + | - dynamic-configX.RES_CPUSET_PERCENTAGE | | + +------------------------------------------+---------------------------------------------------------------------------------+ + | - RES_CPU_SHARE | Configures the service CPU share. | + | - dynamic-configX.RES_CPU_SHARE | | + +------------------------------------------+---------------------------------------------------------------------------------+ + | - RES_BLKIO_WEIGHT | Configures service I/O usage. | + | - dynamic-configX.RES_BLKIO_WEIGHT | | + +------------------------------------------+---------------------------------------------------------------------------------+ + | HBASE_HEAPSIZE | Configures the maximum JVM memory for RegionServer. | + +------------------------------------------+---------------------------------------------------------------------------------+ + | HADOOP_HEAPSIZE | Configures the maximum JVM memory of a DataNode. 
| + +------------------------------------------+---------------------------------------------------------------------------------+ + | yarn.nodemanager.resource.memory-mb | Configures the memory that can be used by NodeManager on the current node. | + +------------------------------------------+---------------------------------------------------------------------------------+ + | dfs.datanode.max.locked.memory | Configures the maximum memory that can be used by a DataNode as the HDFS cache. | + +------------------------------------------+---------------------------------------------------------------------------------+ + | FLUME_HEAPSIZE | Configures the maximum JVM memory that can be used by each Flume instance. | + +------------------------------------------+---------------------------------------------------------------------------------+ + | IMPALAD_MEM_LIMIT | Configures the maximum memory that can be used by an Impalad instance. | + +------------------------------------------+---------------------------------------------------------------------------------+ + +#. Add a customized resource configuration group. + + a. Determine whether to automatically adjust resource configurations based on the time. + + If yes, go to :ref:`3.b `. + + If no, go to :ref:`4 `. + + b. .. _mrs_01_0536__en-us_topic_0035209694_li207277341970: + + Click |image2| to add a resource configuration group. In the **Scheduling Time** area, click |image3|. The time policy configuration page is displayed. + + Modify the following parameters based on service requirements and click **OK**. + + - **Repeat**: If selected, the resource configuration group runs repeatedly based on the scheduling period. If not selected, set the date and time when the configuration of the group of resources can be applied. + - **Repeat Policy**: can be set to **Daily**, **Weekly**, and **Monthly**. This parameter is valid only when **Repeat** is selected. + - **Between**: indicates the time period between the start time and end time when the resource configuration is applied. Set a unique time range. If the time range overlaps with that of an existing group of resource configuration, the time range cannot be saved. This parameter is valid only when **Repeat** is selected. + + .. note:: + + - The **default** group of resource configuration takes effect in all undefined time segments. + - The newly added resource group is a parameter set that takes effect dynamically in a specified time range. + - The newly added resource group can be deleted. A maximum of four resource configuration groups that take effect dynamically can be added. + - Select a repetition policy. If the end time is earlier than the start time, the next day is labeled by default. For example, if a validity period ranges from 22:00 to 06:00, the customized resource configuration takes effect from 22:00 on the current day to 06:00 on the next day. + - If the repeat policy types of multiple configuration groups are different, the time ranges can overlap. The policy types are listed as follows by priority from low to high: daily, weekly, and monthly. The following is an example. There are two resource configuration groups using the monthly and daily policies, respectively. Their application time ranges in a day overlap as follows: [04:00 to 07:00] and [06:00 to 08:00]. In this case, the configuration of the group that uses the monthly policy prevails. 
- If the repeat policy types of multiple resource configuration groups are the same, the time ranges of different dates can overlap. For example, if there are two weekly scheduling groups, you can set the same time range on different days for them, such as 04:00 to 07:00 on Monday and Wednesday, respectively. + + c. On the **Service Pool Configuration** page, modify the resource configuration of each service. Click the blank area on the page to complete the editing, and go to :ref:`4 `. + + You can click |image4| on the right of **Service Pool Configuration** to modify the parameters. Click |image5| in the **Detailed Configuration** area to manually update the parameter values generated by the system based on service requirements. + +#. .. _mrs_01_0536__en-us_topic_0035209694_li5675506119820: + + Save the settings. + + Click **Save**. In the **Save Configuration** dialog box, select **Restart the affected services or instances**. Click **OK** to save the settings and restart related services. + + After **Operation succeeded** is displayed, click **Finish**. The service is started successfully. + +.. |image1| image:: /_static/images/en-us_image_0000001348738077.gif +.. |image2| image:: /_static/images/en-us_image_0000001295738420.jpg +.. |image3| image:: /_static/images/en-us_image_0000001348738077.gif +.. |image4| image:: /_static/images/en-us_image_0000001348738077.gif +.. |image5| image:: /_static/images/en-us_image_0000001348738077.gif diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/static_service_pool_management/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/static_service_pool_management/index.rst new file mode 100644 index 0000000..9117b2e --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/static_service_pool_management/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_0534.html + +.. _mrs_01_0534: + +Static Service Pool Management +============================== + +- :ref:`Viewing the Status of a Static Service Pool ` +- :ref:`Configuring a Static Service Pool ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + viewing_the_status_of_a_static_service_pool + configuring_a_static_service_pool diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/static_service_pool_management/viewing_the_status_of_a_static_service_pool.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/static_service_pool_management/viewing_the_status_of_a_static_service_pool.rst new file mode 100644 index 0000000..ddccd14 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/static_service_pool_management/viewing_the_status_of_a_static_service_pool.rst @@ -0,0 +1,86 @@ +:original_name: mrs_01_0535.html + +.. _mrs_01_0535: + +Viewing the Status of a Static Service Pool +=========================================== + +Scenario +-------- + +MRS Manager manages and isolates service resources that are not running on YARN through the static service resource pool. It dynamically manages the total CPU, I/O, and memory resources that can be used by HDFS and YARN on the deployment node. The system supports time-based automatic adjustment of static service resource pools. This enables the cluster to automatically adjust the parameter values at different periods to ensure more efficient resource utilization. 
+ +On MRS Manager, you can view the monitoring metrics of the resources used by each service in the static service pool. The monitoring metrics are as follows: + +- Service Total CPU Usage +- Service Total Disk I/O Read Speed +- Service Total Disk I/O Write Speed +- Service Total Memory Usage + +Procedure +--------- + +#. On MRS Manager, click **System**. In the **Resource** area, click **Configure Static Service Pool**. +#. Click **Status**. +#. Check the system resource adjustment base values. + + - **System Resource Adjustment Base** indicates the maximum volume of resources that can be used by each node in the cluster. If a node has only one service, the service exclusively occupies the available resources on the node. If a node has multiple services, all services share the available resources on the node. + - **CPU(%)** indicates the maximum percentage of CPU resources that can be used by services on a node. + - **Memory(%)** indicates the maximum memory that can be used by services on a node. + +4. Check the cluster service resource usage. + + In the chart area, select **All services** from the service drop-down list box. The resource usage status of all services in the service pool is displayed. + + .. note:: + + **Effective Configuration Group** indicates the resource control configuration group used by the cluster service. By default, the **default** configuration group is used at all times every day, indicating that the cluster service can use all CPUs and 70% of the memory of the node. + +5. View the resource usage of a single service. + + In the chart area, select a service from the service drop-down list box. The resource usage status of the service is displayed. + +6. You can set the interval for automatically refreshing the page. + + The following refresh interval options are supported: + + - **Refresh every 30 seconds** + - **Refresh every 60 seconds** + - **Stop refreshing** + +7. In the **Period** area, select a time range for viewing service resources. The options are as follows: + + - Real time + - Last 3 hours + - Last 6 hours + - Last 24 hours + - Last week + - Last month + - Last 3 months + - Last 6 months + - Customize: If you select this option, you can customize the period for viewing monitoring data. + +8. Click **View** to view the service resource data in the corresponding time range. + +9. Customize a service resource report. + + a. Click **Customize** and select the service resource metrics to be displayed. + + - Service Total Disk I/O Read Speed + - Service Total Memory Usage + - Service Total Disk I/O Write Speed + - Service Total CPU Usage + + b. Click **OK** to save the selected monitoring metrics for display. + + .. note:: + + Click **Clear** to cancel all the selected monitoring metrics in a batch. + +10. Export a monitoring report. + + Click **Export**. MRS Manager will generate a report about the selected service resources in the specified period of time. Save the report. + + .. note:: + + To view the curve charts of monitoring metrics in a specified period, click **View**. 
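+
+If you also want to spot-check on a cluster node that the CPU and I/O limits configured through the static service pool have taken effect, and assuming the isolation is implemented with Linux control groups (the parameter names **RES_CPU_SHARE**, **RES_BLKIO_WEIGHT**, and **RES_CPUSET_PERCENTAGE** listed in the static service pool configuration topic suggest this, although the exact hierarchy is not documented here), a check could look like the following sketch. The cgroup mount point and group name are hypothetical placeholders rather than documented MRS paths.
+
+.. code-block:: bash
+
+   # Hypothetical spot check on a cluster node (cgroup v1); the group name used
+   # for a service is a placeholder, not a documented MRS path.
+   CGROUP=/sys/fs/cgroup
+   SERVICE_GROUP=hadoop            # placeholder group name
+
+   # Relative CPU weight (conceptually maps to RES_CPU_SHARE).
+   cat ${CGROUP}/cpu/${SERVICE_GROUP}/cpu.shares
+
+   # Block I/O weight (conceptually maps to RES_BLKIO_WEIGHT).
+   cat ${CGROUP}/blkio/${SERVICE_GROUP}/blkio.weight
+
+   # CPU cores assigned to the group (conceptually maps to RES_CPUSET_PERCENTAGE).
+   cat ${CGROUP}/cpuset/${SERVICE_GROUP}/cpuset.cpus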
diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/clearing_configuration_of_a_queue.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/clearing_configuration_of_a_queue.rst new file mode 100644 index 0000000..96ad6bd --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/clearing_configuration_of_a_queue.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_0549.html + +.. _mrs_01_0549: + +Clearing Configuration of a Queue +================================= + +Scenario +-------- + +Users can clear the configuration of a queue on MRS Manager when the queue does not need resources from a resource pool or if a resource pool needs to be disassociated from the queue. Clearing queue configurations means that the resource capacity policy of the queue is canceled. + +Prerequisites +------------- + +If a queue is to be unbound from a resource pool, this resource pool cannot serve as the default resource pool of the queue. Therefore, you must first change the default resource pool of the queue to another one. For details, see :ref:`Configuring a Queue `. + +Procedure +--------- + +#. On MRS Manager, click **Tenant**. + +#. Click the **Dynamic Resource Plan** tab. + +#. In **Resource Pools**, select a specified resource pool. + +#. Locate the specified queue in the **Resource Allocation** table, and click **Clear** in the **Operation** column. + + In the **Clear Queue Configuration** dialog box, click **OK** to clear the queue configuration in the current resource pool. + + .. note:: + + If no resource capacity policy is configured for a queue, the clearing function is unavailable for the queue by default. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/configuring_a_queue.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/configuring_a_queue.rst new file mode 100644 index 0000000..876893c --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/configuring_a_queue.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_0547.html + +.. _mrs_01_0547: + +Configuring a Queue +=================== + +Scenario +-------- + +This section describes how to modify the queue configuration for a specified tenant on MRS Manager. + +Prerequisites +------------- + +A tenant associated with Yarn and allocated dynamic resources has been added. + +Procedure +--------- + +#. On MRS Manager, click **Tenant**. +#. Click the **Dynamic Resource Plan** tab. +#. Click the **Queue Configuration** tab. +#. In the tenant queue table, click **Modify** in the **Operation** column of the specified tenant queue. + + .. note:: + + In the tenant list on the left of the **Tenant Management** tab, click the target tenant. In the window that is displayed, choose **Resource**. On the page that is displayed, click |image1| to open the queue modification page. + + .. 
table:: **Table 1** Queue configuration parameters + + +--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +================================+=================================================================================================================================================================================================================================================================+ + | Maximum Application | Specifies the maximum number of applications. The value ranges from 1 to 2147483647. | + +--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Maximum AM Resource Percent | Specifies the maximum percentage of resources that can be used to run the ApplicationMaster in a cluster. The value ranges from 0 to 1. | + +--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Minimum User Limit Percent (%) | Specifies the minimum percentage of resources consumed by a user. The value ranges from 0 to 100. | + +--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | User Limit Factor | Specifies the limit factor of the maximum user resource usage. The maximum user resource usage percentage can be obtained by multiplying the limit factor with the percentage of the tenant's actual resource usage in the cluster. The minimum value is **0**. | + +--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Status | Specifies the current status of a resource plan. The values are **Running** and **Stopped**. | + +--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool | Specifies the resource pool used by a queue. The default value is **Default**. If you want to change the resource pool, configure the queue capacity first. For details, see :ref:`Configuring the Queue Capacity Policy of a Resource Pool `. | + +--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. 
|image1| image:: /_static/images/en-us_image_0000001348738077.gif diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/configuring_the_queue_capacity_policy_of_a_resource_pool.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/configuring_the_queue_capacity_policy_of_a_resource_pool.rst new file mode 100644 index 0000000..2c20b7f --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/configuring_the_queue_capacity_policy_of_a_resource_pool.rst @@ -0,0 +1,39 @@ +:original_name: mrs_01_0548.html + +.. _mrs_01_0548: + +Configuring the Queue Capacity Policy of a Resource Pool +======================================================== + +Scenario +-------- + +After a resource pool is added, the capacity policies of available resources need to be configured for Yarn task queues. This ensures that tasks in the resource pool are running properly. Each queue can be configured with the queue capacity policy of only one resource pool. Users can view the queues in any resource pool and configure queue capacity policies. After the queue policies are configured, Yarn task queues and resource pools are associated. + +You can configure queue policies on MRS Manager. + +Prerequisites +------------- + +- A resource pool has been added. +- The task queues are not associated with other resource pools. By default, all queues are associated with the **Default** resource pool. + +Procedure +--------- + +#. On MRS Manager, click **Tenant**. + +#. Click the **Dynamic Resource Plan** tab. + +#. In **Resource Pools**, select a specified resource pool. + + **Available Resource Quota**: indicates that all resources in each resource pool are available for queues by default. + +#. Locate the specified queue in the **Resource Allocation** table, and click **Modify** in the **Operation** column. + +#. In **Modify Resource Allocation**, configure the resource capacity policy of the task queue in the resource pool. + + - **Capacity (%)**: specifies the percentage of the current tenant's computing resource usage. + - **Maximum Capacity (%)**: specifies the percentage of the current tenant's maximum computing resource usage. + +#. Click **OK** to save the settings. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/creating_a_resource_pool.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/creating_a_resource_pool.rst new file mode 100644 index 0000000..c8fa3aa --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/creating_a_resource_pool.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_0544.html + +.. _mrs_01_0544: + +Creating a Resource Pool +======================== + +Scenario +-------- + +In an MRS cluster, users can logically divide Yarn cluster nodes to combine multiple NodeManagers into a Yarn resource pool. Each NodeManager belongs to one resource pool only. The system contains a **Default** resource pool by default. All NodeManagers that are not added to customized resource pools belong to this resource pool. + +You can create a customized resource pool on MRS Manager and add hosts that have not been added to other customized resource pools to it. + +Procedure +--------- + +#. On MRS Manager, click **Tenant**. +#. Click the **Resource Pools** tab. +#. Click **Add Resource Pool**. +#. 
In **Create Resource Pool**, set the properties of the resource pool. + + - **Name**: Enter a name for the resource pool. The name of the newly created resource pool cannot be **Default**. + + The name consists of 1 to 20 characters and can contain digits, letters, and underscores (_) but cannot start with an underscore (_). + + - **Hosts**: In the host list on the left, select the name of a specified host and click |image1| to add the selected host to the resource pool. Only hosts in the cluster can be selected. The host list of a resource pool can be left blank. + +#. Click **OK**. +#. After a resource pool is created, users can view the **Name**, **Members**, **Type**, **vCore** and **Memory** in the resource pool list. Hosts that are added to the customized resource pool are no longer members of the **Default** resource pool. + +.. |image1| image:: /_static/images/en-us_image_0000001349257217.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/creating_a_sub-tenant.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/creating_a_sub-tenant.rst new file mode 100644 index 0000000..90b05ba --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/creating_a_sub-tenant.rst @@ -0,0 +1,68 @@ +:original_name: mrs_01_0540.html + +.. _mrs_01_0540: + +Creating a Sub-tenant +===================== + +Scenario +-------- + +You can create a sub-tenant on MRS Manager if the resources of the current tenant need to be further allocated. + +Prerequisites +------------- + +- A parent tenant has been added. +- A tenant name has been planned. The name must not be the same as that of a role or Yarn queue that exists in the current cluster. +- If a sub-tenant requires storage resources, a storage directory has been planned based on service requirements, and the planned directory does not exist under the storage directory of the parent tenant. +- The resources that can be allocated to the current tenant have been planned and the sum of the resource percentages of direct sub-tenants under the parent tenant at every level does not exceed 100%. + +Procedure +--------- + +#. On MRS Manager, click **Tenant**. + +#. In the tenant list on the left, move the cursor to the tenant node to which a sub-tenant is to be added. Click **Create sub-tenant**. On the displayed page, configure the sub-tenant attributes according to the following table: + + .. 
table:: **Table 1** Sub-tenant parameters + + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=========================================+============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Parent tenant | Specifies the name of the parent tenant. | + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | Specifies the name of the current tenant. The value consists of 3 to 20 characters, and can contain letters, digits, and underscores (_). | + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant Type | The options include **Leaf** and **Non-leaf**. If **Leaf** is selected, the current tenant is a leaf tenant and no sub-tenant can be added. If **Non-leaf** is selected, sub-tenants can be added to the current tenant. 
| + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Dynamic Resources | Specifies the dynamic computing resources for the current tenant. The system automatically creates a task queue named after the sub-tenant name in the Yarn parent queue. When dynamic resources are not **Yarn**, the system does not automatically create a task queue. If the parent tenant does not have dynamic resources, the sub-tenant cannot use dynamic resources. | + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Capacity (%) | Specifies the percentage of the resources used by the current tenant. The base value is the total resources of the parent tenant. | + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Max. Capacity (%) | Specifies the maximum percentage of the computing resources used by the current tenant. The base value is the total resources of the parent tenant. | + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Resource | Specifies storage resources for the current tenant. The system automatically creates a file in the HDFS parent tenant directory. The file is named the same as the name of the sub-tenant. 
If storage resources are not **HDFS**, the system does not create a storage directory under the root directory of HDFS. If the parent tenant does not have storage resources, the sub-tenant cannot use storage resources. | + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Space Quota (MB) | Specifies the quota for HDFS storage space used by the current tenant. The minimum value is 1, and the maximum value is the total storage quota of the parent tenant. The unit is MB. This parameter indicates the maximum HDFS storage space that can be used by a tenant, but does not indicate the actual space used. If the value is greater than the size of the HDFS physical disk, the maximum space available is the full space of the oHDFS physical disk. If the quota is greater than the quota of the parent tenant, the actual storage capacity is subject to the quota of the parent tenant. | + | | | + | | .. note:: | + | | | + | | To ensure data reliability, one copy of a file is automatically generated when the file is stored in HDFS. That is, two copies of the same file are stored by default. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if the value is set to **500**, the actual space for storing files is about 250 MB (500/2 = 250). | + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Path | Specifies the tenant's HDFS storage directory. The system automatically creates a file folder named after the sub-tenant name in the directory of the parent tenant by default. For example, if the sub-tenant is **ta1s** and the parent directory is **tenant/ta1**, the system sets this parameter for the sub-tenant to **tenant/ta1/ta1s**. The storage path is customizable in the parent directory. The parent directory for the storage path must be the storage directory of the parent tenant. 
| + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Specifies other service resources associated with the current tenant. HBase is supported. To configure this parameter, click **Associate Services**. In the dialog box that is displayed, set **Service** to **HBase**. If **Association Mode** is set to **Exclusive**, service resources are occupied exclusively. If **share** is selected, service resources are shared. | + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Specifies the description of the current tenant. | + +-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK** to save the settings. + + It takes a few minutes to save the settings. If the **Tenant created successfully** is displayed in the upper-right corner, the tenant is added successfully. The tenant is created successfully. + + .. note:: + + - Roles, computing resources, and storage resources are automatically created when tenants are created. + - The new role has permissions on the computing and storage resources. The role and its permissions are controlled by the system automatically and cannot be controlled manually under **Manage Role**. + - When using this tenant, create a system user and assign the user a related tenant role. For details, see :ref:`Creating a User `. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/creating_a_tenant.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/creating_a_tenant.rst new file mode 100644 index 0000000..1105582 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/creating_a_tenant.rst @@ -0,0 +1,80 @@ +:original_name: mrs_01_0539.html + +.. 
_mrs_01_0539: + +Creating a Tenant +================= + +Scenario +-------- + +You can create a tenant on MRS Manager to specify the resource usage. + +Prerequisites +------------- + +- A tenant name has been planned. The name must not be the same as that of a role or Yarn queue that exists in the current cluster. +- If a tenant requires storage resources, a storage directory has been planned based on service requirements, and the planned directory does not exist under the HDFS directory. +- The resources that can be allocated to the current tenant have been planned and the sum of the resource percentages of direct sub-tenants under the parent tenant at every level does not exceed 100%. + +Procedure +--------- + +#. On MRS Manager, click **Tenant**. + +#. Click **Create Tenant**. On the page that is displayed, configure tenant properties. + + .. table:: **Table 1** Tenant parameters + + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=========================================+================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Name | Specifies the name of the current tenant. The value consists of 3 to 20 characters, and can contain letters, digits, and underscores (_). | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tenant Type | The options include **Leaf** and **Non-leaf**. If **Leaf** is selected, the current tenant is a leaf tenant and no sub-tenant can be added. If **Non-leaf** is selected, sub-tenants can be added to the current tenant. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Dynamic Resources | Specifies the dynamic computing resources for the current tenant. The system automatically creates a task queue named after the tenant name in Yarn. When dynamic resources are not **Yarn**, the system does not automatically create a task queue. 
| + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Capacity (%) | Specifies the percentage of the computing resources used by the current tenant in the **default** resource pool. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Default Resource Pool Max. Capacity (%) | Specifies the maximum percentage of the computing resources used by the current tenant in the **default** resource pool. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Storage Resource | Specifies storage resources for the current tenant. The system automatically creates a file folder named after the tenant name in the **/tenant** directory. When a tenant is created for the first time, the system automatically creates the **/tenant** directory in the HDFS root directory. If storage resources are not **HDFS**, the system does not create a storage directory under the root directory of HDFS. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Space Quota (MB) | Specifies the quota for HDFS storage space used by the current tenant. The value ranges from **1** to **8796093022208**. The unit is MB. This parameter indicates the maximum HDFS storage space that can be used by a tenant, but does not indicate the actual space used. If the value is greater than the size of the HDFS physical disk, the maximum space available is the full space of the oHDFS physical disk. | + | | | + | | .. note:: | + | | | + | | To ensure data reliability, one copy of a file is automatically generated when the file is stored in HDFS. That is, two copies of the same file are stored by default. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if the value of **Storage Space Quota** is set to **500**, the actual space for storing files is about 250 MB (500/2 = 250). 
| + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Storage Path** | Specifies the tenant's HDFS storage directory. The system automatically creates a file folder named after the tenant name in the **/tenant** directory by default. For example, the default HDFS storage directory for tenant **ta1** is **tenant/ta1**. When a tenant is created for the first time, the system automatically creates the **/tenant** directory in the HDFS root directory. The storage path is customizable. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Specifies other service resources associated with the current tenant. HBase is supported. To configure this parameter, click **Associate Services**. In the dialog box that is displayed, set **Service** to **HBase**. If **Association Mode** is set to **Exclusive**, service resources are occupied exclusively. If **share** is selected, service resources are shared. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Specifies the description of the current tenant. | + +-----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK** to save the settings. + + It takes a few minutes to save the settings. If the **Tenant created successfully** is displayed in the upper-right corner, the tenant is added successfully. + + .. note:: + + - Roles, computing resources, and storage resources are automatically created when tenants are created. + - The new role has permissions on the computing and storage resources. The role and its permissions are controlled by the system automatically and cannot be controlled manually under **Manage Role**. + - If you want to use the tenant, create a system user and assign the Manager_tenant role and the role corresponding to the tenant to the user. For details, see :ref:`Creating a User `. + +Related Tasks +------------- + +Viewing an added tenant + +#. 
On MRS Manager, click **Tenant**. + +#. In the tenant list on the left, click the name of the added tenant. + + The **Summary** tab is displayed on the right by default. + +#. View **Basic Information**, **Resource Quota**, and **Statistics** of the tenant. + + If HDFS is in the **Stopped** state, **Available** and **Used** of **Space** in **Resource Quota** are **unknown.** diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/deleting_a_resource_pool.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/deleting_a_resource_pool.rst new file mode 100644 index 0000000..1e3831f --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/deleting_a_resource_pool.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_0546.html + +.. _mrs_01_0546: + +Deleting a Resource Pool +======================== + +Scenario +-------- + +You can delete an existing resource pool on MRS Manager. + +Prerequisites +------------- + +- Any queue in a cluster cannot use the resource pool to be deleted as the default resource pool. Before deleting the resource pool, cancel the default resource pool. For details, see :ref:`Configuring a Queue `. +- Resource distribution policies of all queues have been cleared from the resource pool being deleted. For details, see :ref:`Clearing Configuration of a Queue `. + +Procedure +--------- + +#. On MRS Manager, click **Tenant**. + +#. Click the **Resource Pools** tab. + +#. Locate the row that contains the specified resource pool, and click **Delete** in the **Operation** column. + + In the displayed dialog box, click **OK**. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/deleting_a_tenant.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/deleting_a_tenant.rst new file mode 100644 index 0000000..f3d11a0 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/deleting_a_tenant.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_0541.html + +.. _mrs_01_0541: + +Deleting a tenant +================= + +Scenario +-------- + +You can delete a tenant that is not required on MRS Manager. + +Prerequisites +------------- + +- A tenant has been added. +- You have checked whether the tenant to be deleted has sub-tenants. If the tenant has sub-tenants, delete them; otherwise, you cannot delete the tenant. +- The role of the tenant to be deleted cannot be associated with any user or user group. For details about how to cancel the binding between a role and a user, see :ref:`Modifying User Information `. + +Procedure +--------- + +#. On MRS Manager, click **Tenant**. + +#. In the tenant list on the left, move the cursor to the tenant node to be deleted and click **Delete**. + + The **Delete Tenant** dialog box is displayed. If you want to save the tenant data, select **Reserve the data of this tenant**. Otherwise, the tenant's storage space will be deleted. + +#. Click OK to save the settings. + + It takes a few minutes to save the configuration. After the tenant is deleted successfully, the role and storage space of the tenant are also deleted. + + .. note:: + + - After the tenant is deleted, the task queue of the tenant still exists in Yarn. 
+ - If you choose not to reserve data when deleting the parent tenant, data of sub-tenants is also deleted if the sub-tenants use storage resources. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/index.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/index.rst new file mode 100644 index 0000000..6269eef --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/index.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_0537.html + +.. _mrs_01_0537: + +Tenant Management +================= + +- :ref:`Overview ` +- :ref:`Creating a Tenant ` +- :ref:`Creating a Sub-tenant ` +- :ref:`Deleting a tenant ` +- :ref:`Managing a Tenant Directory ` +- :ref:`Restoring Tenant Data ` +- :ref:`Creating a Resource Pool ` +- :ref:`Modifying a Resource Pool ` +- :ref:`Deleting a Resource Pool ` +- :ref:`Configuring a Queue ` +- :ref:`Configuring the Queue Capacity Policy of a Resource Pool ` +- :ref:`Clearing Configuration of a Queue ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview + creating_a_tenant + creating_a_sub-tenant + deleting_a_tenant + managing_a_tenant_directory + restoring_tenant_data + creating_a_resource_pool + modifying_a_resource_pool + deleting_a_resource_pool + configuring_a_queue + configuring_the_queue_capacity_policy_of_a_resource_pool + clearing_configuration_of_a_queue diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/managing_a_tenant_directory.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/managing_a_tenant_directory.rst new file mode 100644 index 0000000..dac6cdc --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/managing_a_tenant_directory.rst @@ -0,0 +1,98 @@ +:original_name: mrs_01_0542.html + +.. _mrs_01_0542: + +Managing a Tenant Directory +=========================== + +Scenario +-------- + +You can manage the HDFS storage directory used by a specific tenant on MRS Manager. The management operations include adding a tenant directory, modifying the directory file quota, modifying the storage space, and deleting a directory. + +Prerequisites +------------- + +A tenant associated with HDFS storage resources has been added. + +Procedure +--------- + +- Viewing a tenant directory + + #. On MRS Manager, click **Tenant**. + #. In the tenant list on the left, click the target tenant. + #. Click the **Resource** tab. + #. View the **HDFS Storage** table. + + - The Quota column indicates the quantity quotas of files and directories. + - The **Storage Space Quota** column indicates the storage space size of the tenant directory. + +- Adding a tenant directory + + #. On MRS Manager, click **Tenant**. + #. In the tenant list on the left, click the tenant whose HDFS storage directory needs to be added. + #. Click the **Resource** tab. + #. In the **HDFS Storage** table, click **Create Directory**. + + - In **Parent Directory**, select a storage directory of a parent tenant. + + This parameter applies only to sub-tenants. If the parent tenant has multiple directories, select any of them. + + - Set **Path** to a tenant directory path. + + .. note:: + + - If the current tenant is not a sub-tenant, the new path is created in the HDFS root directory. + - If the current tenant is a sub-tenant, the new path is created in the specified directory. 
+ + A complete HDFS storage directory can contain a maximum of 1,023 characters. An HDFS directory name contains digits, letters, spaces, and underscores (_). The name cannot start or end with a space. + + - Set **Quota** to the quotas of file and directory quantity. + + **Maximum Number of Files/Directories** is optional. Its value ranges from **1** to **9223372036854775806**. + + - Set **Storage Space Quota** to the storage space size of the tenant directory. + + The value of **Storage Space Quota** ranges from **1** to **8796093022208**. + + .. note:: + + To ensure data reliability, one copy of a file is automatically generated when the file is stored in HDFS. That is, two copies of the same file are stored by default. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if the value of **Storage Space Quota** is set to **500**, the actual space for storing files is about 250 MB (500/2 = 250). + + #. Click **OK**. The system creates tenant directories in the HDFS root directory. + +- Modify a tenant directory. + + #. On MRS Manager, click **Tenant**. + #. In the tenant list on the left, click the tenant whose HDFS storage directory needs to be modified. + #. Click the **Resource** tab. + #. In the **HDFS Storage** table, click **Modify** in the **Operation** column of the specified tenant directory. + + - Set **Quota** to the quotas of file and directory quantity. + + **Maximum Number of Files/Directories** is optional. Its value ranges from **1** to **9223372036854775806**. + + - Set **Storage Space Quota** to the storage space size of the tenant directory. + + The value of **Storage Space Quota** ranges from **1** to **8796093022208**. + + .. note:: + + To ensure data reliability, one copy of a file is automatically generated when the file is stored in HDFS. That is, two copies of the same file are stored by default. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if the value of **Storage Space Quota** is set to **500**, the actual space for storing files is about 250 MB (500/2 = 250). + + #. Click **OK**. + +- Delete a tenant directory. + + #. On MRS Manager, click **Tenant**. + + #. In the tenant list on the left, click the tenant whose HDFS storage directory needs to be deleted. + + #. Click the **Resource** tab. + + #. In the **HDFS Storage** table, click **Delete** in the **Operation** column of the specified tenant directory. + + The default HDFS storage directory set during tenant creation cannot be deleted. Only the newly added HDFS storage directory can be deleted. + + #. Click **OK**. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/modifying_a_resource_pool.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/modifying_a_resource_pool.rst new file mode 100644 index 0000000..27ed0f2 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/modifying_a_resource_pool.rst @@ -0,0 +1,27 @@ +:original_name: mrs_01_0545.html + +.. _mrs_01_0545: + +Modifying a Resource Pool +========================= + +Scenario +-------- + +You can modify members of an existing resource pool on MRS Manager. + +Procedure +--------- + +#. On MRS Manager, click **Tenant**. +#. Click the **Resource Pools** tab. +#. Locate the row that contains the specified resource pool, and click **Modify** in the **Operation** column. +#. 
In **Modify Resource Pool**, modify **Added Hosts**. + + - Adding a host: Select the name of a specified host in the host list on the left and click |image1| to add the selected host to the resource pool. + - Deleting a host: In the host list on the right, select the name of a specified host and click |image2| to remove the selected host from the resource pool. The host list of a resource pool can be left blank. + +#. Click **OK**. + +.. |image1| image:: /_static/images/en-us_image_0000001349257217.png +.. |image2| image:: /_static/images/en-us_image_0000001295898072.png diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/overview.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/overview.rst new file mode 100644 index 0000000..7643156 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/overview.rst @@ -0,0 +1,37 @@ +:original_name: mrs_01_0538.html + +.. _mrs_01_0538: + +Overview +======== + +Definition +---------- + +An MRS cluster provides various resources and services for multiple organizations, departments, or applications to share. The cluster provides tenants as a logical entity to use these resources and services. A mode involving different tenants is called multi-tenant mode. Currently, only the analysis cluster supports tenant management. + +Principles +---------- + +The MRS cluster provides the multi-tenant function. It supports a layered tenant model and allows dynamic adding or deleting of tenants to isolate resources. It dynamically manages and configures tenants' computing and storage resources. + +The computing resources indicate tenants' Yarn task queue resources. The task queue quota can be modified, and the task queue usage status and statistics can be viewed. + +The storage resources can be stored on HDFS. You can add and delete the HDFS storage directories of tenants, and set the quotas of file quantity and the storage space of the directories. + +As the unified tenant management platform of MRS clusters, MRS Manager provides enterprises with time-tested multi-tenant management models, enabling centralized tenant and service management. You can create and manage tenants in a cluster based on service requirements. + +- Roles, computing resources, and storage resources are automatically created when tenants are created. By default, all permissions of the new computing resources and storage resources are allocated to a tenant's roles. +- Permissions to view the current tenant's resources, add a subtenant, and manage the subtenant's resources are granted to the tenant's roles by default. +- After you have modified the tenant's computing or storage resources, permissions of the tenant's roles are automatically updated. + +MRS Manager supports a maximum of 512 tenants. The tenants created by default in the system include **default**. Tenants that are in the topmost layer with the default tenant are called level-1 tenants. + +Resource Pools +-------------- + +Yarn task queues support only the label-based scheduling policy. This policy enables Yarn task queues to associate NodeManagers that have specific node labels. In this way, Yarn tasks run on specified nodes so that tasks are scheduled and certain hardware resources are utilized. For example, Yarn tasks requiring a large memory capacity can run on nodes with a large memory capacity by means of label association, preventing poor service performance.
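Under the hood, an MRS resource pool corresponds to an open-source YARN node label. The following sketch is for illustrating the concept only and uses placeholder names (**pool_a** and the host name are not real values); in MRS, resource pools should be created and modified on MRS Manager as described in this chapter, not with these commands.

.. code-block:: bash

   # Illustration only: how a resource pool maps to YARN node labels.
   # "pool_a" and the host name below are placeholder values.

   # Register a node label with the ResourceManager.
   yarn rmadmin -addToClusterNodeLabels "pool_a"

   # Attach the label to a NodeManager host (45454 is the usual NodeManager port).
   yarn rmadmin -replaceLabelsOnNode "node-ana-core1:45454=pool_a"

   # List the node labels known to the cluster.
   yarn cluster --list-node-labels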
In an MRS cluster, the tenant logically divides Yarn cluster nodes to combine multiple NodeManagers into a resource pool. Yarn task queues can be associated with specified resource pools by configuring queue capacity policies, ensuring efficient and independent resource utilization in the resource pools. + +MRS Manager supports a maximum of 50 resource pools. The system has a **Default** resource pool. diff --git a/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/restoring_tenant_data.rst b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/restoring_tenant_data.rst new file mode 100644 index 0000000..be710a7 --- /dev/null +++ b/umn/source/mrs_manager_operation_guide_applicable_to_2.x_and_earlier_versions/tenant_management/restoring_tenant_data.rst @@ -0,0 +1,31 @@ +:original_name: mrs_01_0543.html + +.. _mrs_01_0543: + +Restoring Tenant Data +===================== + +Scenario +-------- + +Tenant data is stored on Manager and in cluster components by default. When components recover from faults or are reinstalled, some tenant configuration data may be abnormal. In this case, you can manually restore the tenant data. + +Procedure +--------- + +#. On MRS Manager, click **Tenant**. + +#. In the tenant list on the left, click a tenant node. + +#. Check the status of the tenant data. + + a. In **Summary**, check the color of the circle on the left of **Basic Information**. Green indicates that the tenant is available and gray indicates that the tenant is unavailable. + b. Click **Resources** and check the circle on the left of **Yarn** or **HDFS Storage**. Green indicates that the resource is available, and gray indicates that the resource is unavailable. + c. Click **Service Association** and check the **Status** column of the associated service table. **Good** indicates that the component can provide services for the associated tenant. **Bad** indicates that the component cannot provide services for the tenant. + d. If any check result is abnormal, go to :ref:`4 ` to restore tenant data. + +#. .. _mrs_01_0543__en-us_topic_0035271545_li10849798195335: + + Click **Restore Tenant Data**. + +#. In the **Restore Tenant Data** window, select one or more components whose data needs to be restored. Click **OK**. The system automatically restores the tenant data. diff --git a/umn/source/mrs_quick_start/creating_a_cluster.rst b/umn/source/mrs_quick_start/creating_a_cluster.rst new file mode 100644 index 0000000..bf0fb98 --- /dev/null +++ b/umn/source/mrs_quick_start/creating_a_cluster.rst @@ -0,0 +1,66 @@ +:original_name: mrs_01_0027.html + +.. _mrs_01_0027: + +Creating a Cluster +================== + +The first step of using MRS is to create a cluster. This section describes how to create a cluster on the MRS management console. + +Procedure +--------- + +#. Log in to the MRS console. + +#. Click **Create Cluster**. The **Create Cluster** page is displayed. + + .. note:: + + When creating a cluster, pay attention to the quota notification. If a resource quota is insufficient, increase the resource quota as prompted and create a cluster. + +#. On the page for creating a cluster, click the **Custom Config** tab. + +#. Configure cluster software information. + + - **Region**: Use the default value.
+ - **Cluster Name**: You can use the default name. However, you are advised to include a project name abbreviation or date for consolidated memory and easy distinguishing, for example, **mrs_20180321**. + - **Cluster Version**: Select the latest version, which is the default value. + - **Cluster Type**: Use the default **Analysis Cluster**. + - **Component Port**: Use the default **Open source**. + - **Component**: Select components such as Spark2x, HBase, and Hive for the analysis cluster. For a streaming cluster, select components such as Kafka and Storm. For a hybrid cluster, you can select the components of the analysis cluster and streaming cluster based on service requirements. + +#. Click **Next**. + + - **AZ**: Use the default value. + - **VPC**: Use the default value. If there is no available VPC, click **View VPC** to access the VPC console and create a new VPC. + - **Subnet**: Use the default value. + - **Security Group**: Select **Auto create**. + - **EIP**: Select **Bind later**. + - **Enterprise Project**: Use the default value. + - **Instance Specifications**: Select General Computing S3 -> 8 vCPUs \| 16 GB (s3.2xlarge.2) for both Master and Core nodes. + - **System Disk**: Select **Common I/O** and retain the default settings. + - **Data Disk**: Select **Common I/O** and retain the default settings. + - **Instance Count**: The default number of Master nodes is 2, and that of Core nodes is 3. + +#. Click **Next**. The **Set Advanced Options** tab page is displayed. Configure the following parameters. Retain the default settings for the other parameters. + + - Kerberos authentication: + + - **Kerberos Authentication**: Disable Kerberos authentication. + - **Username**: name of the Manager administrator. **admin** is used by default. + - **Password**: password of the Manager administrator. + + - **Login Mode**: Select a mode for logging in to an ECS. + + - **Password**: Set a password for logging in to an ECS. + - **Key Pair**: Select a key pair from the drop-down list. Select **"I acknowledge that I have obtained private key file** *SSHkey-xxx* **and that without this file I will not be able to log in to my ECS.**" If you have never created a key pair, click **View Key Pair** to create or import a key pair. And then, obtain a private key file. + + - **Secure Communications**: Select **Enable**. + +#. Click **Apply Now**. + + If Kerberos authentication is enabled for a cluster, check whether Kerberos authentication is required. If yes, click **Continue**. If no, click **Back** to disable Kerberos authentication and then create a cluster. + +#. Click **Back to Cluster List** to view the cluster status. + + It takes some time to create a cluster. The initial status of the cluster is **Starting**. After the cluster has been created successfully, the cluster status becomes **Running**. diff --git a/umn/source/mrs_quick_start/creating_a_job.rst b/umn/source/mrs_quick_start/creating_a_job.rst new file mode 100644 index 0000000..567bbfb --- /dev/null +++ b/umn/source/mrs_quick_start/creating_a_job.rst @@ -0,0 +1,212 @@ +:original_name: mrs_01_0029.html + +.. _mrs_01_0029: + +Creating a Job +============== + +You can submit programs developed by yourself to MRS to execute them, and obtain the results. + +This section describes how to submit a job (take a MapReduce job as an example) on the MRS management console. MapReduce jobs are used to submit JAR programs to quickly process massive amounts of data in parallel and create a distributed data processing and execution environment. 
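Before packaging your own program, you can optionally confirm that MapReduce jobs run in the cluster by reusing the Hadoop examples JAR shipped with the cluster client. The following is a sketch only: the client installation path and the JAR file name depend on the MRS version (see the note in "Submitting a Job in the Background" later in this section), and the **pi** example needs no input data.

.. code-block:: bash

   # Locate the examples JAR delivered with the cluster client (path varies by version).
   source /opt/Bigdata/client/bigdata_env
   find /opt/Bigdata/client -name "hadoop-mapreduce-examples-*.jar" 2>/dev/null

   # Run the bundled "pi" example as a smoke test: 2 map tasks, 10 samples each.
   yarn jar /path/to/hadoop-mapreduce-examples-*.jar pi 2 10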
+ +If the job and file management functions are not supported on the cluster details page, submit the jobs in the background. + +Before creating a job, you need to upload local data to OBS for data computing and analyzing. MRS allows exporting data from OBS to HDFS for computing and analyzing. After the analyzing and computing are complete, you can store the data in HDFS or export them to OBS. HDFS and OBS can also store the compressed data in the format of **bz2** or **gz**. + +Submitting a Job on the GUI +--------------------------- + +#. Log in to the MRS console. + +#. Choose **Clusters > Active Clusters**, select a running cluster, and click its name to switch to the cluster details page. + +#. If Kerberos authentication is enabled for the cluster, perform the following steps. If Kerberos authentication is not enabled for the cluster, skip this step. + + In the **Basic Information** area on the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users. For details, see :ref:`Synchronizing IAM Users to MRS `. + + .. note:: + + - In MRS 1.7.2 or earlier, the job management function is unavailable in a cluster with Kerberos authentication enabled. You need to submit a job in the background. + - When the policy of the user group to which the IAM user belongs changes from MRS ReadOnlyAccess to MRS CommonOperations, MRS FullAccess, or MRS Administrator, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** (System Security Services Daemon) cache of cluster nodes needs time to be updated. Then, submit a job. Otherwise, the job may fail to be submitted. + - When the policy of the user group to which the IAM user belongs changes from MRS CommonOperations, MRS FullAccess, or MRS Administrator to MRS ReadOnlyAccess, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** cache of cluster nodes needs time to be updated. + +#. Click the **Jobs** tab. + +#. Click **Create**. The **Create Job** page is displayed. + + .. note:: + + If the IAM username contains spaces (for example, **admin 01**), a job cannot be created. + +#. In **Type**, select **MapReduce**. Configure other job information. + + - Configure MapReduce job information by referring to :ref:`Table 1 `\ if the cluster version is MRS 1.9.2 or later. + - Configure MapReduce job information by referring to :ref:`Table 3 ` if the cluster version is earlier than MRS 1.9.2. + + .. _mrs_01_0029__en-us_topic_0264268721_table2037463920278: + + .. table:: **Table 1** Job configuration information + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===========================================================================================================================================================================================================================================================================+ + | Name | Job name. It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. 
| + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Path | Path of the program package to be executed. The following requirements must be met: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **obs://**. Example: **obs://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | - For SparkScript and HiveScript, the path must end with **.sql**. For MapReduce, the path must end with **.jar**. For Flink and SparkSubmit, the path must end with **.jar** or **.py**. The **.sql**, **.jar**, and **.py** are case-insensitive. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameters | (Optional) It is the key parameter for program execution. Multiple parameters are separated by space. | + | | | + | | Configuration method: *Program class name* *Data input path* *Data output path* | + | | | + | | - Program class name: It is specified by a function in your program. MRS is responsible for transferring parameters only. | + | | | + | | - Data input path: Click **HDFS** or **OBS** to select a path or manually enter a correct path. | + | | | + | | - Data output path: Enter a directory that does not exist. | + | | | + | | The parameter contains a maximum of 150,000 characters. It cannot contain special characters ``;|&><'$,`` but can be left blank. | + | | | + | | .. caution:: | + | | | + | | CAUTION: | + | | If you enter a parameter with sensitive information (such as the login password), the parameter may be exposed in the job details display and log printing. Exercise caution when performing this operation. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Service Parameter | (Optional) It is used to modify service parameters for the job. The parameter modification applies only to the current job. To make the modification take effect permanently for the cluster, follow instructions in :ref:`Configuring Service Parameters `. | + | | | + | | To add multiple parameters, click |image1| on the right. To delete a parameter, click **Delete** on the right. | + | | | + | | :ref:`Table 2 ` lists the common service configuration parameters. 
| + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Command Reference | Command submitted to the background for execution when a job is submitted. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. _mrs_01_0029__en-us_topic_0264268721_table12538926589: + + .. table:: **Table 2** **Service Parameter** parameters + + +-------------------+----------------------------------------------------+---------------+ + | Parameter | Description | Example Value | + +===================+====================================================+===============+ + | fs.obs.access.key | Key ID for accessing OBS. | ``-`` | + +-------------------+----------------------------------------------------+---------------+ + | fs.obs.secret.key | Key corresponding to the key ID for accessing OBS. | ``-`` | + +-------------------+----------------------------------------------------+---------------+ + + .. _mrs_01_0029__en-us_topic_0264268721_table13750103511814: + + .. table:: **Table 3** Job configuration information + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=============================================================================================================================================================================================================================================================================================================================================================================+ + | Name | Job name. It contains 1 to 64 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. | + | | | + | | .. note:: | + | | | + | | You are advised to set different names for different jobs. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Program Path | Path of the program package to be executed. The following requirements must be met: | + | | | + | | - Contains a maximum of 1,023 characters, excluding special characters such as ``;|&><'$.`` The parameter value cannot be empty or full of spaces. | + | | - The path of the program to be executed can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. Example: **s3a://wordcount/program/xxx.jar** | + | | - HDFS: The path must start with **/user**. 
For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | - For SparkScript, the path must end with **.sql**. For MapReduce and Spark, the path must end with **.jar**. The **.sql** and **.jar** are case-insensitive. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameters | Key parameter for program execution. The parameter is specified by the function of the user's program. MRS is only responsible for loading the parameter. Multiple parameters are separated by space. | + | | | + | | Configuration method: *Package name*.\ *Class name* | + | | | + | | The parameter contains a maximum of 150,000 characters. It cannot contain special characters ``;|&><'$,`` but can be left blank. | + | | | + | | .. note:: | + | | | + | | When entering a parameter containing sensitive information (for example, login password), you can add an at sign (@) before the parameter name to encrypt the parameter value. This prevents the sensitive information from being persisted in plaintext. When you view job information on the MRS management console, the sensitive information is displayed as **\***. | + | | | + | | Example: **username=admin @password=admin_123** | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Import From | Path for inputting data | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. For details about how to import data to HDFS, see :ref:`Importing Data `. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Export To | Path for outputting data | + | | | + | | .. note:: | + | | | + | | - When setting this parameter, select **OBS** or **HDFS**. Select a file directory or manually enter a file directory, and click **OK**. | + | | - If you add the **hadoop-mapreduce-examples-x.x.x.jar** sample program or a program similar to **hadoop-mapreduce-examples-x.x.x.jar**, enter a directory that does not exist. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. (Supported only in MRS 1.8.10 and earlier versions) | + | | - HDFS: The path must start with **/user**. 
| + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Path | Path for storing job logs that record job running status. | + | | | + | | Data can be stored in HDFS or OBS. The path varies depending on the file system. | + | | | + | | - OBS: The path must start with **s3a://**. | + | | - HDFS: The path must start with **/user**. | + | | | + | | The parameter contains a maximum of 1,023 characters, excluding special characters such as ``;|&>,<'$,`` and can be left blank. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Confirm job configuration information and click **OK**. + + After the job is created, you can manage it. + +Submitting a Job in the Background +---------------------------------- + +In MRS 3.x and later versions, the default installation path of the client is /opt/Bigdata/client. In MRS 3.x and earlier versions, the default installation path is /opt/client. For details, see the actual situation. + +#. Log in to a Master node. For details, see :ref:`Logging In to an ECS `. + +#. Run the following command to initialize environment variables: + + **source /opt/Bigdata/client/bigdata_env** + +#. If the Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If the Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** **MRS cluster user** + + Example: **kinit admin** + +#. Run the following command to copy the program in the OBS file system to the Master node in the cluster: + + **hadoop fs -Dfs.obs.access.key=AK -Dfs.obs.secret.key=SK -copyToLocal source_path.jar target_path.jar** + + Example: **hadoop fs -Dfs.obs.access.key=XXXX -Dfs.obs.secret.key=XXXX -copyToLocal "obs://mrs-word/program/hadoop-mapreduce-examples-XXX.jar" "/home/omm/hadoop-mapreduce-examples-XXX.jar"** + + You can log in to OBS Console using AK/SK. To obtain AK/SK information, click the username in the upper right corner of the management console and choose **My Credentials** > **Access Keys**. + +#. Run the following command to submit a wordcount job. If data needs to be read from OBS or outputted to OBS, the AK/SK parameters need to be added. + + **source /opt/Bigdata/client/bigdata_env;hadoop jar execute_jar wordcount input_path output_path** + + Example: **source /opt/Bigdata/client/bigdata_env;hadoop jar /home/omm/hadoop-mapreduce-examples-XXX.jar wordcount -Dfs.obs.access.key=XXXX -Dfs.obs.secret.key=XXXX "obs://mrs-word/input/*" "obs://mrs-word/output/"** + + In the preceding command, **input_path** indicates a path for storing job input files on OBS. 
**output_path** indicates a path for storing job output files on OBS and needs to be set to a directory that does not exist + +.. |image1| image:: /_static/images/en-us_image_0000001349137577.png diff --git a/umn/source/mrs_quick_start/how_to_use_mrs.rst b/umn/source/mrs_quick_start/how_to_use_mrs.rst new file mode 100644 index 0000000..081cf35 --- /dev/null +++ b/umn/source/mrs_quick_start/how_to_use_mrs.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_0511.html + +.. _mrs_01_0511: + +How to Use MRS +============== + +MapReduce Service (MRS) is a cloud service that is used to deploy and manage the Hadoop system and enables one-click Hadoop cluster deployment. MRS provides enterprise-level big data clusters on the cloud. Tenants can fully control the clusters and easily run big data components such as Hadoop, Spark, HBase, Kafka, and Storm in the clusters. + +MRS is easy to use. You can execute various tasks and process or store PB-level data using computers connected in a cluster. The procedure of using MRS is as follows: + +#. Upload local programs and data files to OBS. +#. Create a cluster by following instructions in :ref:`Creating a Custom Cluster `. You can choose a cluster type for offline data analysis or stream processing or both, and set ECS instance specifications, instance count, data disk type (common I/O, high I/O, and ultra-high I/O), and components to be installed such as Hadoop, Spark, HBase, Hive, Kafka, and Storm in a cluster. You can use a :ref:`bootstrap action ` to execute a script on a specified node before or after the cluster is started to install additional third-party software, modify the cluster running environment, and perform other customizations. +#. :ref:`Manage jobs `. MRS provides a platform for executing programs you develop. You can submit, execute, and monitor such programs on MRS. +#. :ref:`Manage clusters `. MRS provides you with MRS Manager, an enterprise-level unified management platform of big data clusters, helping you quickly know health status of services and hosts. Through graphical metric monitoring and customization, you can obtain critical system information in a timely manner. In addition, you can modify service attribute configurations based on service performance requirements, and start or stop clusters, services, and role instances in one click. +#. :ref:`Terminate a cluster `. You can terminate an MRS cluster that is no longer use after job execution is complete. diff --git a/umn/source/mrs_quick_start/index.rst b/umn/source/mrs_quick_start/index.rst new file mode 100644 index 0000000..70ca323 --- /dev/null +++ b/umn/source/mrs_quick_start/index.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0024.html + +.. _mrs_01_0024: + +MRS Quick Start +=============== + +- :ref:`How to Use MRS ` +- :ref:`Creating a Cluster ` +- :ref:`Uploading Data and Programs ` +- :ref:`Creating a Job ` +- :ref:`Using Clusters with Kerberos Authentication Enabled ` +- :ref:`Terminating a Cluster ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + how_to_use_mrs + creating_a_cluster + uploading_data_and_programs + creating_a_job + using_clusters_with_kerberos_authentication_enabled + terminating_a_cluster diff --git a/umn/source/mrs_quick_start/terminating_a_cluster.rst b/umn/source/mrs_quick_start/terminating_a_cluster.rst new file mode 100644 index 0000000..63829ac --- /dev/null +++ b/umn/source/mrs_quick_start/terminating_a_cluster.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_0469.html + +.. 
_mrs_01_0469: + +Terminating a Cluster +===================== + +You can terminate an MRS cluster that is no longer needed after job execution is complete. + +Background +---------- + +You can manually terminate a cluster after data analysis is complete or when the cluster encounters an exception. A cluster that fails to be deployed is automatically terminated. + +Procedure +--------- + +#. Log in to the MRS console. + +#. In the navigation tree of the MRS console, choose **Clusters** > **Active Clusters**. + +#. Locate the cluster to be terminated, and click **Terminate** in the **Operation** column. + + The cluster status changes from **Running** to **Terminating**, and finally to **Terminated**. You can view the clusters in **Terminated** state in **Cluster History**. diff --git a/umn/source/mrs_quick_start/uploading_data_and_programs.rst b/umn/source/mrs_quick_start/uploading_data_and_programs.rst new file mode 100644 index 0000000..230d34a --- /dev/null +++ b/umn/source/mrs_quick_start/uploading_data_and_programs.rst @@ -0,0 +1,105 @@ +:original_name: mrs_01_0028.html + +.. _mrs_01_0028: + +Uploading Data and Programs +=========================== + +Through the **Files** tab page, you can create, delete, import, and export files in the analysis cluster. + +Background +---------- + +MRS clusters process data from OBS or HDFS. OBS provides customers with massive, secure, reliable, and cost-effective data storage capabilities. MRS can directly process data in OBS. You can browse, manage, and use data on the web page of the management console and OBS Client. + +Importing Data +-------------- + +Currently, MRS can only import data from OBS to HDFS. The file upload rate decreases as the file size increases. This mode applies to scenarios where the data volume is small. + +You can perform the following steps to import files and directories: + +#. Log in to the MRS console. + +#. Choose **Clusters > Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's information. + +#. Click the **Files** tab to go to the file management page. + +#. Select **HDFS File List**. + +#. Go to the data storage directory, for example, **bd_app1**. + + The **bd_app1** directory is only an example. You can use any directory on the page or create a new one. + + The requirements for creating a folder are as follows: + + - The folder name contains a maximum of 255 characters. The full path cannot exceed 1,023 characters. + - The folder name cannot be empty. + - The folder name cannot contain the following special characters: :literal:`/:*?"<>|\\;&,'`!{}[]$%+` + - The value cannot start or end with a period (.). + - The spaces at the beginning and end are ignored. + +#. Click **Import Data** and configure the HDFS and OBS paths correctly. When configuring the OBS or HDFS path, click **Browse**, select a file directory, and click **Yes**. + + - OBS path + + - The path must start with **obs://**. In MRS 1.7.2 or earlier, the value must start with **s3a://**. + - Files or programs encrypted by KMS cannot be imported. + - An empty folder cannot be imported. + - The directory and file name can contain letters, digits, hyphens (-), and underscores (_), but cannot contain the following special characters ``;|&>,<'$*?\`` + - The directory and file name cannot start or end with a space, but can contain spaces between them. + - The OBS full path contains a maximum of 1,023 characters. + + - HDFS path + + - The path starts with **/user** by default.
+ - The directory and file name can contain letters, digits, hyphens (-), and underscores (_), but cannot contain the following special characters: ``;|&>,<'$*?\:`` + - The directory and file name cannot start or end with a space, but can contain spaces between them. + - The HDFS full path contains a maximum of 1,023 characters. + - The HDFS parent directory in **HDFS File List** is displayed in the **HDFS Path** text box by default. + +#. Click **OK**. + + You can view the file upload progress on the **File Operation Records** tab page. MRS processes the data import operation as a DistCp job. You can also check whether the DistCp job is successfully executed on the **Jobs** tab page. + +Exporting Data +-------------- + +After the data analysis and computing are completed, you can store the data in HDFS or export them to OBS. + +You can perform the following steps to export files and directories: + +#. Log in to the MRS console. + +#. Choose **Clusters > Active Clusters** and click the name of the cluster to be queried to enter the page displaying the cluster's basic information. + +#. Click the **Files** tab to go to the file management page. + +#. Select **HDFS File List**. + +#. Go to the data storage directory, for example, **bd_app1**. + +#. Click **Export Data** and configure the OBS and HDFS paths. When configuring the OBS or HDFS path, click **Browse**, select a file directory, and click **Yes**. + + - OBS path + + - The path must start with **obs://**. In MRS 1.7.2 or earlier, the value must start with **s3a://**. + - The directory and file name can contain letters, digits, hyphens (-), and underscores (_), but cannot contain the following special characters ``;|&>,<'$*?\`` + - The directory and file name cannot start or end with a space, but can contain spaces between them. + - The OBS full path contains a maximum of 1,023 characters. + + - HDFS path + + - The path starts with **/user** by default. + - The directory and file name can contain letters, digits, hyphens (-), and underscores (_), but cannot contain the following special characters: ``;|&>,<'$*?\:`` + - The directory and file name cannot start or end with a space, but can contain spaces between them. + - The HDFS full path contains a maximum of 1,023 characters. + - The HDFS parent directory in **HDFS File List** is displayed in the **HDFS Path** text box by default. + + .. note:: + + When a folder is exported to OBS, a label file named **folder name_$folder$** is added to the OBS path. Ensure that the exported folder is not empty. If the exported folder is empty, OBS cannot display the folder and only generates a file named **folder name_$folder$**. + +#. Click **OK**. + + You can view the file upload progress on the **File Operation Records** tab page. MRS processes the data export operation as a DistCp job. You can also check whether the DistCp job is successfully executed on the **Jobs** tab page. diff --git a/umn/source/mrs_quick_start/using_clusters_with_kerberos_authentication_enabled.rst b/umn/source/mrs_quick_start/using_clusters_with_kerberos_authentication_enabled.rst new file mode 100644 index 0000000..4d63e01 --- /dev/null +++ b/umn/source/mrs_quick_start/using_clusters_with_kerberos_authentication_enabled.rst @@ -0,0 +1,211 @@ +:original_name: mrs_09_0003.html + +.. _mrs_09_0003: + +Using Clusters with Kerberos Authentication Enabled +=================================================== + +This section instructs you to use security clusters and run MapReduce, Spark, and Hive programs. 
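On a security cluster, client-side operations must be preceded by Kerberos authentication of an MRS cluster user. As a minimal sketch (assuming the example human-machine user **test** created in **Creating a Role and a User** below, and the default client installation path), authentication from a cluster client looks as follows:

.. code-block:: bash

   # Load the client environment variables, then obtain a Kerberos ticket.
   source /opt/Bigdata/client/bigdata_env
   kinit test     # enter the password of user "test" when prompted
   klist          # verify that a ticket-granting ticket has been issued

Without a valid ticket, client commands such as **hdfs dfs** or **yarn jar** fail with authentication errors on a cluster with Kerberos authentication enabled.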
+ +The Presto component of MRS 3.x does not support Kerberos authentication. + +You can get started by reading the following topics: + +#. :ref:`Creating a Security Cluster and Logging In to Manager ` +#. :ref:`Creating a Role and a User ` +#. :ref:`Running a MapReduce Program ` + +.. _mrs_09_0003__en-us_topic_0227922353_section14303124313558: + +Creating a Security Cluster and Logging In to Manager +----------------------------------------------------- + +#. Create a security cluster. For details, see :ref:`Creating a Custom Cluster `. Enable **Kerberos Authentication**, set **Password**, and confirm the password. This password is used to log in to Manager. Keep it secure. +#. Log in to the MRS console. +#. In the navigation pane on the left, choose **Active Clusters** and click the target cluster name on the right to access the cluster details page. +#. Click **Access Manager** on the right of **MRS Manager** to log in to Manager. + + - If you have bound an EIP when creating the cluster, perform the following operations: + + a. Add a security group rule. By default, your public IP address used for accessing port 9022 is filled in the rule. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + + .. note:: + + - It is normal that the automatically generated public IP address is different from your local IP address and no action is required. + - If port 9022 is a Knox port, you need to enable the permission to access port 9022 of Knox for accessing Manager. + + b. Select **I confirm that xx.xx.xx.xx is a trusted public IP address and MRS Manager can be accessed using this IP address.** + + - If you have not bound an EIP when creating the cluster, perform the following operations: + + a. Select an available EIP from the drop-down list or click **Manage EIP** to create one. + b. Add a security group rule. By default, your public IP address used for accessing port 9022 is filled in the rule. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + + .. note:: + + - It is normal that the automatically generated public IP address is different from the local IP address and no action is required. + - If port 9022 is a Knox port, you need to enable the permission of port 9022 to access Knox for accessing MRS Manager. + + c. Select **I confirm that xx.xx.xx.xx is a trusted public IP address and MRS Manager can be accessed using this IP address.** + +#. Click **OK**. The Manager login page is displayed. To assign permissions to other users to access Manager, add their public IP addresses as trusted ones by referring to :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. +#. Enter the default username **admin** and the password you set when creating the cluster, and click **Log In**. + +If the cluster version is earlier than MRS 1.8.0, perform the following steps: + +#. .. _mrs_09_0003__en-us_topic_0227922353_li3858195882716: + + Create a security cluster. For details, see . Enable **Kerberos Authentication** and set parameters including **Password** and **Confirm Password**. This password is used to log in to Manager. Keep it secure. + +#. .. _mrs_09_0003__en-us_topic_0227922353_en-us_topic_0046344332_li1519293110210: + + Log in to the MRS console and choose **Active Clusters**. + + .. note:: + + - For details about how to access Manager that supports Kerberos authentication, see :ref:`2 ` to :ref:`7 `, or see . 
+ - For analysis and streaming clusters, the methods of accessing Manager that supports Kerberos authentication are the same. + +#. .. _mrs_09_0003__en-us_topic_0227922353_li5015950919196: + + On the **Active Clusters** page, click the name of the security cluster you created. + + On the cluster details page, take note of the **AZ**, **VPC**, **Cluster Manager IP Address**, and **Default Security Group** of the master node. + +#. On the ECS console, create an ECS. + + - The **AZ**, **VPC**, and **Security Group** of the ECS must be the same as those of the cluster to be accessed. + - Select a Windows public image. + - For details about other configuration parameters, see **Elastic Cloud Server > User Guide > Getting Started > Creating and Logging In to a Windows ECS**. + + .. note:: + + If the security group of the ECS is different from **Default Security Group** of the master node, you can modify the configuration using either of the following methods: + + - Change the security group of the ECS to the default security group of the master node. For details, see **Elastic Cloud Server** > **User Guide** > **Security Groups** > **Changing a Security Group**. + - Add two security group rules to the security groups of the master and core nodes to enable the ECS to access the cluster. Set **Protocol** to **TCP** and **ports** of the two security group rules to **28443** and **20009**, respectively. For details, see **Virtual Private Cloud > User Guide > Security > Security Group > Adding a Security Group Rule**. + +#. On the VPC console, apply for an EIP and bind it to the ECS. + + For details, see **Virtual Private Cloud** > **User Guide** > **Elastic IP** > **Assigning an EIP and Binding It to an ECS**. + +#. Log in to the ECS. + + The Windows system account, password, EIP, and the security group rules are required for logging in to the ECS. For details, see **Elastic Cloud Server > User Guide > ECS Logins > Logging In to a Windows ECS**. + +#. .. _mrs_09_0003__en-us_topic_0227922353_en-us_topic_0046344332_li66227810104856: + + On the Windows remote desktop, use your browser to access Manager. + + For example, you can use Internet Explorer 11 in the Windows 2012 OS. + + The Manager access address is in the format of **https://**\ *Cluster Manager IP address*\ **:28443/web**. **Cluster Manager IP address** is the **Cluster Manager IP Address** obtained in :ref:`3 `. When you access Manager, you need to enter the MRS cluster username, for example, **admin**, and the password you set when enabling **Kerberos Authentication** during cluster creation in :ref:`1 `. + + .. note:: + + - If you access Manager with another MRS cluster username, change the password upon your first login. The new password must meet the requirements of the current password complexity policies. + - By default, an account is locked after five consecutive incorrect password attempts. It is automatically unlocked after 5 minutes. + +.. _mrs_09_0003__en-us_topic_0227922353_section14306114385510: + +Creating a Role and a User +-------------------------- + +For clusters with Kerberos authentication enabled, perform the following steps to create a user and assign permissions to the user to run programs. + +#. On Manager, choose **System** > **Permission** > **Role**. + +#. Click **Create Role**. For details, see :ref:`Creating a Role `. + + Specify the following information: + + - Enter a role name, for example, **mrrole**. 
+ - In **Configure Resource Permission**, select the cluster to be operated, choose **Yarn** > **Scheduler Queue** > **root**, and select **Submit** and **Admin** in the **Permission** column. After you finish configuration, do not click **OK** but click the name of the target cluster shown in the following figure and then configure other permissions. + - Choose **HBase** > **HBase Scope**. Locate the row that contains **global**, and select **create**, **read**, **write**, and **execute** in the **Permission** column. After you finish configuration, do not click **OK** but click the name of the target cluster shown in the following figure and then configure other permissions. + - Choose **HDFS** > **File System** > **hdfs://hacluster/** and select **Read**, **Write**, and **Execute** in the **Permission** column. After you finish configuration, do not click **OK** but click the name of the target cluster shown in the following figure and then configure other permissions. + - Choose **Hive** > **Hive Read Write Privileges**, select **Select**, **Delete**, **Insert**, and **Create** in the **Permission** column, and click **OK**. + +#. Choose **System**. In the navigation pane on the left, choose **Permission** > **User Group** > **Create User Group** to create a user group for the sample project, for example, **mrgroup**. For details, see :ref:`Creating a User Group `. + +#. Choose **System**. In the navigation pane on the left, choose **Permission** > **User** > **Create** to create a user for the sample project. For details, see :ref:`Creating a User `. + + - Enter a username, for example, **test**. If you want to run a Hive program, enter **hiveuser** in **Username**. + + - Set **User Type** to **Human-Machine**. + + - Enter a password. This password will be used when you run the program. + + - In **User Group**, add **mrgroup** and **supergroup**. + + - Set **Primary Group** to **supergroup** and bind the **mrrole** role to obtain the permission. + + Click **OK**. + +#. .. _mrs_09_0003__en-us_topic_0227922353_li96342010164419: + + Choose **System**. In the navigation pane on the left, choose **Permission** > **User**, locate the row where user **test** locates, and select **Download Authentication Credential** from the **More** drop-down list. Save the downloaded package and decompress it to obtain the **keytab** and **krb5.conf** files. + +.. _mrs_09_0003__en-us_topic_0227922353_section7307144375513: + +Running a MapReduce Program +--------------------------- + +This section describes how to run a MapReduce program in security cluster mode. + +**Prerequisites** + +You have compiled the program and prepared data files, for example, **mapreduce-examples-1.0.jar**, **input_data1.txt**, and **input_data2.txt**.. + +**Procedure** + +#. Use a remote login software (for example, MobaXterm) to log in to the master node of the security cluster using SSH (using the EIP). + +#. After the login is successful, run the following commands to create the **test** folder in the **/opt/Bigdata/client** directory and create the **conf** folder in the **test** directory: + + .. code-block:: + + cd /opt/Bigdata/client + mkdir test + cd test + mkdir conf + +#. Use an upload tool (for example, WinSCP) to copy **mapreduce-examples-1.0.jar**, **input_data1.txt**, and **input_data2.txt** to the **test** directory, and copy the **keytab** and **krb5.conf** files obtained in :ref:`5 ` in **Creating Roles and Users** to the **conf** directory. + +#. 
Run the following commands to configure environment variables and authenticate the created user, for example, **test**: + + .. code-block:: + + cd /opt/Bigdata/client + source bigdata_env + export YARN_USER_CLASSPATH=/opt/Bigdata/client/test/conf/ + kinit test + + Enter the password as prompted. If no error message is displayed (you need to change the password as prompted upon the first login), Kerberos authentication is complete. + +#. Run the following commands to import data to the HDFS: + + .. code-block:: + + cd test + hdfs dfs -mkdir /tmp/input + hdfs dfs -put input_data* /tmp/input + +#. Run the following commands to run the program: + + .. code-block:: + + yarn jar mapreduce-examples-1.0.jar xxx /tmp/input /tmp/mapreduce_output + + In the preceding commands: + + **/tmp/input** indicates the input path in the HDFS. + + **/tmp/mapreduce_output** indicates the output path in the HDFS. This directory must not exist. Otherwise, an error will be reported. + +#. After the program is executed successfully, run the **hdfs dfs -ls /tmp/mapreduce_output** command. The following command output is displayed. + + + .. figure:: /_static/images/en-us_image_0000001296058144.png + :alt: **Figure 1** Program running result + + **Figure 1** Program running result diff --git a/umn/source/overview/application_scenarios.rst b/umn/source/overview/application_scenarios.rst new file mode 100644 index 0000000..c8a318a --- /dev/null +++ b/umn/source/overview/application_scenarios.rst @@ -0,0 +1,69 @@ +:original_name: mrs_08_0004.html + +.. _mrs_08_0004: + +Application Scenarios +===================== + +Big data is ubiquitous in people's lives. MRS is suitable to process big data in the industries such as the Internet of things (IoT), e-commerce, finance, manufacturing, healthcare, energy, and government departments. + +Large-scale data analysis +------------------------- + +Large-scale data analysis is a major scenario in modern big data systems. Generally, an enterprise has multiple data sources. After data is accessed,extract, transform, and load (ETL) processing is required to generate modelized data for each service module to analyze and sort out data. This type of service has the following characteristics: + +- The requirements for real-time execution are not high, and job execution time ranges from dozens of minutes to hours. +- The data volume is large. +- There are various data sources and diversified formats. +- Data processing usually consists of multiple tasks, and resources need to be planned in detail. + +In the environmental protection industry, climate data is stored on OBS and periodically dumped into HDFS for batch analysis. 10 TB of climate data can be analyzed in 1 hour. + + +.. figure:: /_static/images/en-us_image_0000001349190341.png + :alt: **Figure 1** Large-scale data analysis in the environmental protection industry + + **Figure 1** Large-scale data analysis in the environmental protection industry + +MRS has the following advantages in this scenario. + +- Low cost: OBS offers cost-effective storage. +- Massive data analysis: TB/PB-level data is analyzed by Hive. +- Visualized data import and export tool: Loader exports data to Data Warehouse Service (DWS) for business intelligence (BI) analysis. + +Large-scale data storage +------------------------ + +A user who has a large amount of structured data usually requires index-based quasi-real-time query capabilities. For example, in an Internet of Vehicles (IoV) scenario, vehicle maintenance information is queried by vehicle number. 
Therefore, vehicle information is indexed based on vehicle numbers when it is being stored, to implement second-level response in this scenario. Generally, the data volume is large. The user may store data for one to three years. + +For example, in the IoV industry, an automobile company stores data on HBase, which supports PB-level storage and CDR queries in milliseconds. + + +.. figure:: /_static/images/en-us_image_0000001296750238.png + :alt: **Figure 2** Large-scale data storage in the IoV industry + + **Figure 2** Large-scale data storage in the IoV industry + +MRS has the following advantages in this scenario. + +- Real time: Kafka accesses massive amounts of vehicle messages in real time. +- Massive data storage: HBase stores massive volumes of data and supports data queries in milliseconds. +- Distributed data query: Spark analyzes and queries massive volumes of data. + +Real-time data processing +------------------------- + +Real-time data processing is usually used in scenarios such as anomaly detection, fraud detection, rule-based alarming, and service process monitoring. Data is processed while it is being ingested into the system. + +For example, in the Internet of elevators & escalators (IoEE) industry, data of smart elevators and escalators is imported to MRS streaming clusters in real time for real-time alarming. + + +.. figure:: /_static/images/en-us_image_0000001349390633.png + :alt: **Figure 3** Low-latency streaming processing in the IoEE industry + + **Figure 3** Low-latency streaming processing in the IoEE industry + +MRS has the following advantages in this scenario. + +- Real-time data ingestion: Flume implements real-time data ingestion and provides various data collection and storage access methods. +- Data source access: Kafka accesses data of tens of thousands of elevators and escalators in real time. diff --git a/umn/source/overview/components/alluxio.rst b/umn/source/overview/components/alluxio.rst new file mode 100644 index 0000000..2fa35e8 --- /dev/null +++ b/umn/source/overview/components/alluxio.rst @@ -0,0 +1,21 @@ +:original_name: mrs_08_0040.html + +.. _mrs_08_0040: + +Alluxio +======= + +Alluxio is a data orchestration technology for analytics and AI in the cloud. In the MRS big data ecosystem, Alluxio lies between computing and storage. It provides a data abstraction layer for computing frameworks including Apache Spark, Presto, MapReduce, and Apache Hive, so that upper-layer computing applications can access persistent storage systems including HDFS and OBS through unified client APIs and a global namespace. In this way, computing and storage are separated. + + +.. figure:: /_static/images/en-us_image_0000001296430738.png + :alt: **Figure 1** Alluxio architecture + + **Figure 1** Alluxio architecture + +Advantages: + +- Provides in-memory I/O throughput and makes elastically scaling data-driven applications cost-effective. +- Simplified cloud and object storage access +- Simplified data management and a single point of access to multiple data sources +- Easy application deployment diff --git a/umn/source/overview/components/carbondata.rst b/umn/source/overview/components/carbondata.rst new file mode 100644 index 0000000..ba04758 --- /dev/null +++ b/umn/source/overview/components/carbondata.rst @@ -0,0 +1,38 @@ +:original_name: mrs_08_0015.html + +.. _mrs_08_0015: + +CarbonData +========== + +CarbonData is a new Apache Hadoop native data-store format.
CarbonData allows faster interactive queries over petabytes of data using advanced columnar storage, index, compression, and encoding techniques to improve computing efficiency. In addition, CarbonData is a high-performance analysis engine that integrates data sources with Spark. + + +.. figure:: /_static/images/en-us_image_0000001349110485.png + :alt: **Figure 1** Basic architecture of CarbonData + + **Figure 1** Basic architecture of CarbonData + +The purpose of using CarbonData is to provide quick responses to ad hoc queries of big data. Essentially, CarbonData is an Online Analytical Processing (OLAP) engine, which stores data in tables similar to those in a Relational Database Management System (RDBMS). You can import more than 10 TB of data to tables created in CarbonData format, and CarbonData automatically organizes and stores the data using compressed multi-dimensional indexes. After data is loaded to CarbonData, CarbonData responds to ad hoc queries in seconds. + +CarbonData integrates data sources into the Spark ecosystem. You can use Spark SQL to query and analyze data, or use the third-party tool ThriftServer provided by Spark to connect to Spark SQL. + +**CarbonData features** + +- SQL: CarbonData is compatible with Spark SQL and supports SQL query operations performed on Spark SQL. +- Simple table dataset definition: CarbonData allows you to define and create datasets by using user-friendly Data Definition Language (DDL) statements. CarbonData DDL is flexible and easy to use, and can define complex tables. +- Easy data management: CarbonData provides various data management functions for data loading and maintenance. It can load historical data and incrementally load new data. The loaded data can be deleted based on the loading time, and specific data loading operations can be canceled. +- Columnar file format: The CarbonData file format is a columnar store in HDFS. It has many features that a modern columnar format has, such as splittability and compression schemes. + +**Unique features of CarbonData** + +- Stores data along with indexes: This significantly accelerates queries and reduces I/O scans and CPU usage when the query contains filters. The CarbonData index consists of multiple levels of indexes. A processing framework can leverage the index to reduce the number of tasks it needs to schedule and process, and can also perform skip scans at a finer-grained unit (called a blocklet) during task-side scanning instead of scanning the whole file. +- Operable encoded data: By supporting efficient compression and global encoding schemes, CarbonData can run queries directly on compressed/encoded data and convert the data only just before returning results to users, which is known as "late materialization". +- Supports various use cases with a single data format, such as interactive OLAP-style queries, sequential access (big scans), and random access (narrow scans). + +**Key technologies and advantages of CarbonData** + +- Quick query response: CarbonData features high-performance queries. Its query speed is 10 times that of Spark SQL. It uses dedicated data formats and applies multiple index technologies, global dictionary encoding, and multiple push-down optimizations, providing quick responses to TB-level data queries. +- Efficient data compression: CarbonData compresses data by combining lightweight and heavyweight compression algorithms. This reduces data storage space by 60% to 80% and significantly lowers hardware storage costs.
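The following Spark SQL sketch illustrates this workflow from an MRS client. It is an example only: the table name, column definitions, and HDFS path are illustrative and not part of the product documentation.

.. code-block::

   -- Create a table stored in CarbonData format.
   CREATE TABLE IF NOT EXISTS sales_carbon (
       order_id STRING,
       product_id STRING,
       quantity INT,
       price DOUBLE,
       order_time TIMESTAMP
   )
   STORED AS carbondata;

   -- Load historical data; new data can be loaded incrementally later.
   LOAD DATA INPATH 'hdfs://hacluster/tmp/sales_2021.csv' INTO TABLE sales_carbon;

   -- Ad hoc filter and aggregation queries benefit from the multi-dimensional indexes.
   SELECT product_id, SUM(quantity) AS total_quantity
   FROM sales_carbon
   WHERE order_time >= '2021-01-01'
   GROUP BY product_id;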
+ +For details about CarbonData architecture and principles, see https://carbondata.apache.org/. diff --git a/umn/source/overview/components/clickhouse.rst b/umn/source/overview/components/clickhouse.rst new file mode 100644 index 0000000..e554d70 --- /dev/null +++ b/umn/source/overview/components/clickhouse.rst @@ -0,0 +1,129 @@ +:original_name: mrs_08_0108.html + +.. _mrs_08_0108: + +ClickHouse +========== + +Introduction to ClickHouse +-------------------------- + +ClickHouse is an open-source columnar database oriented to online analysis and processing. It is independent of the Hadoop big data system and features ultimate compression rate and fast query performance. In addition, ClickHouse supports SQL query and provides good query performance, especially the aggregation analysis and query performance based on large and wide tables. The query speed is one order of magnitude faster than that of other analytical databases. + +The core functions of ClickHouse are as follows: + +**Comprehensive DBMS functions** + +ClickHouse has comprehensive database management functions, including the basic functions of a Database Management System (DBMS): + +- Data Definition Language (DDL): allows databases, tables, and views to be dynamically created, modified, or deleted without restarting services. +- Data Manipulation Language (DML): allows data to be queried, inserted, modified, or deleted dynamically. +- Permission control: supports user-based database or table operation permission settings to ensure data security. +- Data backup and restoration: supports data backup, export, import, and restoration to meet the requirements of the production environment. +- Distributed management: provides the cluster mode to automatically manage multiple database nodes. + +**Column-based storage and data compression** + +ClickHouse is a database that uses column-based storage. Data is organized by column. Data in the same column is stored together, and data in different columns is stored in different files. + +During data query, columnar storage can reduce the data scanning range and data transmission size, thereby improving data query efficiency. + +In a traditional row-based database system, data is stored in the sequence in :ref:`Table 1 `: + +.. _mrs_08_0108__table149821347311: + +.. table:: **Table 1** Row-based database + + === =========== ==== ===== ===== =============== + row ID Flag Name Event Time + === =========== ==== ===== ===== =============== + 0 12345678901 0 name1 1 2020/1/11 15:19 + 1 32345678901 1 name2 1 2020/5/12 18:10 + 2 42345678901 1 name3 1 2020/6/13 17:38 + N ... ... ... ... ... + === =========== ==== ===== ===== =============== + +In a row-based database, data in the same row is physically stored together. In a column-based database system, data is stored in the sequence in :ref:`Table 2 `: + +.. _mrs_08_0108__table1835171820320: + +.. table:: **Table 2** Columnar database + + ====== =============== =============== =============== === + row: 0 1 2 N + ID: 12345678901 32345678901 42345678901 ... + Flag: 0 1 1 ... + Name: name1 name2 name3 ... + Event: 1 1 1 ... + Time: 2020/1/11 15:19 2020/5/12 18:10 2020/6/13 17:38 ... + ====== =============== =============== =============== === + +This example shows only the arrangement of data in a columnar database. Columnar databases store data in the same column together and data in different columns separately. Columnar databases are more suitable for online analytical processing (OLAP) scenarios. 
+ +**Vectorized executor** + +ClickHouse uses CPU's Single Instruction Multiple Data (SIMD) to implement vectorized execution. SIMD is an implementation mode that uses a single instruction to operate multiple pieces of data and improves performance with data parallelism (other methods include instruction-level parallelism and thread-level parallelism). The principle of SIMD is to implement parallel data operations at the CPU register level. + +**Relational model and SQL query** + +ClickHouse uses SQL as the query language and provides standard SQL query APIs for existing third-party analysis visualization systems to easily integrate with ClickHouse. + +In addition, ClickHouse uses a relational model. Therefore, the cost of migrating the system built on a traditional relational database or data warehouse to ClickHouse is lower. + +**Data sharding and distributed query** + +The ClickHouse cluster consists of one or more shards, and each shard corresponds to one ClickHouse service node. The maximum number of shards depends on the number of nodes (one shard corresponds to only one service node). + +ClickHouse introduces the concepts of local table and distributed table. A local table is equivalent to a data shard. A distributed table itself does not store any data. It is an access proxy of the local table and functions as the sharding middleware. With the help of distributed tables, multiple data shards can be accessed by using the proxy, thereby implementing distributed query. + +ClickHouse Applications +----------------------- + +ClickHouse is short for Click Stream and Data Warehouse. It is initially applied to a web traffic analysis tool to perform OLAP analysis for data warehouses based on page click event flows. Currently, ClickHouse is widely used in Internet advertising, app and web traffic analysis, telecommunications, finance, and Internet of Things (IoT) fields. It is applicable to business intelligence application scenarios and has a large number of applications and practices worldwide. For details, visit https://clickhouse.tech/docs/en/introduction/adopters/. + +ClickHouse Enhanced Open Source Features +---------------------------------------- + +MRS ClickHouse has advantages such as automatic cluster mode, HA deployment, and smooth and elastic scaling. + +- Automatic Cluster Mode + + As shown in :ref:`Figure 1 `, a cluster consists of multiple ClickHouse nodes, which has no central node. It is more of a static resource pool. If the ClickHouse cluster mode is used for services, you need to pre-define the cluster information in the configuration file of each node. Only in this way, services can be correctly accessed. + + .. _mrs_08_0108__fig79238920553: + + .. figure:: /_static/images/en-us_image_0000001349110441.png + :alt: **Figure 1** ClickHouse cluster + + **Figure 1** ClickHouse cluster + + Users are unaware of data partitions and replica storage in common database systems. However, ClickHouse allows you to proactively plan and define detailed configurations such as shards, partitions, and replica locations. The ClickHouse instance of MRS packs the work in a unified manner and adapts it to the automatic mode, implementing unified management, which is flexible and easy to use. A ClickHouse instance consists of three ZooKeeper nodes and multiple ClickHouse nodes. The Dedicated Replica mode is used to ensure high reliability of dual data copies. + + + .. 
figure:: /_static/images/en-us_image_0000001349390609.png + :alt: **Figure 2** ClickHouse cluster structure + + **Figure 2** ClickHouse cluster structure + +- Smooth and Elastic Scaling + + As business grows rapidly, MRS provides ClickHouse, a data migration tool, for scenarios such as the cluster's storage capacity or CPU compute resources approaching the limit. This tool is used to migrate some partitions of one or multiple MergeTree tables on several ClickHouseServer nodes to the same tables on other ClickHouseServer nodes. In this way, service availability is ensured and smooth capacity expansion is implemented. + + When you add ClickHouse nodes to a cluster, use this tool to migrate some data from the existing nodes to the new ones for data balancing after the expansion. + + |image1| + +- HA Deployment Architecture + + MRS uses the ELB-based high availability (HA) deployment architecture to automatically distribute user access traffic to multiple backend nodes, expanding service capabilities to external systems and improving fault tolerance. As shown in :ref:`Figure 3 `, when a client application requests a cluster, Elastic Load Balance (ELB) is used to distribute traffic. With the ELB polling mechanism, data is written to local tables and read from distributed tables on different nodes. In this way, data read/write load and high availability of application access are guaranteed. + + After the ClickHouse cluster is provisioned, each ClickHouse instance node in the cluster corresponds to a replica, and two replicas form a logical shard. For example, when creating a ReplicatedMergeTree table, you can specify shards so that data can be automatically synchronized between two replicas in the same shard. + + .. _mrs_08_0108__fig15273873411: + + .. figure:: /_static/images/en-us_image_0000001296590598.png + :alt: **Figure 3** HA deployment architecture + + **Figure 3** HA deployment architecture + +.. |image1| image:: /_static/images/en-us_image_0000001296270774.png diff --git a/umn/source/overview/components/dbservice/dbservice_basic_principles.rst b/umn/source/overview/components/dbservice/dbservice_basic_principles.rst new file mode 100644 index 0000000..181a1e9 --- /dev/null +++ b/umn/source/overview/components/dbservice/dbservice_basic_principles.rst @@ -0,0 +1,45 @@ +:original_name: mrs_08_00601.html + +.. _mrs_08_00601: + +DBService Basic Principles +========================== + +Overview +-------- + +DBService is a HA storage system for relational databases, which is applicable to the scenario where a small amount of data (about 10 GB) needs to be stored, for example, component metadata. DBService can only be used by internal components of a cluster and provides data storage, query, and deletion functions. + +DBService is a basic component of a cluster. Components such as Hive, Hue, Oozie, Loader, and Redis, and Loader store their metadata in DBService, and provide the metadata backup and restoration functions by using DBService. + +DBService Architecture +---------------------- + +DBService in the cluster works in active/standby mode. Two DBServer instances are deployed and each instance contains three modules: HA, Database, and FloatIP. + +:ref:`Figure 1 ` shows the DBService logical architecture. + +.. _mrs_08_00601__fig12670195704016: + +.. figure:: /_static/images/en-us_image_0000001349390605.png + :alt: **Figure 1** DBService architecture + + **Figure 1** DBService architecture + +:ref:`Table 1 ` describes the modules shown in :ref:`Figure 1 ` + +.. 
_mrs_08_00601__table51425253145337: + +.. table:: **Table 1** Module description + + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | Description | + +==========+=====================================================================================================================================================================================================================+ + | HA | HA management module. The active/standby DBServer uses the HA module for management. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Database | Database module. This module stores the metadata of the Client module. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FloatIP | Floating IP address that provides the access function externally. It is enabled only on the active DBServer instance and is used by the Client module to access Database. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Client | Client using the DBService component, which is deployed on the component instance node. The client connects to the database by using FloatIP and then performs metadata adding, deleting, and modifying operations. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/overview/components/dbservice/index.rst b/umn/source/overview/components/dbservice/index.rst new file mode 100644 index 0000000..4d512aa --- /dev/null +++ b/umn/source/overview/components/dbservice/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_08_0060.html + +.. _mrs_08_0060: + +DBService +========= + +- :ref:`DBService Basic Principles ` +- :ref:`Relationship Between DBService and Other Components ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + dbservice_basic_principles + relationship_between_dbservice_and_other_components diff --git a/umn/source/overview/components/dbservice/relationship_between_dbservice_and_other_components.rst b/umn/source/overview/components/dbservice/relationship_between_dbservice_and_other_components.rst new file mode 100644 index 0000000..f6c4866 --- /dev/null +++ b/umn/source/overview/components/dbservice/relationship_between_dbservice_and_other_components.rst @@ -0,0 +1,8 @@ +:original_name: mrs_08_00602.html + +.. _mrs_08_00602: + +Relationship Between DBService and Other Components +=================================================== + +DBService is a basic component of a cluster. Components such as Hive, Hue, Oozie, Loader, Metadata, and Redis, and Loader store their metadata in DBService, and provide the metadata backup and restoration functions by using DBService. 
diff --git a/umn/source/overview/components/flink/flink_basic_principles.rst b/umn/source/overview/components/flink/flink_basic_principles.rst new file mode 100644 index 0000000..6e35177 --- /dev/null +++ b/umn/source/overview/components/flink/flink_basic_principles.rst @@ -0,0 +1,172 @@ +:original_name: mrs_08_00341.html + +.. _mrs_08_00341: + +Flink Basic Principles +====================== + +Overview +-------- + +Flink is a unified computing framework that supports both batch processing and stream processing. It provides a stream data processing engine that supports data distribution and parallel computing. Flink features stream processing and is a top open source stream processing engine in the industry. + +Flink provides high-concurrency pipeline data processing, millisecond-level latency, and high reliability, making it extremely suitable for low-latency data processing. + +:ref:`Figure 1 ` shows the technology stack of Flink. + +.. _mrs_08_00341__fca1fea71ad8a4d748cee96d9d10bc4a6: + +.. figure:: /_static/images/en-us_image_0000001296750242.png + :alt: **Figure 1** Technology stack of Flink + + **Figure 1** Technology stack of Flink + +Flink provides the following features in the current version: + +- DataStream +- Checkpoint +- Window +- Job Pipeline +- Configuration Table + +Other features are inherited from the open source community and are not enhanced. + +Flink Architecture +------------------ + +:ref:`Figure 2 ` shows the Flink architecture. + +.. _mrs_08_00341__f58539f3d230744ce84f0255e8938c4e7: + +.. figure:: /_static/images/en-us_image_0000001349390641.png + :alt: **Figure 2** Flink architecture + + **Figure 2** Flink architecture + +As shown in the above figure, the entire Flink system consists of three parts: + +- Client + + Flink client is used to submit jobs (streaming jobs) to Flink. + +- TaskManager + + TaskManager is a service execution node of Flink. It executes specific tasks. A Flink system can have multiple TaskManagers. These TaskManagers are equivalent to each other. + +- JobManager + + JobManager is a management node of Flink. It manages all TaskManagers and schedules tasks submitted by users to specific TaskManagers. In high-availability (HA) mode, multiple JobManagers are deployed. Among these JobManagers, one is selected as the active JobManager, and the others are standby. + +Flink Principles +---------------- + +- **Stream & Transformation & Operator** + + A Flink program consists of two building blocks: stream and transformation. + + #. Conceptually, a stream is a (potentially never-ending) flow of data records, and a transformation is an operation that takes one or more streams as input, and produces one or more output streams as a result. + + #. When a Flink program is executed, it is mapped to a streaming dataflow. A streaming dataflow consists of a group of streams and transformation operators. Each dataflow starts with one or more source operators and ends in one or more sink operators. A dataflow resembles a directed acyclic graph (DAG). + + :ref:`Figure 3 ` shows the streaming dataflow to which a Flink program is mapped. + + .. _mrs_08_00341__f852ce1184f7f465c9382dd784e68f028: + + .. figure:: /_static/images/en-us_image_0000001296590630.png + :alt: **Figure 3** Example of Flink DataStream + + **Figure 3** Example of Flink DataStream + + As shown in :ref:`Figure 3 `, **FlinkKafkaConsumer** is a source operator; Map, KeyBy, TimeWindow, and Apply are transformation operators; RollingSink is a sink operator. 
+ +- **Pipeline Dataflow** + + Applications in Flink can be executed in parallel or distributed modes. A stream can be divided into one or more stream partitions, and an operator can be divided into multiple operator subtasks. + + The executor of streams and operators are automatically optimized based on the density of upstream and downstream operators. + + - Operators with low density cannot be optimized. Each operator subtask is separately executed in different threads. The number of operator subtasks is the parallelism of that particular operator. The parallelism (the total number of partitions) of a stream is that of its producing operator. Different operators of the same program may have different levels of parallelism, as shown in :ref:`Figure 4 `. + + .. _mrs_08_00341__f0f57e24f8dce442b97c4155409c65695: + + .. figure:: /_static/images/en-us_image_0000001296750246.png + :alt: **Figure 4** Operator + + **Figure 4** Operator + + - Operators with high density can be optimized. Flink chains operator subtasks together into a task, that is, an operator chain. Each operator chain is executed by one thread on TaskManager, as shown in :ref:`Figure 5 `. + + .. _mrs_08_00341__fig1619693311205: + + .. figure:: /_static/images/en-us_image_0000001296430770.png + :alt: **Figure 5** Operator chain + + **Figure 5** Operator chain + + - In the upper part of :ref:`Figure 5 `, the condensed Source and Map operators are chained into an Operator Chain, that is, a larger operator. The Operator Chain, KeyBy, and Sink all represent an operator respectively and are connected with each other through streams. Each operator corresponds to one task during the running. Namely, there are three tasks in the upper part. + - In the lower part of :ref:`Figure 5 `, each task, except Sink, is paralleled into two subtasks. The parallelism of the Sink operator is one. + +Key Features +------------ + +- Stream processing + + The real-time stream processing engine features high throughput, high performance, and low latency, which can provide processing capability within milliseconds. + +- Various status management + + The stream processing application needs to store the received events or intermediate result in a certain period of time for subsequent access and processing at a certain time point. Flink provides diverse features for status management, including: + + - Multiple basic status types: Flink provides various states for data structures, such as ValueState, ListState, and MapState. Users can select the most efficient and suitable status type based on the service model. + - Rich State Backend: State Backend manages the status of applications and performs Checkpoint operations as required. Flink provides different State Backends. State can be stored in the memory or RocksDB, and supports the asynchronous and incremental Checkpoint mechanism. + - Exactly-once state consistency: The Checkpoint and fault recovery capabilities of Flink ensure that the application status of tasks is consistent before and after a fault occurs. Flink supports transactional output for some specific storage devices. In this way, exactly-once output can be ensured even when a fault occurs. + +- Various time semantics + + Time is an important part of stream processing applications. For real-time stream processing applications, operations such as window aggregation, detection, and matching based on time semantics are very common. Flink provides various time semantics. 
+ + - Event-time: The timestamp provided by the event is used for calculation, making it easier to process the events that arrive at a random sequence or arrive late. + - Watermark: Flink introduces the concept of Watermark to measure the development of event time. Watermark also provides flexible assurance for balancing processing latency and data integrity. When processing event streams with Watermark, Flink provides multiple processing options if data arrives after the calculation, for example, redirecting data (side output) or updating the calculation result. + - Processing-time and Ingestion-time are supported. + - Highly flexible streaming window: Flink supports the time window, count window, session window, and data-driven customized window. You can customize the triggering conditions to implement the complex streaming calculation mode. + +- Fault tolerance mechanism + + In a distributed system, if a single task or node breaks down or is faulty, the entire task may fail. Flink provides a task-level fault tolerance mechanism, which ensures that user data is not lost when an exception occurs in a task and can be automatically restored. + + - Checkpoint: Flink implements fault tolerance based on checkpoint. Users can customize the checkpoint policy for the entire task. When a task fails, the task can be restored to the status of the latest checkpoint and data after the snapshot is resent from the data source. + - Savepoint: A savepoint is a consistent snapshot of application status. The savepoint mechanism is similar to that of checkpoint. However, the savepoint mechanism needs to be manually triggered. The savepoint mechanism ensures that the status information of the current stream application is not lost during task upgrade or migration, facilitating task suspension and recovery at any time point. + +- Flink SQL + + Table APIs and SQL use Apache Calcite to parse, verify, and optimize queries. Table APIs and SQL can be seamlessly integrated with DataStream and DataSet APIs, and support user-defined scalar functions, aggregation functions, and table value functions. The definition of applications such as data analysis and ETL is simplified. The following code example shows how to use Flink SQL statements to define a counting application that records session times. + + .. code-block:: + + SELECT userId, COUNT(*) + FROM clicks + GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId + +- CEP in SQL + + Flink allows users to represent complex event processing (CEP) query results in SQL for pattern matching and evaluate event streams on Flink. + + CEP SQL is implemented through the **MATCH_RECOGNIZE** SQL syntax. The **MATCH_RECOGNIZE** clause is supported by Oracle SQL since Oracle Database 12c and is used to indicate event pattern matching in SQL. The following is an example of CEP SQL: + + .. code-block:: + + SELECT T.aid, T.bid, T.cid + FROM MyTable + MATCH_RECOGNIZE ( + PARTITION BY userid + ORDER BY proctime + MEASURES + A.id AS aid, + B.id AS bid, + C.id AS cid + PATTERN (A B C) + DEFINE + A AS name = 'a', + B AS name = 'b', + C AS name = 'c' + ) AS T diff --git a/umn/source/overview/components/flink/flink_enhanced_open_source_features/flink_cep_in_sql.rst b/umn/source/overview/components/flink/flink_enhanced_open_source_features/flink_cep_in_sql.rst new file mode 100644 index 0000000..6ddc375 --- /dev/null +++ b/umn/source/overview/components/flink/flink_enhanced_open_source_features/flink_cep_in_sql.rst @@ -0,0 +1,114 @@ +:original_name: mrs_08_00349.html + +.. 
_mrs_08_00349: + +Flink CEP in SQL +================ + + +Flink CEP in SQL +---------------- + +Flink allows users to represent complex event processing (CEP) query results in SQL for pattern matching and evaluate event streams on Flink engines. + +SQL Query Syntax +---------------- + +CEP SQL is implemented through the **MATCH_RECOGNIZE** SQL syntax. The **MATCH_RECOGNIZE** clause is supported by Oracle SQL since Oracle Database 12c and is used to indicate event pattern matching in SQL. Apache Calcite also supports the **MATCH_RECOGNIZE** clause. + +Flink uses Calcite to analyze SQL query results. Therefore, this operation complies with the Apache Calcite syntax. + +.. code-block:: + + MATCH_RECOGNIZE ( + [ PARTITION BY expression [, expression ]* ] + [ ORDER BY orderItem [, orderItem ]* ] + [ MEASURES measureColumn [, measureColumn ]* ] + [ ONE ROW PER MATCH | ALL ROWS PER MATCH ] + [ AFTER MATCH + ( SKIP TO NEXT ROW + | SKIP PAST LAST ROW + | SKIP TO FIRST variable + | SKIP TO LAST variable + | SKIP TO variable ) + ] + PATTERN ( pattern ) + [ WITHIN intervalLiteral ] + [ SUBSET subsetItem [, subsetItem ]* ] + DEFINE variable AS condition [, variable AS condition ]* + ) + +The syntax elements of the **MATCH_RECOGNIZE** clause are defined as follows: + +(Optional) **-PARTITION BY**: defines partition columns. This clause is optional. If this parameter is not defined, the parallelism 1 is used. + +(Optional) **-ORDER BY**: defines the sequence of events in a data flow. The **ORDER BY** clause is optional. If it is ignored, non-deterministic sorting is used. Since the order of events is important in pattern matching, this clause should be specified in most cases. + +(Optional) **-MEASURES**: specifies the attribute value of the successfully matched event. + +(Optional) **-ONE ROW PER MATCH \| ALL ROWS PER MATCH**: defines how to output the result. **ONE ROW PER MATCH** indicates that only one row is output for each matching. **ALL ROWS PER MATCH** indicates that one row is output for each matching event. + +(Optional) **-AFTER MATCH**: specifies the start position for processing after the next pattern is successfully matched. + +**-PATTERN**: defines the matching pattern as a regular expression. The following operators can be used in the **PATTERN** clause: join operators, quantifier operators (``*``, +, ?, {n}, {n,}, {n,m}, and {,m}), branch operators (vertical bar \|), and differential operators ('{- -}'). + +(Optional) **-WITHIN**: outputs a pattern clause match only when the match occurs within the specified time. + +(Optional) **-SUBSET**: combines one or more associated variables defined in the **DEFINE** clause. + +**-DEFINE**: specifies the Boolean condition, which defines the variables used in the **PATTERN** clause. + +In addition, the **MATCH_RECOGNIZE** clause supports the following functions: + +**-MATCH_NUMBER()**: Used in the **MEASURES** clause to allocate the same number to each row that is successfully matched. + +**-CLASSIFIER()**: Used in the **MEASURES** clause to indicate the mapping between matched rows and variables. + +**-FIRST()** and **LAST()**: Used in the **MEASURES** clause to return the value of the expression evaluated in the first or last row of the row set mapped to the schema variable. + +**-NEXT()** and **PREV()**: Used in the **DEFINE** clause to evaluate an expression using the previous or next row in a partition. + +**-RUNNING** and **FINAL** keywords: Used to determine the semantics required for aggregation. 
**RUNNING** can be used in the **MEASURES** and **DEFINE** clauses, whereas **FINAL** can be used only in the **MEASURES** clause. + +- Aggregate functions (**COUNT**, **SUM**, **AVG**, **MAX**, **MIN**): Used in the **MEASURES** and **DEFINE** clauses. + +Query Example +------------- + +The following query finds the V-shaped pattern in the stock price data flow. + +.. code-block:: + + SELECT * + FROM MyTable + MATCH_RECOGNIZE ( + ORDER BY rowtime + MEASURES + STRT.name as s_name, + LAST(DOWN.name) as down_name, + LAST(UP.name) as up_name + ONE ROW PER MATCH + PATTERN (STRT DOWN+ UP+) + DEFINE + DOWN AS DOWN.v < PREV(DOWN.v), + UP AS UP.v > PREV(UP.v) + ) + +In the following query, the aggregate function **AVG** is used in the **MEASURES** clause of **SUBSET E** consisting of variables related to A and C. + +.. code-block:: + + SELECT * + FROM Ticker + MATCH_RECOGNIZE ( + MEASURES + AVG(E.price) AS avgPrice + ONE ROW PER MATCH + AFTER MATCH SKIP PAST LAST ROW + PATTERN (A B+ C) + SUBSET E = (A,C) + DEFINE + A AS A.price < 30, + B AS B.price < 20, + C AS C.price < 30 + ) diff --git a/umn/source/overview/components/flink/flink_enhanced_open_source_features/index.rst b/umn/source/overview/components/flink/flink_enhanced_open_source_features/index.rst new file mode 100644 index 0000000..73b24ab --- /dev/null +++ b/umn/source/overview/components/flink/flink_enhanced_open_source_features/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_08_00344.html + +.. _mrs_08_00344: + +Flink Enhanced Open Source Features +=================================== + +- :ref:`Window ` +- :ref:`Job Pipeline ` +- :ref:`Stream SQL Join ` +- :ref:`Flink CEP in SQL ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + window + job_pipeline + stream_sql_join + flink_cep_in_sql diff --git a/umn/source/overview/components/flink/flink_enhanced_open_source_features/job_pipeline.rst b/umn/source/overview/components/flink/flink_enhanced_open_source_features/job_pipeline.rst new file mode 100644 index 0000000..20bb877 --- /dev/null +++ b/umn/source/overview/components/flink/flink_enhanced_open_source_features/job_pipeline.rst @@ -0,0 +1,190 @@ +:original_name: mrs_08_00346.html + +.. _mrs_08_00346: + +Job Pipeline +============ + +Enhanced Open Source Feature: Job Pipeline +------------------------------------------ + +Generally, logic code related to a service is stored in a large JAR package, which is called Fat JAR. Disadvantages of Fat JAR are as follows: + +- When service logic becomes more and more complex, the size of the Fat JAR increases. +- Fat Jar makes coordination complex. Developers of all services are working with the same service logic. Even though the service logic can be divided into several modules, all modules are tightly coupled with each other. If the requirement needs to be changed, the entire flow diagram needs to be replanned. + +Splitting of jobs is facing the following problems: + +- Data transmission between jobs can be achieved using Kafka. For example, job A transmits data to the topic A in Kafka, and then job B and job C read data from the topic A in Kafka. This solution is simple and easy to implement, but the latency is always longer than 100 ms. +- Operators are connected using the TCP protocol. In distributed environment, operators can be scheduled to any node and upstream and downstream services cannot detect the scheduling. + +**Job Pipeline** + +A pipeline consists of multiple Flink jobs connected through TCP. Upstream jobs can send data to downstream jobs. 
The flow diagram about data transmission is called a job pipeline, as shown in :ref:`Figure 1 `. + +.. _mrs_08_00346__f9376509c971b4e4b99bd76fd4ce17b63: + +.. figure:: /_static/images/en-us_image_0000001349390637.png + :alt: **Figure 1** Job pipeline + + **Figure 1** Job pipeline + +**Job Pipeline Principles** + + +.. figure:: /_static/images/en-us_image_0000001349110473.png + :alt: **Figure 2** Job pipeline principles + + **Figure 2** Job pipeline principles + +- NettySink and NettySource + + In a pipeline, upstream jobs and downstream jobs communicate with each other through Netty. The Sink operator of the upstream job works as a server and the Source operator of the downstream job works as a client. The Sink operator of the upstream job is called NettySink, and the Source operator of the downstream job is called NettySource. + +- NettyServer and NettyClient + + NettySink functions as the server of Netty. In NettySink, NettyServer achieves the function of a server. NettySource functions as the client of Netty. In NettySource, NettyClient achieves the function of a client. + +- Publisher + + The job that sends data to downstream jobs through NettySink is called a publisher. + +- Subscriber + + The job that receives data from upstream jobs through NettySource is called a subscriber. + +- RegisterServer + + RegisterServer is the third-party memory that stores the IP address, port number, and concurrency information about NettyServer. + +- The general outside-in architecture is as follows: + + - NettySink->NettyServer->NettyServerHandler + - NettySource->NettyClient->NettyClientHandler + +**Job Pipeline Functions** + +- **NettySink** + + NettySink consists of the following major modules: + + - RichParallelSinkFunction + + NettySink inherits RichParallelSinkFunction and attributes of Sink operators. The RichParallelSinkFunction API implements following functions: + + - Starts the NettySink operator. + - Runs the NettySink operator and receives data from the upstream operator. + - Cancels the running of NettySink operators. + + Following information can be obtained using the attribute of RichParallelSinkFunction: + + - subtaskIndex about the concurrency of each NettySink operator. + - Concurrency of the NettySink operator. + + - RegisterServerHandler + + RegisterServerHandler interacts with the component of RegisterServer and defines following APIs: + + - **start();**: Starts the RegisterServerHandler and establishes a contact with the third-party RegisterServer. + - **createTopicNode();**: Creates a topic node. + - **register();**: Registers information such as the IP address, port number, and concurrency to the topic node. + - **deleteTopicNode();**: Deletes a topic node. + - **unregister();**: Deletes registration information. + - **query();**: Queries registration information. + - **isExist();**: Verifies that a specific piece of information exists. + - **shutdown();**: Disables the RegisterServerHandler and disconnects from the third-party RegisterServer. + + .. note:: + + - RegisterServerHandler API enables ZooKeeper to work as the handler of RegisterServer. You can customize your handler as required. Information is stored in ZooKeeper in the following form: + + .. code-block:: + + Namespace + |---Topic-1 + |---parallel-1 + |---parallel-2 + |.... + |---parallel-n + |---Topic-2 + |---parallel-1 + |---parallel-2 + |.... + |---parallel-m + |... + + - Information about NameSpace can be obtained from the following parameters of the **flink-conf.yaml** file: + + .. 
code-block:: + + nettyconnector.registerserver.topic.storage: /flink/nettyconnector + + - The simple authentication and security layer (SASL) authentication between ZookeeperRegisterServerHandler and ZooKeeper is implemented through the Flink framework. + + - Ensure that each job has a unique topic. Otherwise, the subscription relationship may be unclear. + + - When calling **shutdown()**, ZookeeperRegisterServerHandler deletes the registration information about the current concurrency, and then attempts to delete the topic node. If the topic node is not empty, deletion will be canceled, because not all concurrency has exited. + + - NettyServer + + NettyServer is the core of the NettySink operator, whose main function is to create a NettyServer and receive connection requests from NettyClient. Use NettyServerHandler to send data received from upstream operators of a same job. The port number and subnet of NettyServer needs to be configured in the **flink-conf.yaml** file. + + - Port range + + .. code-block:: + + nettyconnector.sinkserver.port.range: 28444-28943 + + - Subnet + + .. code-block:: + + nettyconnector.sinkserver.subnet: 10.162.222.123/24 + + .. note:: + + The **nettyconnector.sinkserver.subnet** parameter is set to the subnet (service IP address) of the Flink client by default. If the client and TaskManager are not in the same subnet, an error may occur. Therefore, you need to manually set this parameter to the subnet (service IP address) of TaskManager. + + - NettyServerHandler + + The handler enables the interaction between NettySink and subscribers. After NettySink receives messages, the handler sends these messages out. To ensure data transmission security, this channel is encrypted using SSL. The **nettyconnector.ssl.enabled** configures whether to enable SSL encryption. The SSL encryption is enabled only when **nettyconnector.ssl.enabled** is set to **true**. + +- **NettySource** + + NettySource consists of the following major modules: + + - RichParallelSourceFunction + + NettySource inherits RichParallelSinkFunction and attributes of Source operators. The RichParallelSourceFunction API implements following functions: + + - Starts the NettySink operator. + - Runs the NettySink operator, receives data from subscribers, and injects the data to jobs. + - Cancels the running of Source operators. + + Following information can be obtained using the attribute of RichParallelSourceFunction: + + - subtaskIndex about the concurrency of each NettySource operator. + - Concurrency of the NettySource operator. + + When the NettySource operator enters the running stage, the NettyClient status is monitored. Once abnormality occurs, NettyClient is restarted and reconnected to NettyServer, preventing data confusion. + + - RegisterServerHandler + + RegisterServerHandler of NettySource has similar function as the RegisterServerHandler of NettySink. It obtains the IP address, port number, and information of concurrent operators of each subscribed job obtained in the NettySource operator. + + - NettyClient + + NettyClient establishes a connection with NettyServer and uses NettyClientHandler to receive data. Each NettySource operator must have a unique name (specified by the user). NettyServer determines whether each client comes from different NettySources based on unique names. When a connection is established between NettyClient and NettyServer, NettyClient is registered with NettyServer and the NettySource name of NettyClient is transferred to NettyServer. 
+ + - NettyClientHandler + + The NettyClientHandler enables the interaction with publishers and other operators of the job. When messages are received, NettyClientHandler transfers these messages to the job. To ensure secure data transmission, SSL encryption is enabled for the communication with NettySink. The SSL encryption is enabled only when SSL is enabled and **nettyconnector.ssl.enabled** is set to **true**. + +The relationship between the jobs may be many-to-many. The concurrency between each NettySink and NettySource operator is one-to-many, as shown in :ref:`Figure 3 `. + +.. _mrs_08_00346__f06f4b6e5263b4e2780ee1688783def5f: + +.. figure:: /_static/images/en-us_image_0000001349190345.png + :alt: **Figure 3** Relationship diagram + + **Figure 3** Relationship diagram diff --git a/umn/source/overview/components/flink/flink_enhanced_open_source_features/stream_sql_join.rst b/umn/source/overview/components/flink/flink_enhanced_open_source_features/stream_sql_join.rst new file mode 100644 index 0000000..92016ab --- /dev/null +++ b/umn/source/overview/components/flink/flink_enhanced_open_source_features/stream_sql_join.rst @@ -0,0 +1,41 @@ +:original_name: mrs_08_00348.html + +.. _mrs_08_00348: + +Stream SQL Join +=============== + +Enhanced Open Source Feature: Stream SQL Join +--------------------------------------------- + +Flink's Table API&SQL is an integrated query API for Scala and Java that allows the composition of queries from relational operators such as selection, filter, and join in an intuitive way. For details about Table API&SQL, visit the official website at https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/index.html. + +**Introduction to Stream SQL Join** + +SQL Join is used to query data based on the relationship between columns in two or more tables. Flink Stream SQL Join allows you to join two streaming tables and query results from them. Queries similar to the following are supported: + +.. code-block:: + + SELECT o.proctime, o.productId, o.orderId, s.proctime AS shipTime + FROM Orders AS o + JOIN Shipments AS s + ON o.orderId = s.orderId + AND o.proctime BETWEEN s.proctime AND s.proctime + INTERVAL '1' HOUR; + +Currently, Stream SQL Join needs to be performed within a specified window. The join operation for data within the window requires at least one equi-join predicate and a join condition that bounds the time on both sides. Such a condition can be defined by two appropriate range predicates (**<**, **<=**, **>=**, **>**), a **BETWEEN** predicate, or a single equality predicate that compares the same type of time attributes (such as processing time or event time) of both input tables. + +The following example will join all orders with their corresponding shipments if the order was shipped four hours after the order was received. + +.. code-block:: + + SELECT * + FROM Orders o, Shipments s + WHERE o.id = s.orderId AND + o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime + +.. note:: + + #. Stream SQL Join supports only inner join. + #. The **ON** clause should include an equal join condition. + #. Time attributes support only the processing time and event time. + #. The window condition supports only the bounded time range, for example, **o.proctime BETWEEN s.proctime - INTERVAL '1' HOUR AND s.proctime + INTERVAL '1' HOUR**. The unbounded range such as **o. proctime > s.proctime** is not supported. The **proctime** attribute of two streams must be included. 
**o.proctime BETWEEN proctime () AND proctime () + 1** is not supported. diff --git a/umn/source/overview/components/flink/flink_enhanced_open_source_features/window.rst b/umn/source/overview/components/flink/flink_enhanced_open_source_features/window.rst new file mode 100644 index 0000000..8d7fa67 --- /dev/null +++ b/umn/source/overview/components/flink/flink_enhanced_open_source_features/window.rst @@ -0,0 +1,73 @@ +:original_name: mrs_08_00345.html + +.. _mrs_08_00345: + +Window +====== + +Enhanced Open Source Feature: Window +------------------------------------ + +This section describes the sliding window of Flink and provides the sliding window optimization method. For details about windows, visit https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html. + +**Introduction to Window** + +Data in a window is saved as intermediate results or original data. If you perform a sum operation (**window(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds(5))).sum**) on data in the window, only the intermediate result will be retained. If a custom window (**window(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds(5))).apply(new UDF)**) is used, all original data in the window will be saved. + +If custom windows **SlidingEventTimeWindow** and **SlidingProcessingTimeWindow** are used, data is saved as multiple backups. Assume that the window is defined as follows: + +.. code-block:: + + window(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds(5))).apply(new UDFWindowFunction) + +If a block of data arrives, it is assigned to four different windows (20/5 = 4). That is, the data is saved as four copies in the memory. When the window size or sliding period is set to a large value, data will be saved as excessive copies, causing redundancy. + + +.. figure:: /_static/images/en-us_image_0000001349390625.png + :alt: **Figure 1** Original structure of a window + + **Figure 1** Original structure of a window + +If a data block arrives at the 102nd second, it is assigned to windows [85, 105), [90, 110), [95, 115), and [100, 120). + +**Window Optimization** + +As mentioned in the preceding, there are excessive data copies when original data is saved in SlidingEventTimeWindow and SlidingProcessingTimeWindow. To resolve this problem, the window that stores the original data is restructured, which optimizes the storage and greatly lowers the storage space. The window optimization scheme is as follows: + +#. Use the sliding period as a unit to divide a window into different panes. + + A window consists of one or multiple panes. A pane is essentially a sliding period. For example, the sliding period (namely, the pane) of **window(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds.of(5)))** lasts for 5 seconds. If this window ranges from [100, 120), this window can be divided into panes [100, 105), [105, 110), [110, 115), and [115, 120). + + + .. figure:: /_static/images/en-us_image_0000001296430762.png + :alt: **Figure 2** Window optimization + + **Figure 2** Window optimization + +#. When a data block arrives, it is not assigned to a specific window. Instead, Flink determines the pane to which the data block belongs based on the timestamp of the data block, and saves the data block into the pane. + + A data block is saved only in one pane. In this case, only a data copy exists in the memory. + + + .. 
figure:: /_static/images/en-us_image_0000001296590610.png + :alt: **Figure 3** Saving data in a window + + **Figure 3** Saving data in a window + +#. To trigger a window, compute all panes contained in the window, and combine all these panes into a complete window. + + + .. figure:: /_static/images/en-us_image_0000001349110457.png + :alt: **Figure 4** Triggering a window + + **Figure 4** Triggering a window + +#. If a pane is not required, you can delete it from the memory. + + + .. figure:: /_static/images/en-us_image_0000001349309913.png + :alt: **Figure 5** Deleting a window + + **Figure 5** Deleting a window + +After optimization, the quantity of data copies in the memory and snapshot is greatly reduced. diff --git a/umn/source/overview/components/flink/flink_ha_solution.rst b/umn/source/overview/components/flink/flink_ha_solution.rst new file mode 100644 index 0000000..2a6fbcb --- /dev/null +++ b/umn/source/overview/components/flink/flink_ha_solution.rst @@ -0,0 +1,69 @@ +:original_name: mrs_08_00342.html + +.. _mrs_08_00342: + +Flink HA Solution +================= + + +Flink HA Solution +----------------- + +A Flink cluster has only one JobManager. This has the risks of single point of failures (SPOFs). There are three modes of Flink: Flink On Yarn, Flink Standalone, and Flink Local. Flink On Yarn and Flink Standalone modes are based on clusters and Flink Local mode is based on a single node. Flink On Yarn and Flink Standalone provide an HA mechanism. With such a mechanism, you can recover the JobManager from failures and thereby eliminate SPOF risks. This section describes the HA mechanism of the Flink On Yarn. + +Flink supports the HA mode and job exception recovery that highly depend on ZooKeeper. If you want to enable the two functions, configure ZooKeeper in the **flink-conf.yaml** file in advance as follows: + +.. code-block:: + + high-availability: zookeeper + high-availability.zookeeper.quorum: ZooKeeper IP address:2181 + high-availability.storageDir: hdfs:///flink/recovery + +**Flink On Yarn** + +Flink JobManager and Yarn ApplicationMaster are in the same process. Yarn ResourceManager monitors ApplicationMaster. If ApplicationMaster is abnormal, Yarn restarts it and restores all JobManager metadata from HDFS. During the recovery, existing tasks cannot run and new tasks cannot be submitted. ZooKeeper stores JobManager metadata, such as information about jobs, to be used by the new JobManager. A TaskManager failure is listened and processed by the DeathWatch mechanism of Akka on JobManager. When a TaskManager fails, a container is requested again from Yarn and a TaskManager is created. + +For more information about the HA solution of Flink on Yarn, visit `https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html `__. + +For details about how to set **yarn-site.xml**, visit https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/jobmanager_high_availability.html. + +**Standalone** + +In the standalone mode, multiple JobManagers can be started and ZooKeeper elects one as the leader JobManager. In this mode, there is a leader JobManager and multiple standby JobManagers. If the leader JobManager fails, a standby JobManager takes over the leadership. :ref:`Figure 1 ` shows the process of a leader/standby JobManager switchover. + +.. _mrs_08_00342__f5525c98e300e4d9b946da5d43305371c: + +.. 
figure:: /_static/images/en-us_image_0000001296270790.png + :alt: **Figure 1** Switchover process + + **Figure 1** Switchover process + +**Restoring TaskManager** + +A TaskManager failure is listened and processed by the DeathWatch mechanism of Akka on JobManager. If the TaskManager fails, the JobManager creates a TaskManager and migrates services to the created TaskManager. + +**Restoring JobManager** + +Flink JobManager and Yarn ApplicationMaster are in the same process. Yarn ResourceManager monitors ApplicationMaster. If ApplicationMaster is abnormal, Yarn restarts it and restores all JobManager metadata from HDFS. During the recovery, existing tasks cannot run and new tasks cannot be submitted. + +**Restoring Jobs** + +If you want to restore jobs, ensure that the startup policy is configured in Flink configuration files. Supported restart policies are **fixed-delay**, **failure-rate**, and **none**. Jobs can be restored only when the policy is configured to **fixed-delay** or **failure-rate**. If the restart policy is configured to **none** and checkpoint is configured for jobs, the restart policy is automatically configured to **fixed-delay** and the value of **restart-strategy.fixed-delay.attempts** (which specifies the number of retry times) is configured to **Integer.MAX_VALUE**. + +For details about the three strategies, visit https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/task_failure_recovery.html. The following is an example of the restart policy configuration: + +.. code-block:: + + restart-strategy: fixed-delay + restart-strategy.fixed-delay.attempts: 3 + restart-strategy.fixed-delay.delay: 10 s + +Jobs will be restored in the following scenarios: + +- If a JobManager fails, all its jobs are stopped, and will be recovered after another JobManager is created and running. +- If a TaskManager fails, all tasks on the TaskManager are stopped, and will be started until there are available resources. +- When a task of a job fails, the job is restarted. + + .. note:: + + For details about how to configure the restart policy of a job, visit https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/jobmanager_high_availability.html. diff --git a/umn/source/overview/components/flink/index.rst b/umn/source/overview/components/flink/index.rst new file mode 100644 index 0000000..03c1d81 --- /dev/null +++ b/umn/source/overview/components/flink/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_08_0034.html + +.. _mrs_08_0034: + +Flink +===== + +- :ref:`Flink Basic Principles ` +- :ref:`Flink HA Solution ` +- :ref:`Relationship with Other Components ` +- :ref:`Flink Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + flink_basic_principles + flink_ha_solution + relationship_with_other_components + flink_enhanced_open_source_features/index diff --git a/umn/source/overview/components/flink/relationship_with_other_components.rst b/umn/source/overview/components/flink/relationship_with_other_components.rst new file mode 100644 index 0000000..f1cb39a --- /dev/null +++ b/umn/source/overview/components/flink/relationship_with_other_components.rst @@ -0,0 +1,26 @@ +:original_name: mrs_08_00343.html + +.. _mrs_08_00343: + +Relationship with Other Components +================================== + +Relationship between Flink and Yarn +----------------------------------- + +Flink supports Yarn-based cluster management mode. In this mode, Flink serves as an application of Yarn and runs on Yarn. + +:ref:`Figure 1 ` shows how Flink interacts with Yarn. + +.. 
_mrs_08_00343__f8d08e179442e449abad5d92b707fe130: + +.. figure:: /_static/images/en-us_image_0000001349390773.png + :alt: **Figure 1** Flink interaction with Yarn + + **Figure 1** Flink interaction with Yarn + +#. The Flink YARN Client first checks whether there are sufficient resources for starting the Yarn cluster. If yes, the Flink Yarn client uploads JAR packages and configuration files to HDFS. +#. Flink Yarn client communicates with Yarn ResourceManager to request a container for starting ApplicationMaster. After all Yarn NodeManagers finish downloading the JAR package and configuration files, the ApplicationMaster is started. +#. During the startup, the ApplicationMaster interacts with the Yarn ResourceManager to request the container for starting a TaskManager. After the container is ready, the TaskManager process is started. +#. In the Flink Yarn cluster, the ApplicationMaster and Flink JobManager are running in the same container. The ApplicationMaster informs each TaskManager of the RPC address of the JobManager. After TaskManagers are started successfully, they register with the JobManager. +#. After all TaskManagers have registered with the JobManager successfully, Flink starts up in the Yarn cluster. Then, the Flink Yarn client can submit Flink jobs to the JobManager, and Flink can perform mapping, scheduling, and computing for the jobs. diff --git a/umn/source/overview/components/flume/flume_basic_principles.rst b/umn/source/overview/components/flume/flume_basic_principles.rst new file mode 100644 index 0000000..5c1e3cf --- /dev/null +++ b/umn/source/overview/components/flume/flume_basic_principles.rst @@ -0,0 +1,89 @@ +:original_name: mrs_08_00161.html + +.. _mrs_08_00161: + +Flume Basic Principles +====================== + +`Flume `__ is a distributed, reliable, and HA system that supports massive log collection, aggregation, and transmission. Flume supports customization of various data senders in the log system for data collection. In addition, Flume can roughly process data and write data to various data receivers (customizable). A Flume-NG is a branch of Flume. It is simple, small, and easy to deploy. The following figure shows the basic architecture of the Flume-NG. + + +.. figure:: /_static/images/en-us_image_0000001296590606.png + :alt: **Figure 1** Flume-NG architecture + + **Figure 1** Flume-NG architecture + +A Flume-NG consists of agents. Each agent consists of three components (source, channel, and sink). A source is used for receiving data. A channel is used for transmitting data. A sink is used for sending data to the next end. + +.. table:: **Table 1** Module description + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Description | + +===================================+=========================================================================================================================================================================================================+ + | Source | A source receives data or generates data by using a special mechanism, and places the data in batches in one or more channels. The source can work in data-driven or polling mode. 
| + | | | + | | Typical source types are as follows: | + | | | + | | - Sources that are integrated with the system, such as Syslog and Netcat | + | | - Sources that automatically generate events, such as Exec and SEQ | + | | - IPC sources that are used for communication between agents, such as Avro | + | | | + | | A source must be associated with at least one channel. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Channel | A channel is used to buffer data between a source and a sink. The channel caches data from the source and deletes that data after the sink sends the data to the next-hop channel or final destination. | + | | | + | | Different channels provide different persistence levels. | + | | | + | | - Memory channel: non-persistency | + | | - File channel: Write-Ahead Logging (WAL)-based persistence | + | | - JDBC channel: persistency implemented based on the embedded database | + | | | + | | The channel supports the transaction feature to ensure simple sequential operations. A channel can work with sources and sinks of any quantity. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Sink | A sink sends data to the next-hop channel or final destination. Once completed, the transmitted data is removed from the channel. | + | | | + | | Typical sink types are as follows: | + | | | + | | - Sinks that send storage data to the final destination, such as HDFS and HBase | + | | - Sinks that are consumed automatically, such as Null Sink | + | | - IPC sinks used for communication between Agents, such as Avro | + | | | + | | A sink must be associated with a specific channel. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +As shown in :ref:`Figure 2 `, a Flume client can have multiple sources, channels, and sinks. + +.. _mrs_08_00161__fig13368976181048: + +.. figure:: /_static/images/en-us_image_0000001349390617.png + :alt: **Figure 2** Flume structure + + **Figure 2** Flume structure + +The reliability of Flume depends on transaction switchovers between agents. If the next agent breaks down, the channel stores data persistently and transmits data until the agent recovers. The availability of Flume depends on the built-in load balancing and failover mechanisms. Both the channel and agent can be configured with multiple entities between which they can use load balancing policies. Each agent is a Java Virtual Machine (JVM) process. A server can have multiple agents. Collection nodes (for example, Agents 1, 2, 3) process logs. Aggregation nodes (for example, Agent 4) write the logs into HDFS. The agent of each collection node can select multiple aggregation nodes for load balancing. + + +.. figure:: /_static/images/en-us_image_0000001296270782.png + :alt: **Figure 3** Flume cascading + + **Figure 3** Flume cascading + +For details about Flume architecture and principles, see https://flume.apache.org/releases/1.9.0.html. 
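The source, channel, and sink described above are wired together in an agent properties file. The following is a minimal illustrative sketch for a single agent (the agent name **a1**, the listening port, and the HDFS path are hypothetical placeholders, not values from this document; an actual MRS deployment is normally configured through the Flume configuration tool):

.. code-block::

   # Name the components of this hypothetical agent.
   a1.sources = r1
   a1.channels = c1
   a1.sinks = k1

   # Source: receives newline-separated events over TCP (see "Source" in Table 1).
   a1.sources.r1.type = netcat
   a1.sources.r1.bind = 0.0.0.0
   a1.sources.r1.port = 44444
   a1.sources.r1.channels = c1

   # Channel: buffers events in memory (non-persistent, see "Channel" in Table 1).
   a1.channels.c1.type = memory
   a1.channels.c1.capacity = 10000
   a1.channels.c1.transactionCapacity = 1000

   # Sink: writes events to HDFS (see "Sink" in Table 1).
   a1.sinks.k1.type = hdfs
   a1.sinks.k1.hdfs.path = hdfs://hacluster/flume/events/%Y%m%d
   a1.sinks.k1.hdfs.fileType = DataStream
   a1.sinks.k1.hdfs.useLocalTimeStamp = true
   a1.sinks.k1.channel = c1

In a cascading deployment such as the one in Figure 3, the collection agents would instead use an Avro sink pointing at the aggregation agent's Avro source, with multiple sinks grouped for load balancing or failover.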
+ +Principle +--------- + +**Reliability Between Agents** + +:ref:`Figure 4 ` shows the data exchange between agents. + +.. _mrs_08_00161__fig4453710117500: + +.. figure:: /_static/images/en-us_image_0000001296430750.png + :alt: **Figure 4** Data transmission process + + **Figure 4** Data transmission process + +#. Flume ensures reliable data transmission based on transactions. When data flows from one agent to another agent, the two transactions take effect. The sink of Agent 1 (agent that sends a message) needs to obtain a message from a channel and sends the message to Agent 2 (agent that receives the message). If Agent 2 receives and successfully processes the message, Agent 1 will submit a transaction, indicating a successful and reliable data transmission. +#. When Agent 2 receives the message sent by Agent 1 and starts a new transaction, after the data is processed successfully (written to a channel), Agent 2 submits the transaction and sends a success response to Agent 1. +#. Before a commit operation, if the data transmission fails, the last transcription starts and retransmits the data that fails to be transmitted last time. The commit operation has written the transaction into a disk. Therefore, the last transaction can continue after the process fails and restores. diff --git a/umn/source/overview/components/flume/flume_enhanced_open_source_features.rst b/umn/source/overview/components/flume/flume_enhanced_open_source_features.rst new file mode 100644 index 0000000..ff566e4 --- /dev/null +++ b/umn/source/overview/components/flume/flume_enhanced_open_source_features.rst @@ -0,0 +1,15 @@ +:original_name: mrs_08_00163.html + +.. _mrs_08_00163: + +Flume Enhanced Open Source Features +=================================== + + +Flume Enhanced Open Source Features +----------------------------------- + +- Improving transmission speed: Multiple lines instead of only one line of data can be specified as an event. This improves the efficiency of code execution and reduces the times of disk writes. +- Transferring ultra-large binary files: According to the current memory usage, Flume automatically adjusts the memory used for transferring ultra-large binary files to prevent out-of-memory. +- Supporting the customization of preparations before and after transmission: Flume supports customized scripts to be run before or after transmission for making preparations. +- Managing client alarms: Flume receives Flume client alarms through MonitorServer and reports the alarms to the alarm management center on MRS Manager. diff --git a/umn/source/overview/components/flume/index.rst b/umn/source/overview/components/flume/index.rst new file mode 100644 index 0000000..1fec9e4 --- /dev/null +++ b/umn/source/overview/components/flume/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_08_0016.html + +.. _mrs_08_0016: + +Flume +===== + +- :ref:`Flume Basic Principles ` +- :ref:`Relationship Between Flume and Other Components ` +- :ref:`Flume Enhanced Open Source Features ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + flume_basic_principles + relationship_between_flume_and_other_components + flume_enhanced_open_source_features diff --git a/umn/source/overview/components/flume/relationship_between_flume_and_other_components.rst b/umn/source/overview/components/flume/relationship_between_flume_and_other_components.rst new file mode 100644 index 0000000..ff5c549 --- /dev/null +++ b/umn/source/overview/components/flume/relationship_between_flume_and_other_components.rst @@ -0,0 +1,16 @@ +:original_name: mrs_08_00162.html + +.. _mrs_08_00162: + +Relationship Between Flume and Other Components +=============================================== + +Relationship Between Flume and HDFS +----------------------------------- + +If HDFS is configured as the Flume sink, HDFS functions as the final data storage system of Flume. Flume installs, configures, and writes all transmitted data into HDFS. + +Relationship Between Flume and HBase +------------------------------------ + +If HBase is configured as the Flume sink, HBase functions as the final data storage system of Flume. Flume writes all transmitted data into HBase based on configurations. diff --git a/umn/source/overview/components/hbase/hbase_basic_principles.rst b/umn/source/overview/components/hbase/hbase_basic_principles.rst new file mode 100644 index 0000000..64e7a29 --- /dev/null +++ b/umn/source/overview/components/hbase/hbase_basic_principles.rst @@ -0,0 +1,142 @@ +:original_name: mrs_08_00101.html + +.. _mrs_08_00101: + +HBase Basic Principles +====================== + +HBase undertakes data storage. HBase is an open source, column-oriented, distributed storage system that is suitable for storing massive amounts of unstructured or semi-structured data. It features high reliability, high performance, and flexible scalability, and supports real-time data read/write. For more information about HBase, see https://hbase.apache.org/. + +Typical features of a table stored in HBase are as follows: + +- Big table (BigTable): One table contains hundred millions of lines and millions of columns. +- Column-oriented: Column-oriented storage, retrieval, and permission control +- Sparse: Null columns in the table do not occupy any storage space. + +The HBase component of MRS separates computing from storage. Data can be stored in cloud storage services at low cost, for example, Object Storage Service (OBS), and can be backed up across AZs. MRS supports secondary indexes for HBase and allows adding indexes for column values to filter data by column through native HBase APIs. + +HBase architecture +------------------ + +An HBase cluster consists of active and standby HMaster processes and multiple RegionServer processes, as shown in :ref:`Figure 1 `. + +.. _mrs_08_00101__fb06af549b46f4156a4efed4bde54463e: + +.. figure:: /_static/images/en-us_image_0000001296590642.png + :alt: **Figure 1** HBase architecture + + **Figure 1** HBase architecture + +.. 
table:: **Table 1** Module description + + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Description | + +===================================+=======================================================================================================================================================================================================================================================================================================+ + | Master | Master is also called HMaster. In HA mode, HMaster consists of an active HMaster and a standby HMaster. | + | | | + | | - Active Master: manages RegionServer in HBase, including the creation, deletion, modification, and query of a table, balances the load of RegionServer, adjusts the distribution of Region, splits Region and distributes Region after it is split, and migrates Region after RegionServer expires. | + | | - Standby Master: takes over services when the active HMaster is faulty. The original active HMaster demotes to the standby HMaster after the fault is rectified. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Client | Client communicates with Master for management and with RegionServer for data protection by using the Remote Procedure Call (RPC) mechanism of HBase. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | RegionServer | RegionServer provides read and write services of table data as a data processing and computing unit in HBase. | + | | | + | | RegionServer is deployed with DataNodes of HDFS clusters to store data. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ZooKeeper cluster | ZooKeeper provides distributed coordination services for processes in HBase clusters. Each RegionServer is registered with ZooKeeper so that the active Master can obtain the health status of each RegionServer. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | HDFS cluster | HDFS provides highly reliable file storage services for HBase. All HBase data is stored in the HDFS. 
| + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +HBase Principles +---------------- + +- **HBase Data Model** + + HBase stores data in tables, as shown in :ref:`Figure 2 `. Data in a table is divided into multiple Regions, which are allocated by Master to RegionServers for management. + + Each Region contains data within a RowKey range. An HBase data table contains only one Region at first. As the number of data increases and reaches the upper limit of the Region capacity, the Region is split into two Regions. You can define the RowKey range of a Region when creating a table or define the Region size in the configuration file. + + .. _mrs_08_00101__fig29501314327: + + .. figure:: /_static/images/en-us_image_0000001440610749.png + :alt: **Figure 2** HBase data model + + **Figure 2** HBase data model + + .. table:: **Table 2** Concepts + + +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Description | + +===============+============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | RowKey | Similar to the primary key in a relationship table, which is the unique ID of the data in each row. A RowKey can be a string, integer, or binary string. All records are stored after being sorted by RowKey. | + +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Timestamp | The timestamp of a data operation. Data can be specified with different versions by time stamp. Data of different versions in each cell is stored by time in descending order. 
| + +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cell | Minimum storage unit of HBase, consisting of keys and values. A key consists of six fields, namely row, column family, column qualifier, timestamp, type, and MVCC version. Values are the binary data objects. | + +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Column Family | One or multiple horizontal column families form a table. A column family can consist of multiple random columns. A column is a label under a column family, which can be added as required when data is written. The column family supports dynamic expansion so the number and type of columns do not need to be predefined. Columns of a table in HBase are sparsely distributed. The number and type of columns in different rows can be different. Each column family has the independent time to live (TTL). You can lock the row only. Operations on the row in a column family are the same as those on other rows. | + +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Column | Similar to traditional databases, HBase tables also use columns to store data of the same type. | + +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **RegionServer Data Storage** + + RegionServer manages the regions allocated by HMaster. 
:ref:`Figure 3 ` shows the data storage structure of RegionServer. + + .. _mrs_08_00101__fd8c05b438f4e419a845ac4086068129e: + + .. figure:: /_static/images/en-us_image_0000001349190357.png + :alt: **Figure 3** RegionServer data storage structure + + **Figure 3** RegionServer data storage structure + + :ref:`Table 3 ` lists each component of Region described in :ref:`Figure 3 `. + + .. _mrs_08_00101__td05a58b8249240a58b063a9ccb1f780c: + + .. table:: **Table 3** Region structure description + + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Description | + +===========+=================================================================================================================================================================================================================================================================+ + | Store | A Region consists of one or multiple Stores. Each Store maps a column family in :ref:`Figure 2 `. | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | MemStore | A Store contains one MemStore. The MemStore caches data inserted to a Region by the client. When the MemStore capacity reaches the upper limit, RegionServer flushes data in MemStore to the HDFS. | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | StoreFile | The data flushed to the HDFS is stored as a StoreFile in the HDFS. As more data is inserted, multiple StoreFiles are generated in a Store. When the number of StoreFiles reaches the upper limit, RegionServer merges multiple StoreFiles into a big StoreFile. | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | HFile | HFile defines the storage format of StoreFiles in a file system. HFile is the underlying implementation of StoreFile. | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | HLog | HLogs prevent data loss when RegionServer is faulty. Multiple Regions in a RegionServer share the same HLog. | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Metadata Table** + + The metadata table is a special HBase table, which is used by the client to locate a region. 
Metadata table includes **hbase:meta** table to record region information of user tables, such as the region location and start and end RowKey. + + :ref:`Figure 4 ` shows the mapping relationship between metadata tables and user tables. + + .. _mrs_08_00101__f1dd74070230f4fd99e391a49363263c2: + + .. figure:: /_static/images/en-us_image_0000001349309941.png + :alt: **Figure 4** Mapping relationships between metadata tables and user tables + + **Figure 4** Mapping relationships between metadata tables and user tables + +- **Data Operation Process** + + :ref:`Figure 5 ` shows the HBase data operation process. + + .. _mrs_08_00101__fa04dc85032574524af4fa8ab21b5642b: + + .. figure:: /_static/images/en-us_image_0000001296430786.png + :alt: **Figure 5** Data processing + + **Figure 5** Data processing + + #. When you add, delete, modify, and query HBase data, the HBase client first connects to ZooKeeper to obtain information about the RegionServer where the **hbase:meta** table is located. If you modify the namespace, such as creating and deleting a table, you need to access HMaster to update the meta information. + #. The HBase client connects to the RegionServer where the region of the **hbase:meta** table is located and obtains the RegionServer location where the region of the user table resides. + #. Then the HBase client connects to the RegionServer where the region of the user table is located and issues a data operation command to the RegionServer. The RegionServer executes the command. + + To improve data processing efficiency, the HBase client caches region information of the **hbase:meta** table and user table. When an application initiates a second data operation, the HBase client queries the region information from the memory. If no match is found in the memory, the HBase client performs the preceding operations to obtain region information. diff --git a/umn/source/overview/components/hbase/hbase_enhanced_open_source_features.rst b/umn/source/overview/components/hbase/hbase_enhanced_open_source_features.rst new file mode 100644 index 0000000..172a814 --- /dev/null +++ b/umn/source/overview/components/hbase/hbase_enhanced_open_source_features.rst @@ -0,0 +1,262 @@ +:original_name: mrs_08_00104.html + +.. _mrs_08_00104: + +HBase Enhanced Open Source Features +=================================== + +HIndex +------ + +HBase is a distributed storage database of the Key-Value type. Data of a table is sorted in the alphabetic order based on row keys. If you query data based on a specified row key or scan data in the scale of a specified row key, HBase can quickly locate the target data, enhancing the efficiency. + +However, in most actual scenarios, you need to query the data of which the column value is *XXX*. HBase provides the Filter feature to query data with a specific column value. All data is scanned in the order of row keys, and then the data is matched with the specific column value until the required data is found. The Filter feature scans some unnecessary data to obtain the only required data. Therefore, the Filter feature cannot meet the requirements of frequent queries with high performance standards. + +HBase HIndex is designed to address these issues. HBase HIndex enables HBase to query data based on specific column values. + + +.. figure:: /_static/images/en-us_image_0000001388353450.png + :alt: **Figure 1** HIndex + + **Figure 1** HIndex + +- Rolling upgrade is not supported for index data. 
+ +- Restrictions of combined indexes: + + - All columns involved in combined indexes must be entered or deleted in a single mutation. Otherwise, inconsistency will occur. + + Index: **IDX1=>cf1:[q1->datatype],[q2];cf2:[q2->datatype]** + + Correct write operations: + + .. code-block:: + + Put put = new Put(Bytes.toBytes("row")); + put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"), Bytes.toBytes("valueA")); + put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q2"), Bytes.toBytes("valueB")); + put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q2"), Bytes.toBytes("valueC")); + table.put(put); + + Incorrect write operations: + + .. code-block:: + + Put put1 = new Put(Bytes.toBytes("row")); + put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"), Bytes.toBytes("valueA")); + table.put(put1); + Put put2 = new Put(Bytes.toBytes("row")); + put2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q2"), Bytes.toBytes("valueB")); + table.put(put2); + Put put3 = new Put(Bytes.toBytes("row")); + put3.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q2"), Bytes.toBytes("valueC")); + table.put(put3); + + - The combined conditions-based query is supported only when the combined index column contains filter criteria, or StartRow and StopRow are not specified for some index columns. + + Index: **IDX1=>cf1:[q1->datatype],[q2];cf2:[q1->datatype]** + + Correct query operations: + + .. code-block:: + + scan 'table', {FILTER=>"SingleColumnValueFilter('cf1','q1',>=,'binary:valueA',true,true) AND SingleColumnValueFilter('cf1','q2',>=,'binary:valueB',true,true) AND SingleColumnValueFilter('cf2','q1',>=,'binary:valueC',true,true) "} + + scan 'table', {FILTER=>"SingleColumnValueFilter('cf1','q1',=,'binary:valueA',true,true) AND SingleColumnValueFilter('cf1','q2',>=,'binary:valueB',true,true)" } + + scan 'table', {FILTER=>"SingleColumnValueFilter('cf1','q1',>=,'binary:valueA',true,true) AND SingleColumnValueFilter('cf1','q2',>=,'binary:valueB',true,true) AND SingleColumnValueFilter('cf2','q1',>=,'binary:valueC',true,true)",STARTROW=>'row001',STOPROW=>'row100'} + + Incorrect query operations: + + .. code-block:: + + scan 'table', {FILTER=>"SingleColumnValueFilter('cf1','q1',>=,'binary:valueA',true,true) AND SingleColumnValueFilter('cf1','q2',>=,'binary:valueB',true,true) AND SingleColumnValueFilter('cf2','q1',>=,'binary:valueC',true,true) AND SingleColumnValueFilter('cf2','q2',>=,'binary:valueD',true,true)"} + + scan 'table', {FILTER=>"SingleColumnValueFilter('cf1','q1',=,'binary:valueA',true,true) AND SingleColumnValueFilter('cf2','q1',>=,'binary:valueC',true,true)" } + + scan 'table', {FILTER=>"SingleColumnValueFilter('cf1','q1',=,'binary:valueA',true,true) AND SingleColumnValueFilter('cf2','q2',>=,'binary:valueD',true,true)" } + + scan 'table', {FILTER=>"SingleColumnValueFilter('cf1','q1',=,'binary:valueA',true,true) AND SingleColumnValueFilter('cf1','q2',>=,'binary:valueB',true,true)" ,STARTROW=>'row001',STOPROW=>'row100' } + +- Do not explicitly configure any split policy for tables with index data. + +- Other mutation operations, such as **increment** and **append**, are not supported. + +- Index of the column with **maxVersions** greater than 1 is not supported. + +- The data index column in a row cannot be updated. + + Index 1: **IDX1=>cf1:[q1->datatype],[q2];cf2:[q1->datatype]** + + Index 2: **IDX2=>cf2:[q2->datatype]** + + Correct update operations: + + .. 
code-block:: + + Put put1 = new Put(Bytes.toBytes("row")); + put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"), Bytes.toBytes("valueA")); + put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q2"), Bytes.toBytes("valueB")); + put1.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q1"), Bytes.toBytes("valueC")); + put1.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q2"), Bytes.toBytes("valueD")); + table.put(put1); + + Put put2 = new Put(Bytes.toBytes("row")); + put2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q3"), Bytes.toBytes("valueE")); + put2.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q3"), Bytes.toBytes("valueF")); + table.put(put2); + + Incorrect update operations: + + .. code-block:: + + Put put1 = new Put(Bytes.toBytes("row")); + put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"), Bytes.toBytes("valueA")); + put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q2"), Bytes.toBytes("valueB")); + put1.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q1"), Bytes.toBytes("valueC")); + put1.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q2"), Bytes.toBytes("valueD")); + table.put(put1); + + Put put2 = new Put(Bytes.toBytes("row")); + put2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"), Bytes.toBytes("valueA_new")); + put2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q2"), Bytes.toBytes("valueB_new")); + put2.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q1"), Bytes.toBytes("valueC_new")); + put2.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q2"), Bytes.toBytes("valueD_new")); + table.put(put2); + +- The table to which an index is added cannot contain a value greater than 32 KB. + +- If user data is deleted due to the expiration of the column-level TTL, the corresponding index data is not deleted immediately. It will be deleted in the major compaction operation. + +- The TTL of the user column family cannot be modified after the index is created. + + - If the TTL of a column family increases after an index is created, delete the index and re-create one. Otherwise, some generated index data will be deleted before user data is deleted. + - If the TTL value of the column family decreases after an index is created, the index data will be deleted after user data is deleted. + +- The index query does not support the reverse operation, and the query results are disordered. + +- The index does not support the **clone snapshot** operation. + +- The index table must use HIndexWALPlayer to replay logs. WALPlayer cannot be used to replay logs. + + .. code-block:: + + hbase org.apache.hadoop.hbase.hindex.mapreduce.HIndexWALPlayer + Usage: WALPlayer [options] [] + Read all WAL entries for . + If no tables ("") are specific, all tables are imported. + (Careful, even -ROOT- and hbase:meta entries will be imported in that case.) + Otherwise is a comma separated list of tables. + + The WAL entries can be mapped to new set of tables via . + is a command separated list of targettables. + If specified, each table in must have a mapping. + + By default WALPlayer will load data directly into HBase. + To generate HFiles for a bulk data load instead, pass the option: + -Dwal.bulk.output=/path/for/output + (Only one table can be specified, and no mapping is allowed!) 
+ Other options: (specify time range to WAL edit to consider) + -Dwal.start.time=[date|ms] + -Dwal.end.time=[date|ms] + For performance also consider the following options: + -Dmapreduce.map.speculative=false + -Dmapreduce.reduce.speculative=false + +- When the **deleteall** command is executed for the index table, the performance is low. + +- The index table does not support HBCK. To use HBCK to repair the index table, delete the index data first. + +Multi-point Division +-------------------- + +When you create tables that are pre-divided by region in HBase, you may not know the data distribution trend so the division by region may be inappropriate. After the system runs for a period, regions need to be divided again to achieve better performance. Only empty regions can be divided. + +The region division function delivered with HBase divides regions only when they reach the threshold. This is called "single point division". + +To achieve better performance when regions are divided based on user requirements, multi-point division is developed, which is also called "dynamic division". That is, an empty region is pre-divided into multiple regions to prevent performance deterioration caused by insufficient region space. + + +.. figure:: /_static/images/en-us_image_0000001296270810.png + :alt: **Figure 2** Multi-point division + + **Figure 2** Multi-point division + +Connection Limitation +--------------------- + +Too many sessions mean that too many queries and MapReduce tasks are running on HBase, which compromises HBase performance and even causes service rejection. You can configure parameters to limit the maximum number of sessions that can be established between the client and the HBase server to achieve HBase overload protection. + +Improved Disaster Recovery +-------------------------- + +The disaster recovery (DR) capabilities between the active and standby clusters can enhance HA of the HBase data. The active cluster provides data services and the standby cluster backs up data. If the active cluster is faulty, the standby cluster takes over data services. Compared with the open source replication function, this function is enhanced as follows: + +#. The standby cluster whitelist function is only applicable to pushing data to a specified cluster IP address. +#. In the open source version, replication is synchronized based on WAL, and data backup is implemented by replaying WAL in the standby cluster. For BulkLoad operations, since no WAL is generated, data will not be replicated to the standby cluster. By recording BulkLoad operations on the WAL and synchronizing them to the standby cluster, the standby cluster can read BulkLoad operation records through WAL and load HFile in the active cluster to the standby cluster to implement data backup. +#. In the open source version, HBase filters ACLs. Therefore, ACL information will not be synchronized to the standby cluster. By adding a filter (**org.apache.hadoop.hbase.replication.SystemTableWALEntryFilterAllowACL**), ACL information can be synchronized to the standby cluster. You can configure **hbase.replication.filter.sytemWALEntryFilter** to enable the filter and implement ACL synchronization. +#. As for read-only restriction of the standby cluster, only super users within the standby cluster can modify the HBase of the standby cluster. In other words, HBase clients outside the standby cluster can only read the HBase of the standby cluster. 
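As an illustration of item 3 above, enabling ACL synchronization amounts to pointing the **hbase.replication.filter.sytemWALEntryFilter** property at the filter class named in this section. The snippet below is only a sketch of that property-value pairing; the exact file and cluster where it is set (for example, **hbase-site.xml** on the cluster whose WAL is being replicated) should be confirmed against the MRS configuration guide:

.. code-block::

   <!-- Illustrative only: registers the WAL entry filter that allows ACL
        entries to be replicated to the standby cluster (see item 3 above). -->
   <property>
     <name>hbase.replication.filter.sytemWALEntryFilter</name>
     <value>org.apache.hadoop.hbase.replication.SystemTableWALEntryFilterAllowACL</value>
   </property>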
+ +HBase MOB +--------- + +In the actual application scenarios, data in various sizes needs to be stored, for example, image data and documents. Data whose size is smaller than 10 MB can be stored in HBase. HBase can yield the best read-and-write performance for data whose size is smaller than 100 KB. If the size of data stored in HBase is greater than 100 KB or even reaches 10 MB and the same number of data files are inserted, the total data amount is large, causing frequent compaction and split, high CPU consumption, high disk I/O frequency, and low performance. + +MOB data (whose size ranges from 100 KB to 10 MB) is stored in a file system (for example, HDFS) in HFile format. The expiredMobFileCleaner and Sweeper tools are used to manage HFiles and save the address and size information about the HFiles to the store of HBase as values. This greatly decreases the compaction and split frequency in HBase and improves performance. + +As shown in :ref:`Figure 3 `, MOB indicates mobstore stored on HRegion. Mobstore stores keys and values. Wherein, a key is the corresponding key in HBase, and a value is the reference address and data offset stored in the file system. When reading data, mobstore uses its own scanner to read key-value data objects and uses the address and data size information in the value to obtain target data from the file system. + +.. _mrs_08_00104__f230cbe9084ca4d608b9af5f36a6cbfed: + +.. figure:: /_static/images/en-us_image_0000001296590634.png + :alt: **Figure 3** MOB data storage principle + + **Figure 3** MOB data storage principle + +HFS +--- + +HBase FileStream (HFS) is an independent HBase file storage module. It is used in MRS upper-layer applications by encapsulating HBase and HDFS interfaces to provide these upper-layer applications with functions such as file storage, read, and deletion. + +In the Hadoop ecosystem, the HDFS and HBase face tough problems in mass file storage in some scenarios: + +- If a large number of small files are stored in HDFS, the NameNode will be under great pressure. +- Some large files cannot be directly stored on HBase due to HBase APIs and internal mechanisms. + +HFS is developed for the mixed storage of massive small files and some large files in Hadoop. Simply speaking, massive small files (smaller than 10 MB) and some large files (greater than 10 MB) need to be stored in HBase tables. + +For such a scenario, HFS provides unified operation APIs similar to HBase function APIs. + +Multiple RegionServers Deployed on the Same Server +-------------------------------------------------- + +Multiple RegionServers can be deployed on one node to improve HBase resource utilization. + +If only one RegionServer is deployed, resource utilization is low due to the following reasons: + +#. A RegionServer supports a limited number of regions, and therefore memory and CPU resources cannot be fully used. +#. A single RegionServer supports a maximum of 20 TB data, of which two copies require 40 TB, and three copies require 60 TB. In this case, 96 TB capacity cannot be used up. +#. Poor write performance: One RegionServer is deployed on a physical server, and only one HLog exists. Only three disks can be written at the same time. + +The HBase resource utilization can be improved when multiple RegionServers are deployed on the same server. + +#. A physical server can be configured with a maximum of five RegionServers. The number of RegionServers deployed on each physical server can be configured as required. +#. 
Resources such as memory, disks, and CPUs can be fully used. +#. A physical server supports a maximum of five HLogs and allows data to be written to 15 disks at the same time, significantly improving write performance. + + +.. figure:: /_static/images/en-us_image_0000001349190349.png + :alt: **Figure 4** Improved HBase resource utilization + + **Figure 4** Improved HBase resource utilization + +HBase Dual-Read +--------------- + +In the HBase storage scenario, it is difficult to ensure 99.9% query stability due to GC, network jitter, and bad sectors of disks. The HBase dual-read feature is added to meet the requirements of low glitches during large-data-volume random read. + +The HBase dual-read feature is based on the DR capability of the active and standby clusters. The probability that the two clusters generate glitches at the same time is far less than that of one cluster. The dual-cluster concurrent access mode is used to ensure query stability. When a user initiates a query request, the HBase service of the two clusters is queried at the same time. If the active cluster does not return any result after a period of time (the maximum tolerable glitch time), the data of the cluster with the fastest response can be used. The following figure shows the working principle. + +|image1| + +.. |image1| image:: /_static/images/en-us_image_0000001296750250.png diff --git a/umn/source/overview/components/hbase/hbase_ha_solution.rst b/umn/source/overview/components/hbase/hbase_ha_solution.rst new file mode 100644 index 0000000..c673cbe --- /dev/null +++ b/umn/source/overview/components/hbase/hbase_ha_solution.rst @@ -0,0 +1,25 @@ +:original_name: mrs_08_00102.html + +.. _mrs_08_00102: + +HBase HA Solution +================= + +HBase HA +-------- + +HMaster in HBase allocates Regions. When one RegionServer service is stopped, HMaster migrates the corresponding Region to another RegionServer. The HMaster HA feature is brought in to prevent HBase functions from being affected by the HMaster single point of failure (SPOF). + + +.. figure:: /_static/images/en-us_image_0000001349190369.png + :alt: **Figure 1** HMaster HA implementation architecture + + **Figure 1** HMaster HA implementation architecture + +The HMaster HA architecture is implemented by creating the ephemeral ZooKeeper node in a ZooKeeper cluster. + +Upon startup, HMaster nodes try to create a master znode in the ZooKeeper cluster. The HMaster node that creates the master znode first becomes the active HMaster, and the other is the standby HMaster. + +It will add watch events to the master node. If the service on the active HMaster is stopped, the active HMaster disconnects from the ZooKeeper cluster. After the session expires, the active HMaster disappears. The standby HMaster detects the disappearance of the active HMaster through watch events and creates a master node to make itself be the active one. Then, the active/standby switchover completes. If the failed node detects existence of the master node after being restarted, it enters the standby state and adds watch events to the master node. + +When the client accesses the HBase, it first obtains the HMaster's address based on the master node information on the ZooKeeper and then establishes a connection to the active HMaster. 
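From an application's point of view, this means that a client is configured only with the ZooKeeper quorum; it never hard-codes the HMaster location, so an HMaster switchover is transparent to it. The following is a minimal sketch using the standard HBase Java client API (the ZooKeeper host names and the table, column family, and qualifier names are hypothetical):

.. code-block::

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.hbase.HBaseConfiguration;
   import org.apache.hadoop.hbase.TableName;
   import org.apache.hadoop.hbase.client.Connection;
   import org.apache.hadoop.hbase.client.ConnectionFactory;
   import org.apache.hadoop.hbase.client.Get;
   import org.apache.hadoop.hbase.client.Result;
   import org.apache.hadoop.hbase.client.Table;
   import org.apache.hadoop.hbase.util.Bytes;

   public class HBaseClientSketch {
       public static void main(String[] args) throws Exception {
           // The client only knows the ZooKeeper quorum; the address of the currently
           // active HMaster and the region locations are looked up in ZooKeeper,
           // as described above.
           Configuration conf = HBaseConfiguration.create();
           conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");            // hypothetical hosts
           conf.set("hbase.zookeeper.property.clientPort", "2181");

           try (Connection connection = ConnectionFactory.createConnection(conf);
                Table table = connection.getTable(TableName.valueOf("demo_table"))) {
               Result result = table.get(new Get(Bytes.toBytes("row001")));
               System.out.println("Value: " + Bytes.toString(
                       result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("q1"))));
           }
       }
   }

Read and write requests in such a session go directly to RegionServers and do not involve HMaster, so they continue to work during a switchover; administrative operations are served by the new active HMaster after the switchover described above completes.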
diff --git a/umn/source/overview/components/hbase/index.rst b/umn/source/overview/components/hbase/index.rst new file mode 100644 index 0000000..c95f97c --- /dev/null +++ b/umn/source/overview/components/hbase/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_08_0010.html + +.. _mrs_08_0010: + +HBase +===== + +- :ref:`HBase Basic Principles ` +- :ref:`HBase HA Solution ` +- :ref:`Relationship with Other Components ` +- :ref:`HBase Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + hbase_basic_principles + hbase_ha_solution + relationship_with_other_components + hbase_enhanced_open_source_features diff --git a/umn/source/overview/components/hbase/relationship_with_other_components.rst b/umn/source/overview/components/hbase/relationship_with_other_components.rst new file mode 100644 index 0000000..10a9a13 --- /dev/null +++ b/umn/source/overview/components/hbase/relationship_with_other_components.rst @@ -0,0 +1,27 @@ +:original_name: mrs_08_00103.html + +.. _mrs_08_00103: + +Relationship with Other Components +================================== + +Relationship Between HDFS and HBase +----------------------------------- + +HDFS is a subproject of Apache Hadoop. HBase uses the Hadoop Distributed File System (HDFS) as its file storage system and sits in the structured storage layer. HDFS provides highly reliable underlying storage for HBase. All HBase data files can be stored in HDFS, except some log files generated by HBase. + +Relationship Between ZooKeeper and HBase +---------------------------------------- + +:ref:`Figure 1 ` describes the relationship between ZooKeeper and HBase. + +.. _mrs_08_00103__fdde9276ccf0c408182da95f53612b7f9: + +.. figure:: /_static/images/en-us_image_0000001349110589.png + :alt: **Figure 1** Relationship between ZooKeeper and HBase + + **Figure 1** Relationship between ZooKeeper and HBase + +#. Each HRegionServer registers itself with ZooKeeper as an ephemeral node. ZooKeeper stores HBase information, including the HBase metadata and HMaster addresses. +#. HMaster monitors the health status of each HRegionServer through ZooKeeper. +#. HBase can deploy multiple HMasters (similar to HDFS NameNodes). When the active HMaster node is faulty, the standby HMaster node obtains the state information of the entire cluster from ZooKeeper, so ZooKeeper enables HBase to avoid a single point of failure. diff --git a/umn/source/overview/components/hdfs/hdfs_basic_principles.rst b/umn/source/overview/components/hdfs/hdfs_basic_principles.rst new file mode 100644 index 0000000..41cd298 --- /dev/null +++ b/umn/source/overview/components/hdfs/hdfs_basic_principles.rst @@ -0,0 +1,90 @@ +:original_name: mrs_08_00071.html + +.. _mrs_08_00071: + +HDFS Basic Principles +===================== + +Hadoop Distributed File System (HDFS) implements reliable and distributed read/write of massive amounts of data. HDFS is applicable to scenarios where data is written once and read many times. Writes are sequential: a file is written when it is created or appended to at the end of an existing file. HDFS ensures that only one caller can write to a file at a time, while multiple callers can read the file concurrently. + +Architecture +------------ + +HDFS consists of active and standby NameNodes and multiple DataNodes, as shown in :ref:`Figure 1 `. + +HDFS works in a master/slave architecture. 
NameNodes run on the master (active) node, and DataNodes run on the slave (standby) node. ZKFC should run along with the NameNodes. + +The communication between NameNodes and DataNodes is based on Transmission Control Protocol (TCP)/Internet Protocol (IP). The NameNode, DataNode, ZKFC, and JournalNode can be deployed on Linux servers. + +.. _mrs_08_00071__fig1245232216814: + +.. figure:: /_static/images/en-us_image_0000001349110533.png + :alt: **Figure 1** HA HDFS architecture + + **Figure 1** HA HDFS architecture + +:ref:`Table 1 ` describes the functions of each module shown in :ref:`Figure 1 `. + +.. _mrs_08_00071__table144529223812: + +.. table:: **Table 1** Module description + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Description | + +===================================+========================================================================================================================================================================================================================================================================================+ + | NameNode | A NameNode is used to manage the namespace, directory structure, and metadata information of a file system and provide the backup mechanism. The NameNode is classified into the following two types: | + | | | + | | - Active NameNode: manages the namespace, maintains the directory structure and metadata of file systems, and records the mapping relationships between data blocks and files to which the data blocks belong. | + | | - Standby NameNode: synchronizes with the data in the active NameNode, and takes over services from the active NameNode when the active NameNode is faulty. | + | | - Observer NameNode: synchronizes with the data in the active NameNode, and processes read requests from the client. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | DataNode | A DataNode is used to store data blocks of each file and periodically report the storage status to the NameNode. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | JournalNode | In HA cluster, synchronizes metadata between the active and standby NameNodes. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ZKFC | ZKFC must be deployed for each NameNode. It monitors NameNode status and writes status information to ZooKeeper. ZKFC also has permissions to select the active NameNode. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ZK Cluster | ZooKeeper is a coordination service which helps the ZKFC to elect the active NameNode. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | HttpFS gateway | HttpFS is a single stateless gateway process which provides the WebHDFS REST API for external processes and FileSystem API for the HDFS. HttpFS is used for data transmission between different versions of Hadoop. It is also used as a gateway to access the HDFS behind a firewall. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **HDFS HA Architecture** + + HA is used to resolve the SPOF problem of NameNode. This feature provides a standby NameNode for the active NameNode. When the active NameNode is faulty, the standby NameNode can quickly take over to continuously provide services for external systems. + + In a typical HDFS HA scenario, there are usually two NameNodes. One is in the active state, and the other in the standby state. + + A shared storage system is required to support metadata synchronization of the active and standby NameNodes. This version provides Quorum Journal Manager (QJM) HA solution, as shown in :ref:`Figure 2 `. A group of JournalNodes are used to synchronize metadata between the active and standby NameNodes. + + Generally, an odd number (2N+1) of JournalNodes are configured, and at least three JournalNodes are required. For one metadata update message, data writing is considered successful as long as data writing is successful on N+1 JournalNodes. In this case, data writing failure of a maximum of N JournalNodes is allowed. For example, when there are three JournalNodes, data writing failure of one JournalNode is allowed; when there are five JournalNodes, data writing failure of two JournalNodes is allowed. + + JournalNode is a lightweight daemon process and shares a host with other services of Hadoop. It is recommended that the JournalNode be deployed on the control node to prevent data writing failure on the JournalNode during massive data transmission. + + .. _mrs_08_00071__fig1517714182104: + + .. figure:: /_static/images/en-us_image_0000001296590686.png + :alt: **Figure 2** QJM-based HDFS architecture + + **Figure 2** QJM-based HDFS architecture + +Principle +--------- + +MRS uses the HDFS copy mechanism to ensure data reliability. One backup file is automatically generated for each file saved in HDFS, that is, two copies are generated in total. The number of HDFS copies can be queried using the **dfs.replication** parameter. + +- When the Core node specification of the MRS cluster is set to non-local hard disk drive (HDD) and the cluster has only one Core node, the default number of HDFS copies is 1. 
If the number of Core nodes in the cluster is greater than or equal to 2, the default number of HDFS copies is 2. +- When the Core node specification of the MRS cluster is set to local disk and the cluster has only one Core node, the default number of HDFS copies is 1. If there are two Core nodes in the cluster, the default number of HDFS copies is 2. If the number of Core nodes in the cluster is greater than or equal to 3, the default number of HDFS copies is 3. + + +.. figure:: /_static/images/en-us_image_0000001296430834.png + :alt: **Figure 3** HDFS architecture + + **Figure 3** HDFS architecture + +The HDFS component of MRS supports the following features: + +- Supports erasure code, reducing data redundancy to 50% and improving reliability. In addition, the striped block storage structure is introduced to maximize the use of the capability of a single node and multiple disks in an existing cluster. After the coding process is introduced, the data write performance is improved, and the performance is close to that with the multi-copy redundancy. +- Supports balanced node scheduling on HDFS and balanced disk scheduling on a single node, improving HDFS storage performance after node or disk scale-out. + +For details about the Hadoop architecture and principles, see `https://hadoop.apache.org/ `__. diff --git a/umn/source/overview/components/hdfs/hdfs_enhanced_open_source_features.rst b/umn/source/overview/components/hdfs/hdfs_enhanced_open_source_features.rst new file mode 100644 index 0000000..a4b1a3c --- /dev/null +++ b/umn/source/overview/components/hdfs/hdfs_enhanced_open_source_features.rst @@ -0,0 +1,160 @@ +:original_name: mrs_08_00074.html + +.. _mrs_08_00074: + +HDFS Enhanced Open Source Features +================================== + +Enhanced Open Source Feature: File Block Colocation +--------------------------------------------------- + +In the offline data summary and statistics scenario, Join is a frequently used computing function, and is implemented in MapReduce as follows: + +#. The Map task processes the records in the two table files into Join Key and Value, performs hash partitioning by Join Key, and sends the data to different Reduce tasks for processing. + +#. Reduce tasks read data in the left table recursively in the nested loop mode and traverse each line of the right table. If join key values are identical, join results are output. + + The preceding method sharply reduces the performance of the join calculation. Because a large amount of network data transfer is required when the data stored in different nodes is sent from MAP to Reduce, as shown in :ref:`Figure 1 `. + +.. _mrs_08_00074__f05a707e492b34d8d81a4e5da7c75f85a: + +.. figure:: /_static/images/en-us_image_0000001296270802.png + :alt: **Figure 1** Data transmission in the non-colocation scenario + + **Figure 1** Data transmission in the non-colocation scenario + +Data tables are stored in physical file system by HDFS block. Therefore, if two to-be-joined blocks are put into the same host accordingly after they are partitioned by join key, you can obtain the results directly from Map join in the local node without any data transfer in the Reduce process of the join calculation. This will greatly improve the performance. + +With the identical distribution feature of HDFS data, a same distribution ID is allocated to files, FileA and FileB, on which association and summation calculations need to be performed. 
In this way, all the blocks are distributed together, and calculation can be performed without retrieving data across nodes, which greatly improves the MapReduce join performance. + + +.. figure:: /_static/images/en-us_image_0000001296270806.png + :alt: **Figure 2** Data block distribution in colocation and non-colocation scenarios + + **Figure 2** Data block distribution in colocation and non-colocation scenarios + +Enhanced Open Source Feature: Damaged Hard Disk Volume Configuration +-------------------------------------------------------------------- + +In the open source version, if multiple data storage volumes are configured for a DataNode, the DataNode stops providing services by default if one of the volumes is damaged. If the configuration item **dfs.datanode.failed.volumes.tolerated** is set to specify the number of damaged volumes that are allowed, DataNode continues to provide services when the number of damaged volumes does not exceed the threshold. + +The value of **dfs.datanode.failed.volumes.tolerated** ranges from -1 to the number of disk volumes configured on the DataNode. The default value is **-1**, as shown in :ref:`Figure 3 `. + +.. _mrs_08_00074__f1c5d1a35b09d4165ab7166b6b4017be6: + +.. figure:: /_static/images/en-us_image_0000001296590626.png + :alt: **Figure 3** Item being set to 0 + + **Figure 3** Item being set to 0 + +For example, three data storage volumes are mounted to a DataNode, and **dfs.datanode.failed.volumes.tolerated** is set to 1. In this case, if one data storage volume of the DataNode is unavailable, this DataNode can still provide services, as shown in :ref:`Figure 4 `. + +.. _mrs_08_00074__f0099868cda664d66ab26565747692993: + +.. figure:: /_static/images/en-us_image_0000001349110469.png + :alt: **Figure 4** Item being set to 1 + + **Figure 4** Item being set to 1 + +This native configuration item has some defects. When the number of data storage volumes in each DataNode is inconsistent, you need to configure each DataNode independently instead of generating the unified configuration file for all nodes. + +Assume that there are three DataNodes in a cluster. The first node has three data directories, the second node has four, and the third node has five. If you want to ensure that DataNode services are available when only one data directory is available, you need to perform the configuration as shown in :ref:`Figure 5 `. + +.. _mrs_08_00074__f1b00e531fed543d9b6d7cb0ab4580247: + +.. figure:: /_static/images/en-us_image_0000001296430774.jpg + :alt: **Figure 5** Attribute configuration before being enhanced + + **Figure 5** Attribute configuration before being enhanced + +In self-developed enhanced HDFS, this configuration item is enhanced, with a value **-1** added. When this configuration item is set to **-1**, all DataNodes can provide services as long as one data storage volume in all DataNodes is available. + +To resolve the problem in the preceding example, set this configuration to **-1**, as shown in :ref:`Figure 6 `. + +.. _mrs_08_00074__fbe7d5e2177504d649264458d52f4d090: + +.. figure:: /_static/images/en-us_image_0000001349190337.jpg + :alt: **Figure 6** Attribute configuration after being enhanced + + **Figure 6** Attribute configuration after being enhanced + +Enhanced Open Source Feature: HDFS Startup Acceleration +------------------------------------------------------- + +In HDFS, when NameNodes start, the metadata file FsImage needs to be loaded. Then, DataNodes will report the data block information after the DataNodes startup. 
When the data block information reported by DataNodes reaches the preset percentage, NameNodes exits safe mode to complete the startup process. If the number of files stored on the HDFS reaches the million or billion level, the two processes are time-consuming and will lead to a long startup time of the NameNode. Therefore, this version optimizes the process of loading metadata file FsImage. + +In the open source HDFS, FsImage stores all types of metadata information. Each type of metadata information (such as file metadata information and folder metadata information) is stored in a section block, respectively. These section blocks are loaded in serial mode during startup. If a large number of files and folders are stored on the HDFS, loading of the two sections is time-consuming, prolonging the HDFS startup time. HDFS NameNode divides each type of metadata by segments and stores the data in multiple sections when generating the FsImage files. When the NameNodes start, sections are loaded in parallel mode. This accelerates the HDFS startup. + +Enhanced Open Source Feature: Label-based Block Placement Policies (HDFS Nodelabel) +----------------------------------------------------------------------------------- + +You need to configure the nodes for storing HDFS file data blocks based on data features. You can configure a label expression to an HDFS directory or file and assign one or more labels to a DataNode so that file data blocks can be stored on specified DataNodes. If the label-based data block placement policy is used for selecting DataNodes to store the specified files, the DataNode range is specified based on the label expression. Then proper nodes are selected from the specified range. + +- You can store the replicas of data blocks to the nodes with different labels accordingly. For example, store two replicas of the data block to the node labeled with L1, and store other replicas of the data block to the nodes labeled with L2. +- You can set the policy in case of block placement failure, for example, select a node from all nodes randomly. + +:ref:`Figure 7 ` gives an example: + +- Data in **/HBase** is stored in A, B, and D. +- Data in **/Spark** is stored in A, B, D, E, and F. +- Data in **/user** is stored in C, D, and F. +- Data in **/user/shl** is stored in A, E, and F. + +.. _mrs_08_00074__fa088a82d2b5041b29cfd31b83b1bb6fc: + +.. figure:: /_static/images/en-us_image_0000001296590622.png + :alt: **Figure 7** Example of label-based block placement policy + + **Figure 7** Example of label-based block placement policy + +Enhanced Open Source Feature: HDFS Load Balance +----------------------------------------------- + +The current read and write policies of HDFS are mainly for local optimization without considering the actual load of nodes or disks. Based on I/O loads of different nodes, the load balance of HDFS ensures that when read and write operations are performed on the HDFS client, the node with low I/O load is selected to perform such operations to balance I/O load and fully utilize the overall throughput of the cluster. + +If HDFS Load Balance is enabled during file writing, the NameNode selects a DataNode (in the order of local node, local rack, and remote rack). If the I/O load of the selected node is heavy, the NameNode will choose another DataNode with lighter load. + +If HDFS Load Balance is enabled during file reading, an HDFS client sends a request to the NameNode to provide the list of DataNodes that store the block to be read. 
The NameNode returns a list of DataNodes sorted by distance in the network topology. With the HDFS Load Balance feature, the DataNodes on the list are also sorted by their I/O load. The DataNodes with heavy load are at the bottom of the list. + +Enhanced Open Source Feature: HDFS Auto Data Movement +------------------------------------------------------ + +Hadoop has long been used for batch processing of immense amounts of data. The existing HDFS model fits the needs of batch processing applications very well because such applications focus more on throughput than on latency. + +However, as Hadoop is increasingly used for upper-layer applications that demand frequent random I/O access, such as Hive and HBase, low-latency disks such as solid-state drives (SSDs) are favored in delay-sensitive scenarios. To cater to this trend, HDFS supports a variety of storage types. Users can choose a storage type according to their needs. + +Storage policies vary depending on how frequently data is used. For example, data that is frequently accessed in HDFS can be marked as **ALL_SSD** or **HOT**, data that is accessed occasionally can be marked as **WARM**, and data that is rarely accessed (only once or twice) can be marked as **COLD**. You can select different data storage policies based on the data access frequency. + +|image1| + +However, low-latency disks are far more expensive than spinning disks. Data typically sees heavy initial usage that declines over time. Therefore, it is useful to move data that is no longer used from expensive disks to cheaper storage media. + +A typical example is storage of detail records. New detail records are imported into SSDs because they are frequently queried by upper-layer applications. As access frequency to these detail records declines, they are moved to cheaper storage. + +Before automatic data movement is achieved, you have to manually determine by service type whether data is frequently used, manually set a data storage policy, and manually trigger the HDFS Auto Data Movement Tool, as shown in the figure below. + +|image2| + +If aged data can be automatically identified and moved to cheaper storage (such as disk/archive), you will see significant cost reductions and improved data management efficiency. + +The HDFS Auto Data Movement Tool is at the core of HDFS Auto Data Movement. It automatically sets a storage policy depending on how frequently data is used. Specifically, the HDFS Auto Data Movement Tool can: + +- Mark a data storage policy as **All_SSD**, **One_SSD**, **Hot**, **Warm**, **Cold**, or **FROZEN** according to age, access time, and manual data movement rules. + +- Define rules for distinguishing cold and hot data based on the data age, access time, and manual migration rules. + +- Define the action to be taken if age-based rules are met. The supported actions are as follows: + + - **MARK**: the action for identifying whether data is frequently or rarely used based on the age rules and setting a data storage policy. + - **MOVE**: the action for invoking the HDFS Auto Data Movement Tool and moving data across tiers. + - **SET_REPL**: the action for setting a new replica quantity for a file.
+ - **MOVE_TO_FOLDER**: the action for moving files to a target folder. + - **DELETE**: the action for deleting a file or directory. + - **SET_NODE_LABEL**: the action for setting node labels of a file. + +With the HDFS Auto Data Movement feature, you only need to define age-based rules according to the data access time. The HDFS Auto Data Movement Tool matches data against the age-based rules, sets storage policies, and moves data. In this way, data management efficiency and cluster resource efficiency are improved. + +.. |image1| image:: /_static/images/en-us_image_0000001349309925.png +.. |image2| image:: /_static/images/en-us_image_0000001349309929.png diff --git a/umn/source/overview/components/hdfs/hdfs_ha_solution.rst b/umn/source/overview/components/hdfs/hdfs_ha_solution.rst new file mode 100644 index 0000000..609219f --- /dev/null +++ b/umn/source/overview/components/hdfs/hdfs_ha_solution.rst @@ -0,0 +1,38 @@ +:original_name: mrs_08_00072.html + +.. _mrs_08_00072: + +HDFS HA Solution +================ + +HDFS HA Background +------------------ + +In versions earlier than Hadoop 2.0.0, a single point of failure (SPOF) exists in the HDFS cluster. Each cluster has only one NameNode. If the host where the NameNode is located is faulty, the HDFS cluster cannot be used unless the NameNode is restarted or started on another host. This affects the overall availability of HDFS in the following aspects: + +#. In the case of an unplanned event such as a host breakdown, the cluster is unavailable until the NameNode is restarted. +#. Planned maintenance tasks, such as software and hardware upgrades, cause the cluster to stop working. + +To solve the preceding problems, the HDFS HA solution provides a hot standby NameNode for the NameNode in a cluster, with failover performed in automatic or manual (configurable) mode. When a machine fails (for example, due to a hardware fault), the active/standby NameNodes switch over automatically within a short time. When the active NameNode needs to be maintained, the administrator can manually perform an active/standby NameNode switchover to ensure cluster availability during maintenance. For details about the automatic failover of HDFS, see `https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Automatic_Failover `__. + +HDFS HA Implementation +---------------------- + +.. _mrs_08_00072__f5ec3ec739f8e4c9585be36835952f04b: + +.. figure:: /_static/images/en-us_image_0000001349110445.jpg + :alt: **Figure 1** Typical HA deployment + + **Figure 1** Typical HA deployment + +In a typical HA cluster (as shown in :ref:`Figure 1 `), two NameNodes need to be configured on two independent servers, respectively. At any time point, one NameNode is in the active state, and the other NameNode is in the standby state. The active NameNode is responsible for all client operations in the cluster, while the standby NameNode maintains synchronization with the active node to provide fast switchover if necessary. + +To keep their data synchronized, both nodes communicate with a group of JournalNodes. When the active node modifies any file system metadata, it stores the modification log on a majority of these JournalNodes. For example, if there are three JournalNodes, the log is saved on at least two of them. The standby node monitors the JournalNodes for changes and synchronizes the changes from the active node. Based on the modification log, the standby node applies the changes to the metadata of the local file system.
Once a switchover occurs, the standby node can ensure its status is the same as that of the active node. This ensures that the metadata of the file system is synchronized between the active and standby nodes if the switchover is incurred by the failure of the active node. + +To ensure fast switchover, the standby node needs to have the latest block information. Therefore, DataNodes send block information and heartbeat messages to two NameNodes at the same time. + +It is vital for an HA cluster that only one of the NameNodes be active at any time. Otherwise, the namespace state would split into two parts, risking data loss or other incorrect results. To prevent the so-called "split-brain scenario", the JournalNodes will only ever allow a single NameNode to write data to it at a time. During switchover, the NameNode which is to become active will take over the role of writing data to JournalNodes. This effectively prevents the other NameNodes from being in the active state, allowing the new active node to safely proceed with switchover. + +For more information about the HDFS HA solution, visit the following website: + +http://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html diff --git a/umn/source/overview/components/hdfs/index.rst b/umn/source/overview/components/hdfs/index.rst new file mode 100644 index 0000000..9ba1672 --- /dev/null +++ b/umn/source/overview/components/hdfs/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_08_0007.html + +.. _mrs_08_0007: + +HDFS +==== + +- :ref:`HDFS Basic Principles ` +- :ref:`HDFS HA Solution ` +- :ref:`Relationship Between HDFS and Other Components ` +- :ref:`HDFS Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + hdfs_basic_principles + hdfs_ha_solution + relationship_between_hdfs_and_other_components + hdfs_enhanced_open_source_features diff --git a/umn/source/overview/components/hdfs/relationship_between_hdfs_and_other_components.rst b/umn/source/overview/components/hdfs/relationship_between_hdfs_and_other_components.rst new file mode 100644 index 0000000..4b9d8cd --- /dev/null +++ b/umn/source/overview/components/hdfs/relationship_between_hdfs_and_other_components.rst @@ -0,0 +1,76 @@ +:original_name: mrs_08_00073.html + +.. _mrs_08_00073: + +Relationship Between HDFS and Other Components +============================================== + +Relationship Between HDFS and HBase +----------------------------------- + +HDFS is a subproject of Apache Hadoop, which is used as the file storage system for HBase. HBase is located in the structured storage layer. HDFS provides highly reliable support for lower-layer storage of HBase. All the data files of HBase can be stored in the HDFS, except some log files generated by HBase. + +Relationship Between HDFS and MapReduce +--------------------------------------- + +- HDFS features high fault tolerance and high throughput, and can be deployed on low-cost hardware for storing data of applications with massive data sets. +- MapReduce is a programming model used for parallel computation of large data sets (larger than 1 TB). Data computed by MapReduce comes from multiple data sources, such as Local FileSystem, HDFS, and databases. Most data comes from the HDFS. The high throughput of HDFS can be used to read massive data. After being computed, data can be stored in HDFS. + +Relationship Between HDFS and Spark +----------------------------------- + +Data computed by Spark comes from multiple data sources, such as local files and HDFS. 
Most data comes from HDFS which can read data in large scale for parallel computing. After being computed, data can be stored in HDFS. + +Spark involves Driver and Executor. Driver schedules tasks and Executor runs tasks. + +:ref:`Figure 1 ` shows how data is read from a file. + +.. _mrs_08_00073__f16fc0cc29e824ad59e843d9f305f6491: + +.. figure:: /_static/images/en-us_image_0000001349390697.png + :alt: **Figure 1** File reading process + + **Figure 1** File reading process + +The file reading process is as follows: + +#. Driver interconnects with HDFS to obtain the information of File A. +#. The HDFS returns the detailed block information about this file. +#. Driver sets a parallel degree based on the block data amount, and creates multiple tasks to read the blocks of this file. +#. Executor runs the tasks and reads the detailed blocks as part of the Resilient Distributed Dataset (RDD). + +:ref:`Figure 2 ` shows how data is written to a file. + +.. _mrs_08_00073__fa00b568f5217442e9e4fd0ffdc06c1e9: + +.. figure:: /_static/images/en-us_image_0000001296270866.png + :alt: **Figure 2** File writing process + + **Figure 2** File writing process + +The file writing process is as follows: + +#. .. _mrs_08_00073__ld9100663d06c4f45aff1426e23d00085: + + Driver creates a directory where the file is to be written. + +#. Based on the RDD distribution status, the number of tasks related to data writing is computed, and these tasks are sent to Executor. + +#. Executor runs these tasks, and writes the computed RDD data to the directory created in :ref:`1 `. + +Relationship Between HDFS and ZooKeeper +--------------------------------------- + +:ref:`Figure 3 ` shows the relationship between ZooKeeper and HDFS. + +.. _mrs_08_00073__f4a8277013bf1470083aa5e19083915d2: + +.. figure:: /_static/images/en-us_image_0000001349309985.png + :alt: **Figure 3** Relationship between ZooKeeper and HDFS + + **Figure 3** Relationship between ZooKeeper and HDFS + +As the client of a ZooKeeper cluster, ZKFailoverController (ZKFC) monitors the status of NameNode. ZKFC is deployed only in the node where NameNode resides, and in both the active and standby HDFS NameNodes. + +#. The ZKFC connects to ZooKeeper and saves information such as host names to ZooKeeper under the znode directory **/hadoop-ha**. NameNode that creates the directory first is considered as the active node, and the other is the standby node. NameNodes read the NameNode information periodically through ZooKeeper. +#. When the process of the active node ends abnormally, the standby NameNode detects changes in the **/hadoop-ha** directory through ZooKeeper, and then takes over the service of the active NameNode. diff --git a/umn/source/overview/components/hetuengine/hetuengine_product_overview.rst b/umn/source/overview/components/hetuengine/hetuengine_product_overview.rst new file mode 100644 index 0000000..8d2c74e --- /dev/null +++ b/umn/source/overview/components/hetuengine/hetuengine_product_overview.rst @@ -0,0 +1,48 @@ +:original_name: mrs_08_00681.html + +.. _mrs_08_00681: + +HetuEngine Product Overview +=========================== + +This section applies only to MRS 3.1.2-LTS.3. + +HetuEngine Description +---------------------- + +HetuEngine is an in-house high-performance, interactive SQL analysis and data virtualization engine. 
It seamlessly integrates with the big data ecosystem to implement interactive query of massive amounts of data within seconds, and supports cross-source and cross-domain unified data access to enable one-stop SQL convergence analysis in the data lake, between lakes, and between lakehouses. + +HetuEngine Architecture +----------------------- + +HetuEngine consists of different modules. :ref:`Figure 1 ` shows the architecture. + +.. _mrs_08_00681__fig1249037338: + +.. figure:: /_static/images/en-us_image_0000001440400425.png + :alt: **Figure 1** HetuEngine architecture + + **Figure 1** HetuEngine architecture + +.. table:: **Table 1** Module description + + +---------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Concept | Description | + +=====================+=====================+==========================================================================================================================================================================+ + | Cloud service layer | HetuEngine CLI/JDBC | HetuEngine client, through which the query request is submitted and the results is returned and displayed. | + +---------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | HSBroker | Service management component of HetuEngine. It manages and verifies compute instances, monitors health status, and performs automatic maintenance. | + +---------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | HSConsole | Provides visualized operation GUIs and RESTful APIs for data source information management, compute instance management, and automatic task query. | + +---------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | HSFabric | Provides a unified SQL access entry to meet the requirements for high-performing and highly secure data transfer across domains (data centers). | + +---------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Engine layer | Coordinator | Management node of HetuEngine compute instances. It receives and parses SQL statements, generates and optimizes execution plans, assigns tasks, and schedules resources. | + +---------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Worker | Work node of HetuEngine compute instances. It provides capabilities such as parallel data pulling from data sources and distributed SQL computing. 
| + +---------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +HetuEngine Application Scenarios +-------------------------------- + +HetuEngine supports cross-source (multiple data sources, such as Hive, HBase, GaussDB(DWS), Elasticsearch, and ClickHouse) and cross-domain (multiple regions or data centers) quick joint query, especially for interactive quick query of Hive and Hudi data in the Hadoop cluster (MRS). diff --git a/umn/source/overview/components/hetuengine/index.rst b/umn/source/overview/components/hetuengine/index.rst new file mode 100644 index 0000000..58e71f5 --- /dev/null +++ b/umn/source/overview/components/hetuengine/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_08_0068.html + +.. _mrs_08_0068: + +HetuEngine +========== + +- :ref:`HetuEngine Product Overview ` +- :ref:`Relationship Between HetuEngine and Other Components ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + hetuengine_product_overview + relationship_between_hetuengine_and_other_components diff --git a/umn/source/overview/components/hetuengine/relationship_between_hetuengine_and_other_components.rst b/umn/source/overview/components/hetuengine/relationship_between_hetuengine_and_other_components.rst new file mode 100644 index 0000000..6273c91 --- /dev/null +++ b/umn/source/overview/components/hetuengine/relationship_between_hetuengine_and_other_components.rst @@ -0,0 +1,28 @@ +:original_name: mrs_08_00682.html + +.. _mrs_08_00682: + +Relationship Between HetuEngine and Other Components +==================================================== + +The HetuEngine installation depends on the MRS cluster. :ref:`Table 1 ` lists the components on which the HetuServer installation depends. + +.. _mrs_08_00682__en-us_topic_0254454590_table5889013405: + +.. table:: **Table 1** Components on which HetuEngine depends + + +-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | Description | + +===========+====================================================================================================================================================================+ + | HDFS | Hadoop Distributed File System, supporting high-throughput data access and suitable for applications with large-scale data sets. | + +-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Hive | Open-source data warehouse built on Hadoop. It stores structured data and implements basic data analysis using the Hive Query Language (HQL), a SQL-like language. | + +-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ZooKeeper | Enables highly reliable distributed coordination. It helps prevent single point of failures (SPOFs) and provides reliable services for applications. | + +-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | KrbServer | Key management center that distributes bills. 
| + +-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Yarn | Resource management system, which is a general resource module that manages and schedules resources for various applications. | + +-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | DBService | DBService is a high-availability relational database storage system that provides metadata backup and restoration functions. | + +-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/overview/components/hive/enhanced_open_source_feature.rst b/umn/source/overview/components/hive/enhanced_open_source_feature.rst new file mode 100644 index 0000000..0635fc9 --- /dev/null +++ b/umn/source/overview/components/hive/enhanced_open_source_feature.rst @@ -0,0 +1,68 @@ +:original_name: mrs_08_00114.html + +.. _mrs_08_00114: + +Enhanced Open Source Feature +============================ + +Enhanced Open Source Feature: HDFS Colocation +--------------------------------------------- + +HDFS Colocation is the data location control function provided by HDFS. The HDFS Colocation API stores associated data or data on which associated operations are performed on the same storage node. + +Hive supports HDFS Colocation. When Hive tables are created, after the locator information is set for table files, the data files of related tables are stored on the same storage node. This ensures convenient and efficient data computing among associated tables. + +Enhanced Open Source Feature: Column Encryption +----------------------------------------------- + +Hive supports encryption of one or more columns. The columns to be encrypted and the encryption algorithm can be specified when a Hive table is created. When data is inserted into the table using the INSERT statement, the related columns are encrypted. The Hive column encryption does not support views and the Hive over HBase scenario. + +The Hive column encryption mechanism supports two encryption algorithms that can be selected to meet site requirements during table creation: + +- AES (the encryption class is **org.apache.hadoop.hive.serde2.AESRewriter**) +- SMS4 (the encryption class is **org.apache.hadoop.hive.serde2.SMS4Rewriter**) + +Enhanced Open Source Feature: HBase Deletion +-------------------------------------------- + +Due to the limitations of underlying storage systems, Hive does not support the ability to delete a single piece of table data. In Hive on HBase, Hive in the MRS solution supports the ability to delete a single piece of HBase table data. Using a specific syntax, Hive can delete one or more pieces of data from an HBase table. + +Enhanced Open Source Feature: Row Delimiter +------------------------------------------- + +In most cases, a carriage return character is used as the row delimiter in Hive tables stored in text files, that is, the carriage return character is used as the terminator of a row during queries. + +However, some data files are delimited by special characters, and not a carriage return character. + +MRS Hive allows you to specify different characters or character combinations as row delimiters for Hive data in text files. 
+ +Enhanced Open Source Feature: HTTPS/HTTP-based REST API Switchover +------------------------------------------------------------------- + +WebHCat provides external REST APIs for Hive. By default, the open source community version uses the HTTP protocol. + +MRS Hive also supports the more secure HTTPS protocol and enables switchover between the HTTP and HTTPS protocols. + +Enhanced Open Source Feature: Transform Function +------------------------------------------------- + +The Transform function is not allowed by Hive of the open source version. MRS Hive supports the configuration of the Transform function. The function is disabled by default, which is the same as that of the open source community version. + +Users can modify the configuration of the Transform function to enable it. However, security risks exist when the Transform function is enabled. + +Enhanced Open Source Feature: Temporary Function Creation Without ADMIN Permission +------------------------------------------------------------------------------------ + +You must have the **ADMIN** permission when creating temporary functions in the Hive open source community version. MRS Hive allows you to configure whether the **ADMIN** permission is required for creating temporary functions. This configuration is disabled by default, which is the same as in the open-source community version. + +You can modify the configuration of this function. After it is enabled, you can create temporary functions without the **ADMIN** permission. + +Enhanced Open Source Feature: Database Authorization +------------------------------------------------------ + +In the Hive open source community version, only the database owner can create tables in the database. MRS Hive allows you to be granted the **CREATE** and **SELECT** permissions on tables in a database. After you are granted the permission to query data in a database, the query permission on all tables in that database is automatically associated. + +Enhanced Open Source Feature: Column Authorization +---------------------------------------------------- + +The Hive open source community version supports only table-level permission control. MRS Hive supports column-level permission control. You can be granted column-level permissions, such as **SELECT**, **INSERT**, and **UPDATE**. diff --git a/umn/source/overview/components/hive/hive_basic_principles.rst b/umn/source/overview/components/hive/hive_basic_principles.rst new file mode 100644 index 0000000..4ae17dd --- /dev/null +++ b/umn/source/overview/components/hive/hive_basic_principles.rst @@ -0,0 +1,85 @@ +:original_name: mrs_08_001101.html + +.. _mrs_08_001101: + +Hive Basic Principles +===================== + +`Hive `__ is a data warehouse infrastructure built on Hadoop. It provides a series of tools that can be used to extract, transform, and load (ETL) data. Hive is a mechanism that can store, query, and analyze mass data stored on Hadoop. Hive defines a simple SQL-like query language known as HiveQL, which allows a user familiar with SQL to query data. Hive data computing depends on MapReduce, Spark, and Tez. + +The new execution engine `Tez `__ is used to replace the original MapReduce, greatly improving performance. Tez can convert multiple dependent jobs into one job, so only one HDFS write is required and fewer transit nodes are needed, greatly improving the performance of DAG jobs.
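+
+As a simple illustration, the execution engine used for a query can be switched per session. In the following sketch, the table name **sales** and its columns are hypothetical; **hive.execution.engine** is a standard Hive parameter, and the engines actually available depend on the components installed in the cluster.
+
+.. code-block::
+
+   -- Run the query on the Tez engine.
+   SET hive.execution.engine=tez;
+   SELECT region, COUNT(*) AS cnt FROM sales GROUP BY region;
+
+   -- Switch back to MapReduce if required.
+   SET hive.execution.engine=mr;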
+ +Hive provides the following functions: + +- Analyzes massive structured data and summarizes analysis results. +- Allows complex MapReduce jobs to be compiled in SQL languages. +- Supports flexible data storage formats, including JavaScript object notation (JSON), comma separated values (CSV), TextFile, RCFile, SequenceFile, and ORC (Optimized Row Columnar). + +Hive system structure: + +- User interface: Three user interfaces are available, that is, CLI, Client, and WUI. CLI is the most frequently-used user interface. A Hive transcript is started when CLI is started. Client refers to a Hive client, and a client user connects to the Hive Server. When entering the client mode, you need to specify the node where the Hive Server resides and start the Hive Server on this node. The web UI is used to access Hive through a browser. MRS can access Hive only in client mode. +- Metadata storage: Hive stores metadata into databases, for example, MySQL and Derby. Metadata in Hive includes a table name, table columns and partitions and their properties, table properties (indicating whether a table is an external table), and the directory where table data is stored. + +Hive Framework +-------------- + +Hive is a single-instance service process that provides services by translating HQL into related MapReduce jobs or HDFS operations. :ref:`Figure 1 ` shows how Hive is connected to other components. + +.. _mrs_08_001101__fig47605446155228: + +.. figure:: /_static/images/en-us_image_0000001349390653.png + :alt: **Figure 1** Hive framework + + **Figure 1** Hive framework + +.. table:: **Table 1** Module description + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Description | + +===================================+======================================================================================================================================================================================================================================================+ + | HiveServer | Multiple HiveServers can be deployed in a cluster to share loads. HiveServer provides Hive database services externally, translates HQL statements into related YARN tasks or HDFS operations to complete data extraction, conversion, and analysis. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | MetaStore | - Multiple MetaStores can be deployed in a cluster to share loads. MetaStore provides Hive metadata services as well as reads, writes, maintains, and modifies the structure and properties of Hive tables. | + | | - MetaStore provides Thrift APIs for HiveServer, Spark, WebHCat, and other MetaStore clients to access and operate metadata. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | WebHCat | Multiple WebHCats can be deployed in a cluster to share loads. 
WebHCat provides REST APIs and runs the Hive commands through the REST APIs to submit MapReduce jobs. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Hive client | Hive client includes the human-machine command-line interface (CLI) Beeline, JDBC drive for JDBC applications, Python driver for Python applications, and HCatalog JAR files for MapReduce. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ZooKeeper cluster | As a temporary node, ZooKeeper records the IP address list of each HiveServer instance. The client driver connects to ZooKeeper to obtain the list and selects corresponding HiveServer instances based on the routing mechanism. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | HDFS/HBase cluster | The HDFS cluster stores the Hive table data. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | MapReduce/YARN cluster | Provides distributed computing services. Most Hive data operations rely on MapReduce. The main function of HiveServer is to translate HQL statements into MapReduce jobs to process massive data. | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +HCatalog is built on Hive Metastore and incorporates the DDL capability of Hive. HCatalog is also a Hadoop-based table and storage management layer that enables convenient data read/write on tables of HDFS by using different data processing tools such as Pig and MapReduce. Besides, HCatalog also provides read/write APIs for these tools and uses a Hive CLI to publish commands for defining data and querying metadata. After encapsulating these commands, WebHCat Server can provide RESTful APIs, as shown in :ref:`Figure 2 `. + +.. _mrs_08_001101__hive_f2: + +.. figure:: /_static/images/en-us_image_0000001296750254.png + :alt: **Figure 2** WebHCat logical architecture + + **Figure 2** WebHCat logical architecture + +Principles +---------- + +Hive functions as a data warehouse based on HDFS and MapReduce architecture and translates HQL statements into MapReduce jobs or HDFS operations. For details about Hive and HQL, see `HiveQL Language Manual `__. + +:ref:`Figure 3 ` shows the Hive structure. + +- **Metastore**: reads, writes, and updates metadata such as tables, columns, and partitions. Its lower layer is relational databases. +- **Driver**: manages the lifecycle of HiveQL execution and participates in the entire Hive job execution. 
+- **Compiler**: translates HQL statements into a series of interdependent Map or Reduce jobs. +- **Optimizer**: is classified into logical optimizer and physical optimizer to optimize HQL execution plans and MapReduce jobs, respectively. +- **Executor**: runs Map or Reduce jobs based on job dependencies. +- **ThriftServer**: functions as the servers of JDBC, provides Thrift APIs, and integrates with Hive and other applications. +- **Clients**: include the WebUI and JDBC APIs and provides APIs for user access. + +.. _mrs_08_001101__fig10440368155335: + +.. figure:: /_static/images/en-us_image_0000001296590638.png + :alt: **Figure 3** Hive framework + + **Figure 3** Hive framework diff --git a/umn/source/overview/components/hive/hive_cbo_principles.rst b/umn/source/overview/components/hive/hive_cbo_principles.rst new file mode 100644 index 0000000..8e5bc52 --- /dev/null +++ b/umn/source/overview/components/hive/hive_cbo_principles.rst @@ -0,0 +1,118 @@ +:original_name: mrs_08_00112.html + +.. _mrs_08_00112: + +Hive CBO Principles +=================== + + +Hive CBO Principles +------------------- + +CBO is short for Cost-Based Optimization. + +It will optimize the following: + +During compilation, the CBO calculates the most efficient join sequence based on tables and query conditions involved in query statements to reduce time and resources required for query. + +In Hive, the CBO is implemented as follows: + +Hive uses open-source component Apache Calcite to implement the CBO. SQL statements are first converted into Hive Abstract Syntax Trees (ASTs) and then into RelNodes that can be identified by Calcite. After Calcite adjusts the join sequence in RelNodes, RelNodes are converted into ASTs by Hive to continue the logical and physical optimization. :ref:`Figure 1 ` shows the working flow. + +.. _mrs_08_00112__fig567676115845: + +.. figure:: /_static/images/en-us_image_0000001349390693.png + :alt: **Figure 1** CBO Implementation process + + **Figure 1** CBO Implementation process + +Calcite adjusts the join sequence as follows: + +#. A table is selected as the first table from the tables to be joined. +#. The second and third tables are selected based on the cost. In this way, multiple different execution plans are obtained. +#. A plan with the minimum costs is calculated and serves as the final sequence. + +The cost calculation method is as follows: + +In the current version, costs are measured based on the number of data entries after joining. Fewer data entries mean less cost. The number of joined data entries depends on the selection rate of joined tables. The number of data entries in a table is obtained based on the table-level statistics. + +The number of data entries in a table after filtering is estimated based on the column-level statistics, including the maximum values (max), minimum values (min), and Number of Distinct Values (NDV). + +For example, there is a table **table_a** whose total number of data records is 1,000,000 and NDV is 50. The query conditions are as follows: + +.. code-block:: + + Select * from table_a where colum_a='value1'; + +The estimated number of queried data entries is: 1,000,000 x 1/50 = 20,000. The selection rate is 2%. + +The following takes the TPC-DS Q3 as an example to describe how the CBO adjusts the join sequence: + +.. 
code-block:: + + select + dt.d_year, + item.i_brand_id brand_id, + item.i_brand brand, + sum(ss_ext_sales_price) sum_agg + from + date_dim dt, + store_sales, + item + where + dt.d_date_sk = store_sales.ss_sold_date_sk + and store_sales.ss_item_sk = item.i_item_sk + and item.i_manufact_id = 436 + and dt.d_moy = 12 + group by dt.d_year , item.i_brand , item.i_brand_id + order by dt.d_year , sum_agg desc , brand_id + limit 10; + +Statement explanation: This statement indicates that inner join is performed for three tables: table **store_sales** is a fact table with about 2,900,000,000 data entries, table **date_dim** is a dimension table with about 73,000 data entries, and table **item** is a dimension table with about 18,000 data entries. Each table has filtering conditions. :ref:`Figure 2 ` shows the join relationship. + +.. _mrs_08_00112__fig27230352161945: + +.. figure:: /_static/images/en-us_image_0000001349309981.png + :alt: **Figure 2** Join relationship + + **Figure 2** Join relationship + +The CBO must first select the tables that bring the best filtering effect for joining. + +By analyzing min, max, NDV, and the number of data entries, the CBO estimates the selection rates of different dimension tables, as shown in :ref:`Table 1 `. + +.. _mrs_08_00112__table21230583162227: + +.. table:: **Table 1** Data filtering + + +----------+---------------------------------+----------------------------------------+----------------+ + | Table | Number of Original Data Entries | Number of Data Entries After Filtering | Selection Rate | + +==========+=================================+========================================+================+ + | date_dim | 73,000 | 6,200 | 8.5% | + +----------+---------------------------------+----------------------------------------+----------------+ + | item | 18,000 | 19 | 0.1% | + +----------+---------------------------------+----------------------------------------+----------------+ + +The selection rate can be estimated as follows: Selection rate = Number of data entries after filtering/Number of original data entries + +As shown in the preceding table, the **item** table has a better filtering effect. Therefore, the CBO joins the **item** table first before joining the **date_dim** table. + +:ref:`Figure 3 ` shows the join process when the CBO is disabled. + +.. _mrs_08_00112__fig46862216163921: + +.. figure:: /_static/images/en-us_image_0000001296590682.png + :alt: **Figure 3** Join process when the CBO is disabled + + **Figure 3** Join process when the CBO is disabled + +:ref:`Figure 4 ` shows the join process when the CBO is enabled. + +.. _mrs_08_00112__fig47832930164032: + +.. figure:: /_static/images/en-us_image_0000001349190397.png + :alt: **Figure 4** Join process when the CBO is enabled + + **Figure 4** Join process when the CBO is enabled + +After the CBO is enabled, the number of intermediate data entries is reduced from 495,000,000 to 2,900,000 and thus the execution time can be remarkably reduced. diff --git a/umn/source/overview/components/hive/index.rst b/umn/source/overview/components/hive/index.rst new file mode 100644 index 0000000..0316ebc --- /dev/null +++ b/umn/source/overview/components/hive/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_08_0011.html + +.. _mrs_08_0011: + +Hive +==== + +- :ref:`Hive Basic Principles ` +- :ref:`Hive CBO Principles ` +- :ref:`Relationship Between Hive and Other Components ` +- :ref:`Enhanced Open Source Feature ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + hive_basic_principles + hive_cbo_principles + relationship_between_hive_and_other_components + enhanced_open_source_feature diff --git a/umn/source/overview/components/hive/relationship_between_hive_and_other_components.rst b/umn/source/overview/components/hive/relationship_between_hive_and_other_components.rst new file mode 100644 index 0000000..3109336 --- /dev/null +++ b/umn/source/overview/components/hive/relationship_between_hive_and_other_components.rst @@ -0,0 +1,26 @@ +:original_name: mrs_08_001103.html + +.. _mrs_08_001103: + +Relationship Between Hive and Other Components +============================================== + +Relationship Between Hive and HDFS +---------------------------------- + +Hive is a sub-project of Apache Hadoop, which uses HDFS as the file storage system. It parses and processes structured data with highly reliable underlying storage supported by HDFS. All data files in the Hive database are stored in HDFS, and all data operations on Hive are also performed using HDFS APIs. + +Relationship Between Hive and MapReduce +--------------------------------------- + +Hive data computing depends on MapReduce. MapReduce is also a sub-project of Apache Hadoop and is a parallel computing framework based on HDFS. During data analysis, Hive parses HQL statements submitted by users into MapReduce tasks and submits the tasks for MapReduce to execute. + +Relationship Between Hive and Tez +--------------------------------- + +Tez, an open-source project of Apache, is a distributed computing framework that supports directed acyclic graphs (DAGs). When Hive uses the Tez engine to analyze data, it parses HQL statements submitted by users into Tez tasks and submits the tasks to Tez for execution. + +Relationship Between Hive and DBService +--------------------------------------- + +MetaStore (metadata service) of Hive processes the structure and attribute information of Hive metadata, such as Hive databases, tables, and partitions. The information needs to be stored in a relational database and is managed and processed by MetaStore. In the product, the metadata of Hive is stored and maintained by the DBService component, and the metadata service is provided by the Metadata component. diff --git a/umn/source/overview/components/hudi.rst b/umn/source/overview/components/hudi.rst new file mode 100644 index 0000000..df90ece --- /dev/null +++ b/umn/source/overview/components/hudi.rst @@ -0,0 +1,76 @@ +:original_name: mrs_08_0083.html + +.. _mrs_08_0083: + +Hudi +==== + +Hudi is the file organization layer of the data lake. It manages Parquet files, provides the data lake capability, and supports multiple compute engines. It also provides insert, update, and deletion (IUD) interfaces and streaming primitives for inserting, updating, and incremental pulling on HDFS datasets. + +.. note:: + + To use Hudi, ensure that the Spark2x service has been installed in the MRS cluster. + + +.. figure:: /_static/images/en-us_image_0000001349190321.png + :alt: **Figure 1** Basic architecture of Hudi + + **Figure 1** Basic architecture of Hudi + +Feature +------- + +- The ACID transaction capability supports real-time data import to the lake and batch data import to the data lake. +- Multiple view capabilities (read-optimized view/incremental view/real-time view) enable quick data analysis. +- Multi-version concurrency control (MVCC) design supports data version backtracking. 
+- Automatic management of file sizes and layouts optimizes query performance and provides quasi-real-time data for queries. +- Concurrent read and write are supported. Data can be read when being written based on snapshot isolation. +- Bootstrapping is supported to convert existing tables into Hudi datasets. + +Key Technologies and Advantages +------------------------------- + +- Pluggable index mechanism: Hudi provides multiple index mechanisms to quickly update and delete massive data. +- Ecosystem support: Hudi supports multiple data engines, including Hive, Spark, HetuEngine, and Flink. + +Two Types of Tables Supported by Hudi +------------------------------------- + +- Copy On Write + + Copy-on-write tables are also called COW tables. Parquet files are used to store data, and internal update operations need to be performed by rewriting the original Parquet files. + + - Advantage: It is efficient because only one data file in the corresponding partition needs to be read. + - Disadvantage: During data write, a previous copy needs to be copied and then a new data file is generated based on the previous copy. This process is time-consuming. Therefore, the data read by the read request lags behind. + +- Merge On Read + + Merge-on-read tables are also called MOR tables. The combination of columnar-based Parquet and row-based format Avro is used to store data. Parquet files are used to store base data, and Avro files (also called log files) are used to store incremental data. + + - Advantage: Data is written to the delta log first, and the delta log size is small. Therefore, the write cost is low. + - Disadvantage: Files need to be compacted periodically. Otherwise, there are a large number of fragment files. The read performance is poor because delta logs and old data files need to be merged. + +Hudi Supporting Three Types Of Views for Read Capabilities in Different Scenarios +--------------------------------------------------------------------------------- + +- Snapshot View + + Provides the latest snapshot data of the current Hudi table. That is, once the latest data is written to the Hudi table, the newly written data can be queried through this view. + + Both COW and MOR tables support this view capability. + +- Incremental View + + Provides the incremental query capability. The incremental data after a specified commit can be queried. This view can be used to quickly pull incremental data. + + COW tables support this view capability. MOR tables also support this view capability, but the incremental view capability disappears once the compact operation is performed. + +- Read Optimized View + + Provides only the data stored in the latest Parquet file. + + This view is different for COW and MOR tables. + + For COW tables, the view capability is the same as the real-time view capability. (COW tables use only Parquet files to store data.) + + For MOR tables, only base files are accessed, and the data in the given file slices since the last compact operation is provided. It can be simply understood that this view provides only the data stored in Parquet files of MOR tables, and the data in log files is ignored. The data provided by this view may not be the latest. However, once the compact operation is performed on MOR tables, the incremental log data is merged into the base data. In this case, this view has the same capability as the real-time view. 
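+
+The table type is chosen when a Hudi table is created. The following Spark SQL sketch creates a hypothetical MOR table on which the views described above can be queried; the table name and columns are examples, and setting **type** to **'cow'** would create a COW table instead. The exact options supported may vary with the Hudi and Spark versions shipped in the cluster.
+
+.. code-block::
+
+   CREATE TABLE hudi_mor_demo (
+     id INT,
+     name STRING,
+     price DOUBLE,
+     ts BIGINT
+   ) USING hudi
+   TBLPROPERTIES (
+     type = 'mor',
+     primaryKey = 'id',
+     preCombineField = 'ts'
+   );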
diff --git a/umn/source/overview/components/hue/hue_basic_principles.rst b/umn/source/overview/components/hue/hue_basic_principles.rst new file mode 100644 index 0000000..113dcb6 --- /dev/null +++ b/umn/source/overview/components/hue/hue_basic_principles.rst @@ -0,0 +1,72 @@ +:original_name: mrs_08_00121.html + +.. _mrs_08_00121: + +Hue Basic Principles +==================== + +Hue is a group of web applications that interact with MRS big data components. It helps you browse HDFS, perform Hive query, and start MapReduce jobs. Hue bears applications that interact with all MRS big data components. + +Hue provides the file browser and query editor functions: + +- File browser allows you to directly browse and operate different HDFS directories on the GUI. + +- Query editor can write simple SQL statements to query data stored on Hadoop, for example, HDFS, HBase, and Hive. With the query editor, you can easily create, manage, and execute SQL statements and download the execution results as an Excel file. + +On the WebUI provided by Hue, you can perform the following operations on the components: + +- HDFS: + + - View, create, manage, rename, move, and delete files or directories. + - File upload and download + - Search for files, directories, file owners, and user groups; change the owners and permissions of the files and directories. + - Manually configure HDFS directory storage policies and dynamic storage policies. + +- Hive: + + - Edit and execute SQL/HQL statements. Save, copy, and edit the SQL/HQL template. Explain SQL/HQL statements. Save the SQL/HQL statement and query it. + - Database presentation and data table presentation + - Supporting different types of Hadoop storage + - Use MetaStore to add, delete, modify, and query databases, tables, and views. + + .. note:: + + If Internet Explorer is used to access the Hue page to execute HiveSQL statements, the execution fails, because the browser has functional problems. You are advised to use a compatible browser, for example, Google Chrome. + +- MapReduce: Check MapReduce tasks that are being executed or have been finished in the clusters, including their status, start and end time, and run logs. +- Oozie: Hue provides the Oozie job manager function, in this case, you can use Oozie in GUI mode. +- ZooKeeper: Hue provides the ZooKeeper browser function for you to use ZooKeeper in GUI mode. + +For details about Hue, visit `https://gethue.com/ `__. + +Architecture +------------ + +Hue, adopting the MTV (Model-Template-View) design, is a web application program running on Django Python. (Django Python is a web application framework that uses open source codes.) + +Hue consists of Supervisor Process and WebServer. Supervisor Process is the core Hue process that manages application processes. Supervisor Process and WebServer interact with applications on WebServer through Thrift/REST APIs, as shown in :ref:`Figure 1 `. + +.. _mrs_08_00121__fig53047075153153: + +.. figure:: /_static/images/en-us_image_0000001296750298.png + :alt: **Figure 1** Hue architecture + + **Figure 1** Hue architecture + +:ref:`Table 1 ` describes the components shown in :ref:`Figure 1 `. + +.. _mrs_08_00121__table10504539153153: + +.. 
table:: **Table 1** Architecture description + + +-----------------------------------+--------------------------------------------------------------------------------------------------------+ + | Connection Name | Description | + +===================================+========================================================================================================+ + | Supervisor Process | Manages processes of WebServer applications, such as starting, stopping, and monitoring the processes. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------+ + | Hue WebServer | Provides the following functions through the Django Python web framework: | + | | | + | | - Deploys applications. | + | | - Provides the GUI. | + | | - Connects to databases to store persistent data of applications. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/overview/components/hue/hue_enhanced_open_source_features.rst b/umn/source/overview/components/hue/hue_enhanced_open_source_features.rst new file mode 100644 index 0000000..207d38b --- /dev/null +++ b/umn/source/overview/components/hue/hue_enhanced_open_source_features.rst @@ -0,0 +1,15 @@ +:original_name: mrs_08_00123.html + +.. _mrs_08_00123: + +Hue Enhanced Open Source Features +================================= + + +Hue Enhanced Open Source Features +--------------------------------- + +- Storage policy: The number of HDFS file copies varies depending on the storage media. This feature allows you to manually set an HDFS directory storage policy or can automatically adjust the file storage policy, modify the number of file copies, move the file directory, and delete files based on the latest access time and modification time of HDFS files to fully utilize storage capacity and improve storage performance. + +- MR engine: You can use the MapReduce engine to execute Hive SQL statements. +- Reliability enhancement: Hue is deployed in active/standby mode. When interconnecting with HDFS, Oozie, Hive, and YARN, Hue can work in failover or load balancing mode. diff --git a/umn/source/overview/components/hue/index.rst b/umn/source/overview/components/hue/index.rst new file mode 100644 index 0000000..db743af --- /dev/null +++ b/umn/source/overview/components/hue/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_08_0012.html + +.. _mrs_08_0012: + +Hue +=== + +- :ref:`Hue Basic Principles ` +- :ref:`Relationship Between Hue and Other Components ` +- :ref:`Hue Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + hue_basic_principles + relationship_between_hue_and_other_components + hue_enhanced_open_source_features diff --git a/umn/source/overview/components/hue/relationship_between_hue_and_other_components.rst b/umn/source/overview/components/hue/relationship_between_hue_and_other_components.rst new file mode 100644 index 0000000..1d09855 --- /dev/null +++ b/umn/source/overview/components/hue/relationship_between_hue_and_other_components.rst @@ -0,0 +1,39 @@ +:original_name: mrs_08_00122.html + +.. _mrs_08_00122: + +Relationship Between Hue and Other Components +============================================= + +Relationship Between Hue and Hadoop Clusters +-------------------------------------------- + +:ref:`Table 1 ` shows how Hue interacts with Hadoop clusters. + +.. _mrs_08_00122__table11348180: + +.. 
table:: **Table 1** Relationship Between Hue and Other Components + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Connection Name | Description | + +===================================+======================================================================================================================================================================================================================================================================================+ + | HDFS | HDFS provides REST APIs to interact with Hue to query and operate HDFS files. | + | | | + | | Hue packages a user request into interface data, sends the request to HDFS through REST APIs, and displays execution results on the web UI. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Hive | Hive provides Thrift interfaces to interact with Hue, execute Hive SQL statements, and query table metadata. | + | | | + | | If you edit HQL statements on the Hue web UI, then, Hue submits the HQL statements to the Hive server through the Thrift APIs and displays execution results on the web UI. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | YARN/MapReduce | MapReduce provides REST APIs to interact with Hue and query YARN job information. | + | | | + | | If you go to the Hue web UI, enter the filter parameters, the UI sends the parameters to the background, and Hue invokes the REST APIs provided by MapReduce (MR1/MR2-YARN) to obtain information such as the status of the task running, the start/end time, the run log, and more. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Oozie | Oozie provides REST APIs to interact with Hue, create workflows, coordinators, and bundles, and manage and monitor tasks. | + | | | + | | A graphical workflow, coordinator, and bundle editor are provided on the Hue web UI. Hue invokes the REST APIs of Oozie to create, modify, delete, submit, and monitor workflows, coordinators, and bundles. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ZooKeeper | ZooKeeper provides REST APIs to interact with Hue and query ZooKeeper node information. | + | | | + | | ZooKeeper node information is displayed in the Hue web UI. 
Hue invokes the REST APIs of ZooKeeper to obtain the node information. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/overview/components/index.rst b/umn/source/overview/components/index.rst new file mode 100644 index 0000000..0a22125 --- /dev/null +++ b/umn/source/overview/components/index.rst @@ -0,0 +1,68 @@ +:original_name: mrs_08_0052.html + +.. _mrs_08_0052: + +Components +========== + +- :ref:`Alluxio ` +- :ref:`CarbonData ` +- :ref:`ClickHouse ` +- :ref:`DBService ` +- :ref:`Flink ` +- :ref:`Flume ` +- :ref:`HBase ` +- :ref:`HDFS ` +- :ref:`HetuEngine ` +- :ref:`Hive ` +- :ref:`Hudi ` +- :ref:`Hue ` +- :ref:`Kafka ` +- :ref:`KafkaManager ` +- :ref:`KrbServer and LdapServer ` +- :ref:`Loader ` +- :ref:`Manager ` +- :ref:`MapReduce ` +- :ref:`Oozie ` +- :ref:`OpenTSDB ` +- :ref:`Presto ` +- :ref:`Ranger ` +- :ref:`Spark ` +- :ref:`Spark2x ` +- :ref:`Storm ` +- :ref:`Tez ` +- :ref:`Yarn ` +- :ref:`ZooKeeper ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + alluxio + carbondata + clickhouse + dbservice/index + flink/index + flume/index + hbase/index + hdfs/index + hetuengine/index + hive/index + hudi + hue/index + kafka/index + kafkamanager + krbserver_and_ldapserver/index + loader/index + manager/index + mapreduce/index + oozie/index + opentsdb + presto + ranger/index + spark/index + spark2x/index + storm/index + tez + yarn/index + zookeeper/index diff --git a/umn/source/overview/components/kafka/index.rst b/umn/source/overview/components/kafka/index.rst new file mode 100644 index 0000000..477359c --- /dev/null +++ b/umn/source/overview/components/kafka/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_08_0013.html + +.. _mrs_08_0013: + +Kafka +===== + +- :ref:`Kafka Basic Principles ` +- :ref:`Relationship Between Kafka and Other Components ` +- :ref:`Kafka Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + kafka_basic_principles + relationship_between_kafka_and_other_components + kafka_enhanced_open_source_features diff --git a/umn/source/overview/components/kafka/kafka_basic_principles.rst b/umn/source/overview/components/kafka/kafka_basic_principles.rst new file mode 100644 index 0000000..b8b52b7 --- /dev/null +++ b/umn/source/overview/components/kafka/kafka_basic_principles.rst @@ -0,0 +1,91 @@ +:original_name: mrs_08_00131.html + +.. _mrs_08_00131: + +Kafka Basic Principles +====================== + +`Kafka `__ is an open source, distributed, partitioned, and replicated commit log service. Kafka is publish-subscribe messaging, rethought as a distributed commit log. It provides features similar to Java Message Service (JMS) but another design. It features message endurance, high throughput, distributed methods, multi-client support, and real time. It applies to both online and offline message consumption, such as regular message collection, website activeness tracking, aggregation of statistical system operation data (monitoring data), and log collection. These scenarios engage large amounts of data collection for Internet services. + +Kafka Structure +--------------- + +Producers publish data to topics, and consumers subscribe to the topics and consume messages. A broker is a server in a Kafka cluster. 
For each topic, the Kafka cluster maintains partitions for scalability, parallelism, and fault tolerance. Each partition is an ordered, immutable sequence of messages that is continually appended to - a commit log. Each message in a partition is assigned a sequential ID, which is called offset. + + +.. figure:: /_static/images/en-us_image_0000001349190373.png + :alt: **Figure 1** Kafka architecture + + **Figure 1** Kafka architecture + +.. table:: **Table 1** Kafka architecture description + + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | Description | + +===========+=================================================================================================================================================================================================================================================================+ + | Broker | A broker is a server in a Kafka cluster. | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Topic | A topic is a category or feed name to which messages are published. A topic can be divided into multiple partitions, which can act as a parallel unit. | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Partition | A partition is an ordered, immutable sequence of messages that is continually appended to - a commit log. The messages in the partitions are each assigned a sequential ID number called the offset that uniquely identifies each message within the partition. | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Producer | Producers publish messages to a Kafka topic. | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Consumer | Consumers subscribe to topics and process the feed of published messages. | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +:ref:`Figure 2 ` shows the relationships between modules. + +.. _mrs_08_00131__fig158402401262: + +.. figure:: /_static/images/en-us_image_0000001349110501.png + :alt: **Figure 2** Relationships between Kafka modules + + **Figure 2** Relationships between Kafka modules + +Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group. 
If all the consumer instances belong to the same consumer group, loads are evenly distributed among the consumers. As shown in the preceding figure, Consumer1 and Consumer2 work in load-sharing mode; Consumer3, Consumer4, Consumer5, and Consumer6 work in load-sharing mode. If all the consumer instances belong to different consumer groups, messages are broadcast to all consumers. As shown in the preceding figure, the messages in Topic 1 are broadcast to all consumers in Consumer Group1 and Consumer Group2. + +For details about Kafka architecture and principles, see https://kafka.apache.org/24/documentation.html. + +Principle +--------- + +- **Message Reliability** + + When a Kafka broker receives a message, it stores the message on a disk persistently. Each partition of a topic has multiple replicas stored on different broker nodes. If one node is faulty, the replicas on other nodes can be used. + +- **High Throughput** + + Kafka provides high throughput in the following ways: + + - Messages are written to disks instead of being cached in memory, fully utilizing the sequential read and write performance of disks. + - Zero-copy is used to eliminate unnecessary data copies during I/O. + - Data is sent in batches, improving network utilization. + - Each topic is divided into multiple partitions, which increases concurrent processing. Concurrent read and write operations can be performed between multiple producers and consumers. Producers send messages to specified partitions based on the algorithm used. + +- **Message Subscribe-Notify Mechanism** + + Consumers subscribe to interested topics and consume data in pull mode. Consumers can choose the consumption mode, such as batch consumption, repeated consumption, and consumption from the end, and control the message pulling speed based on the actual situation. Consumers need to maintain their own consumption records. + +- **Scalability** + + When broker nodes are added to expand the Kafka cluster capacity, the newly added brokers register with ZooKeeper. After the registration is successful, producers and consumers can sense the change in a timely manner and make related adjustments. + +Open Source Features +-------------------- + +- Reliability + + Message processing methods such as **At-Least Once**, **At-Most Once**, and **Exactly Once** are provided. The message processing status is maintained by consumers. Kafka needs to work with the application layer to implement **Exactly Once**. + +- High throughput + + High throughput is provided for message publishing and subscription. + +- Persistence + + Messages are stored on disks and can be used for batch consumption and by real-time applications. Data persistence and replication prevent data loss. + +- Distribution + + As a distributed system, Kafka is easy to scale out. Producers, brokers, and consumers all support distributed deployment across multiple clusters, and the system can be scaled without stopping the software or shutting down the machines. diff --git a/umn/source/overview/components/kafka/kafka_enhanced_open_source_features.rst b/umn/source/overview/components/kafka/kafka_enhanced_open_source_features.rst new file mode 100644 index 0000000..6b3940c --- /dev/null +++ b/umn/source/overview/components/kafka/kafka_enhanced_open_source_features.rst @@ -0,0 +1,23 @@ +:original_name: mrs_08_00133.html + +.. 
_mrs_08_00133: + +Kafka Enhanced Open Source Features +=================================== + + +Kafka Enhanced Open Source Features +----------------------------------- + +- Monitors the following topic-level metrics: + + - Topic Input Traffic + - Topic Output Traffic + - Topic Rejected Traffic + - Number of Failed Fetch Requests Per Second + - Number of Failed Produce Requests Per Second + - Number of Topic Input Messages Per Second + - Number of Fetch Requests Per Second + - Number of Produce Requests Per Second + +- Queries the mapping between broker IDs and node IP addresses. On Linux clients, **kafka-broker-info.sh** can be used to query the mapping between broker IDs and node IP addresses. diff --git a/umn/source/overview/components/kafka/relationship_between_kafka_and_other_components.rst b/umn/source/overview/components/kafka/relationship_between_kafka_and_other_components.rst new file mode 100644 index 0000000..63f28c1 --- /dev/null +++ b/umn/source/overview/components/kafka/relationship_between_kafka_and_other_components.rst @@ -0,0 +1,14 @@ +:original_name: mrs_08_00132.html + +.. _mrs_08_00132: + +Relationship Between Kafka and Other Components +=============================================== + +As a message publishing and subscription system, Kafka provides high-speed data transmission methods for data transmission between different subsystems of the FusionInsight platform. It can receive external messages in a real-time manner and provides the messages to the online and offline services for processing. The following figure shows the relationship between Kafka and other components. + + +.. figure:: /_static/images/en-us_image_0000001296270858.png + :alt: **Figure 1** Relationship with Other Components + + **Figure 1** Relationship with Other Components diff --git a/umn/source/overview/components/kafkamanager.rst b/umn/source/overview/components/kafkamanager.rst new file mode 100644 index 0000000..b261f12 --- /dev/null +++ b/umn/source/overview/components/kafkamanager.rst @@ -0,0 +1,24 @@ +:original_name: mrs_08_0032.html + +.. _mrs_08_0032: + +KafkaManager +============ + +KafkaManager is a tool for managing Apache Kafka and provides GUI-based metric monitoring and management of Kafka clusters. + +KafkaManager supports the following operations: + +- Manage multiple Kafka clusters. +- Easy inspection of cluster states (topics, consumers, offsets, partitions, replicas, and nodes) +- Run preferred replica election. +- Generate partition assignments with option to select brokers to use. +- Run reassignment of partition (based on generated assignments). +- Create a topic with optional topic configurations (Multiple Kafka cluster versions are supported). +- Delete a topic (only supported on 0.8.2+ and **delete.topic.enable=true** is set in broker configuration). +- Batch generate partition assignments for multiple topics with option to select brokers to use. +- Batch run reassignment of partitions for multiple topics. +- Add partitions to an existing topic. +- Update configurations for an existing topic. +- Optionally enable JMX polling for broker-level and topic-level metrics. +- Optionally filter out consumers that do not have ids/ owner / & offsets/ directories in ZooKeeper. 
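The consumer group behavior described in the Kafka Basic Principles section above can be reproduced with the standard Apache Kafka Java client, as in the following minimal sketch. The broker address, topic name, and group ID are placeholders, and a secured MRS cluster would additionally require Kerberos/SASL client configuration.

.. code-block:: java

   import java.time.Duration;
   import java.util.Collections;
   import java.util.Properties;

   import org.apache.kafka.clients.consumer.ConsumerRecord;
   import org.apache.kafka.clients.consumer.ConsumerRecords;
   import org.apache.kafka.clients.consumer.KafkaConsumer;

   public class ConsumerGroupExample {
       public static void main(String[] args) {
           Properties props = new Properties();
           // Placeholder broker address; replace with the cluster's broker list.
           props.put("bootstrap.servers", "broker-1:9092");
           // Instances that share this group ID split the topic's partitions among themselves.
           props.put("group.id", "demo-consumer-group");
           props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
           props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

           try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
               // Subscribe to a placeholder topic and pull messages (Kafka consumers work in pull mode).
               consumer.subscribe(Collections.singletonList("demo-topic"));
               while (true) {
                   ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                   for (ConsumerRecord<String, String> record : records) {
                       System.out.printf("partition=%d offset=%d value=%s%n",
                               record.partition(), record.offset(), record.value());
                   }
               }
           }
       }
   }

Starting several instances of this program with the same ``group.id`` causes the topic's partitions to be shared among them (load-sharing mode); using a different group ID for each instance makes every instance receive all messages (broadcast).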
diff --git a/umn/source/overview/components/krbserver_and_ldapserver/index.rst b/umn/source/overview/components/krbserver_and_ldapserver/index.rst new file mode 100644 index 0000000..167dbea --- /dev/null +++ b/umn/source/overview/components/krbserver_and_ldapserver/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_08_0064.html + +.. _mrs_08_0064: + +KrbServer and LdapServer +======================== + +- :ref:`KrbServer and LdapServer Principles ` +- :ref:`KrbServer and LdapServer Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + krbserver_and_ldapserver_principles + krbserver_and_ldapserver_enhanced_open_source_features diff --git a/umn/source/overview/components/krbserver_and_ldapserver/krbserver_and_ldapserver_enhanced_open_source_features.rst b/umn/source/overview/components/krbserver_and_ldapserver/krbserver_and_ldapserver_enhanced_open_source_features.rst new file mode 100644 index 0000000..687769a --- /dev/null +++ b/umn/source/overview/components/krbserver_and_ldapserver/krbserver_and_ldapserver_enhanced_open_source_features.rst @@ -0,0 +1,24 @@ +:original_name: mrs_08_00642.html + +.. _mrs_08_00642: + +KrbServer and LdapServer Enhanced Open Source Features +====================================================== + +Enhanced open-source features of KrbServer and LdapServer: intra-cluster service authentication +----------------------------------------------------------------------------------------------- + +In an MRS cluster that uses the security mode, mutual access between services is implemented based on the Kerberos security architecture. When a service (such as HDFS) in the cluster is to be started, the corresponding sessionkey (keytab, used for identity authentication of the application) is obtained from Kerberos. If another service (such as YARN) needs to access HDFS and add, delete, modify, or query data in HDFS, the corresponding TGT and ST must be obtained for secure access. + +Enhanced Open-Source Features of KrbServer and LdapServer: Application Development Authentication +------------------------------------------------------------------------------------------------- + +MRS components provide application development interfaces for customers or upper-layer service product clusters. During application development, a cluster in security mode provides specified application development authentication interfaces to implement application security authentication and access. For example, the UserGroupInformation class provided by the hadoop-common API provides multiple security authentication APIs. + +- **setConfiguration()** is used to obtain related configuration and set parameters such as global variables. +- **loginUserFromKeytab():** is used to obtain TGT interfaces. + +Enhanced Open-Source Features of KrbServer and LdapServer: Cross-System Mutual Trust +------------------------------------------------------------------------------------ + +MRS provides the mutual trust function between two Managers to implement data read and write operations between systems. diff --git a/umn/source/overview/components/krbserver_and_ldapserver/krbserver_and_ldapserver_principles.rst b/umn/source/overview/components/krbserver_and_ldapserver/krbserver_and_ldapserver_principles.rst new file mode 100644 index 0000000..2dca521 --- /dev/null +++ b/umn/source/overview/components/krbserver_and_ldapserver/krbserver_and_ldapserver_principles.rst @@ -0,0 +1,102 @@ +:original_name: mrs_08_00641.html + +.. 
_mrs_08_00641: + +KrbServer and LdapServer Principles +=================================== + +Overview +-------- + +To manage the access control permissions on data and resources in a cluster, it is recommended that the cluster be installed in security mode. In security mode, a client application must be authenticated and a secure session must be established before the application accesses any resource in the cluster. MRS uses KrbServer to provide Kerberos authentication for all components, implementing a reliable authentication mechanism. + +LdapServer supports Lightweight Directory Access Protocol (LDAP) and provides the capability of storing user and user group data for Kerberos authentication. + +Architecture +------------ + +The security authentication function for user login depends on Kerberos and LDAP. + +.. _mrs_08_00641__fig6453372512431: + +.. figure:: /_static/images/en-us_image_0000001349309949.png + :alt: **Figure 1** Security authentication architecture + + **Figure 1** Security authentication architecture + +:ref:`Figure 1 ` includes three scenarios: + +- Logging in to the MRS Manager Web UI + + The authentication architecture includes steps 1, 2, 3, and 4. + +- Logging in to a component web UI + + The authentication architecture includes steps 5, 6, 7, and 8. + +- Accessing between components + + The authentication architecture includes step 9. + +.. table:: **Table 1** Key modules + + +-----------------+-------------------------------------------------------------------------------------+ + | Connection Name | Description | + +=================+=====================================================================================+ + | Manager | Cluster Manager | + +-----------------+-------------------------------------------------------------------------------------+ + | Manager WS | WebBrowser | + +-----------------+-------------------------------------------------------------------------------------+ + | Kerberos1 | KrbServer (management plane) service deployed in MRS Manager, that is, OMS Kerberos | + +-----------------+-------------------------------------------------------------------------------------+ + | Kerberos2 | KrbServer (service plane) service deployed in the cluster | + +-----------------+-------------------------------------------------------------------------------------+ + | LDAP1 | LdapServer (management plane) service deployed in MRS Manager, that is, OMS LDAP | + +-----------------+-------------------------------------------------------------------------------------+ + | LDAP2 | LdapServer (service plane) service deployed in the cluster | + +-----------------+-------------------------------------------------------------------------------------+ + +Data operation mode of Kerberos1 in LDAP: The active and standby instances of LDAP1 and the two standby instances of LDAP2 can be accessed in load balancing mode. Data write operations can be performed only in the active LDAP1 instance. Data read operations can be performed in LDAP1 or LDAP2. + +Data operation mode of Kerberos2 in LDAP: Data read operations can be performed in LDAP1 and LDAP2. Data write operations can be performed only in the active LDAP1 instance. + +Principle +--------- + +**Kerberos authentication** + + +.. figure:: /_static/images/en-us_image_0000001349390661.png + :alt: **Figure 2** Authentication process + + **Figure 2** Authentication process + +**LDAP data read and write** + + +.. 
figure:: /_static/images/en-us_image_0000001296750266.png + :alt: **Figure 3** Data modification process + + **Figure 3** Data modification process + +**LDAP data synchronization** + +- OMS LDAP data synchronization before cluster installation + + + .. figure:: /_static/images/en-us_image_0000001296590650.png + :alt: **Figure 4** OMS LDAP data synchronization + + **Figure 4** OMS LDAP data synchronization + + Data synchronization direction before cluster installation: Data is synchronized from the active OMS LDAP to the standby OMS LDAP. + +- LDAP data synchronization after cluster installation + + + .. figure:: /_static/images/en-us_image_0000001296270830.png + :alt: **Figure 5** LDAP data synchronization + + **Figure 5** LDAP data synchronization + + Data synchronization direction after cluster installation: Data is synchronized from the active OMS LDAP to the standby OMS LDAP, standby component LDAP, and standby component LDAP. diff --git a/umn/source/overview/components/loader/index.rst b/umn/source/overview/components/loader/index.rst new file mode 100644 index 0000000..633919a --- /dev/null +++ b/umn/source/overview/components/loader/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_08_0017.html + +.. _mrs_08_0017: + +Loader +====== + +- :ref:`Loader Basic Principles ` +- :ref:`Relationship Between Loader and Other Components ` +- :ref:`Loader Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + loader_basic_principles + relationship_between_loader_and_other_components + loader_enhanced_open_source_features diff --git a/umn/source/overview/components/loader/loader_basic_principles.rst b/umn/source/overview/components/loader/loader_basic_principles.rst new file mode 100644 index 0000000..ef6bdd3 --- /dev/null +++ b/umn/source/overview/components/loader/loader_basic_principles.rst @@ -0,0 +1,79 @@ +:original_name: mrs_08_00171.html + +.. _mrs_08_00171: + +Loader Basic Principles +======================= + +`Loader `__ is developed based on the open source Sqoop component. It is used to exchange data and files between MRS and relational databases and file systems. Loader can import data from relational databases or file servers to the HDFS and HBase components, or export data from HDFS and HBase to relational databases or file servers. + +A Loader model consists of Loader Client and Loader Server, as shown in :ref:`Figure 1 `. + +.. _mrs_08_00171__fig159938821619: + +.. figure:: /_static/images/en-us_image_0000001349309893.png + :alt: **Figure 1** Loader model + + **Figure 1** Loader model + +:ref:`Table 1 ` describes the functions of each module shown in the preceding figure. + +.. _mrs_08_00171__table691314237323: + +.. table:: **Table 1** Components of the Loader model + + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Description | + +=====================+==================================================================================================================================================================+ + | Loader Client | Loader client. It provides two interfaces: web UI and CLI. | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Loader Server | Loader server. 
It processes operation requests sent from the client, manages connectors and metadata, submits MapReduce jobs, and monitors MapReduce job status. | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | REST API | It provides a Representational State Transfer (RESTful) APIs (HTTP + JSON) to process the operation requests sent from the client. | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Job Scheduler | Simple job scheduler. It periodically executes Loader jobs. | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Transform Engine | Data transformation engine. It supports field combination, string cutting, and string reverse. | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Execution Engine | Loader job execution engine. It executes Loader jobs in MapReduce manner. | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Submission Engine | Loader job submission engine. It submits Loader jobs to MapReduce. | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Job Manager | It manages Loader jobs, including creating, querying, updating, deleting, activating, deactivating, starting, and stopping jobs. | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Metadata Repository | Metadata repository. It stores and manages data about Loader connectors, transformation procedures, and jobs. | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | HA Manager | It manages the active/standby status of Loader Server processes. The Loader Server has two nodes that are deployed in active/standby mode. | + +---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Loader imports or exports jobs in parallel using MapReduce jobs. Some job import or export may involve only the Map operations, while some may involve both Map and Reduce operations. + +Loader implements fault tolerance using MapReduce. Jobs can be rescheduled upon a job execution failure. + +- **Importing data to HBase** + + When the Map operation is performed for MapReduce jobs, Loader obtains data from an external data source. + + When a Reduce operation is performed for a MapReduce job, Loader enables the same number of Reduce tasks based on the number of Regions. The Reduce tasks receive data from Map tasks, generate HFiles by Region, and store the HFiles in a temporary directory of HDFS. 
+ + When a MapReduce job is submitted, Loader migrates HFiles from the temporary directory to the HBase directory. + +- **Importing Data to HDFS** + + When a Map operation is performed for a MapReduce job, Loader obtains data from an external data source and exports the data to a temporary directory (named *export directory*\ **-ldtmp**). + + When a MapReduce job is submitted, Loader migrates data from the temporary directory to the output directory. + +- **Exporting data to a relational database** + + When a Map operation is performed for a MapReduce job, Loader obtains data from HDFS or HBase and inserts the data to a temporary table (Staging Table) through the Java DataBase Connectivity (JDBC) API. + + When a MapReduce job is submitted, Loader migrates data from the temporary table to a formal table. + +- **Exporting data to a file system** + + When a Map operation is performed for a MapReduce job, Loader obtains data from HDFS or HBase and writes the data to a temporary directory of the file server. + + When a MapReduce job is submitted, Loader migrates data from the temporary directory to a formal directory. + +For details about the Loader architecture and principles, see https://sqoop.apache.org/docs/1.99.3/index.html. diff --git a/umn/source/overview/components/loader/loader_enhanced_open_source_features.rst b/umn/source/overview/components/loader/loader_enhanced_open_source_features.rst new file mode 100644 index 0000000..7c7af25 --- /dev/null +++ b/umn/source/overview/components/loader/loader_enhanced_open_source_features.rst @@ -0,0 +1,45 @@ +:original_name: mrs_08_00173.html + +.. _mrs_08_00173: + +Loader Enhanced Open Source Features +==================================== + +Loader Enhanced Open-Source Feature: Data Import and Export +----------------------------------------------------------- + +Loader is developed based on Sqoop. In addition to the Sqoop functions, Loader has the following enhanced features: + +- Provides data conversion functions. +- Supports GUI-based configuration conversion. +- Imports data from an SFTP/FTP server to HDFS/OBS. +- Imports data from an SFTP/FTP server to an HBase table. +- Imports data from an SFTP/FTP server to a Phoenix table. +- Imports data from an SFTP/FTP server to a Hive table. +- Exports data from HDFS/OBS to an SFTP/FTP server. +- Exports data from an HBase table to an SFTP/FTP server. +- Exports data from a Phoenix table to an SFTP/FTP server. +- Imports data from a relational database to an HBase table. +- Imports data from a relational database to a Phoenix table. +- Imports data from a relational database to a Hive table. +- Exports data from an HBase table to a relational database. +- Exports data from a Phoenix table to a relational database. +- Imports data from an Oracle partitioned table to HDFS/OBS. +- Imports data from an Oracle partitioned table to an HBase table. +- Imports data from an Oracle partitioned table to a Phoenix table. +- Imports data from an Oracle partitioned table to a Hive table. +- Exports data from HDFS/OBS to an Oracle partitioned table. +- Exports data from HBase to an Oracle partitioned table. +- Exports data from a Phoenix table to an Oracle partitioned table. +- Imports data from HDFS to an HBase table, a Phoenix table, and a Hive table in the same cluster. +- Exports data from an HBase table and a Phoenix table to HDFS/OBS in the same cluster. +- Imports data to an HBase table and a Phoenix table by using **bulkload** or **put list**. +- Imports all types of files from an SFTP/FTP server to HDFS. 
The open source component Sqoop can import only text files. +- Exports all types of files from HDFS/OBS to an SFTP server. The open source component Sqoop can export only text files and SequenceFile files. +- Supports file coding format conversion during file import and export. The supported coding formats include all formats supported by Java Development Kit (JDK). +- Retains the original directory structure and file names during file import and export. +- Supports file combination during file import and export. For example, if a large number of files are to be imported, these files can be combined into *n* files (*n* can be configured). +- Supports file filtering during file import and export. The filtering rules support wildcards and regular expressions. +- Supports batch import and export of ETL tasks. +- Supports query by page and key word and group management of ETL tasks. +- Provides floating IP addresses for external components. diff --git a/umn/source/overview/components/loader/relationship_between_loader_and_other_components.rst b/umn/source/overview/components/loader/relationship_between_loader_and_other_components.rst new file mode 100644 index 0000000..7470956 --- /dev/null +++ b/umn/source/overview/components/loader/relationship_between_loader_and_other_components.rst @@ -0,0 +1,8 @@ +:original_name: mrs_08_00172.html + +.. _mrs_08_00172: + +Relationship Between Loader and Other Components +================================================ + +The components that interact with Loader include HDFS, HBase, MapReduce, and ZooKeeper. Loader works as a client to use certain functions of these components, such as storing data to HDFS and HBase and reading data from HDFS and HBase tables. In addition, Loader functions as an MapReduce client to import or export data. diff --git a/umn/source/overview/components/manager/index.rst b/umn/source/overview/components/manager/index.rst new file mode 100644 index 0000000..ad7001a --- /dev/null +++ b/umn/source/overview/components/manager/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_08_0066.html + +.. _mrs_08_0066: + +Manager +======= + +- :ref:`Manager Basic Principles ` +- :ref:`Manager Key Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + manager_basic_principles + manager_key_features diff --git a/umn/source/overview/components/manager/manager_basic_principles.rst b/umn/source/overview/components/manager/manager_basic_principles.rst new file mode 100644 index 0000000..4365182 --- /dev/null +++ b/umn/source/overview/components/manager/manager_basic_principles.rst @@ -0,0 +1,98 @@ +:original_name: mrs_08_00661.html + +.. _mrs_08_00661: + +Manager Basic Principles +======================== + +Overview +-------- + +Manager is the O&M management system of MRS and provides unified cluster management capabilities for services deployed in clusters. + +Manager provides functions such as performance monitoring, alarms, user management, permission management, auditing, service management, health check, and log collection. + +Architecture +------------ + +:ref:`Figure 1 ` shows the overall logical architecture of FusionInsight Manager. + +.. _mrs_08_00661__fig17686133154020: + +.. figure:: /_static/images/en-us_image_0000001349390649.png + :alt: **Figure 1** Manager logical architecture + + **Figure 1** Manager logical architecture + +Manager consists of OMS and OMA. + +- OMS: serves as management node in the O&M system. There are two OMS nodes deployed in active/standby mode. +- OMA: managed node in the O&M system. 
Generally, there are multiple OMA nodes. + +:ref:`Figure 1 ` describes the modules shown in :ref:`Table 1 `. + +.. _mrs_08_00661__table3395731715355: + +.. table:: **Table 1** Service module description + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Description | + +===================================+==========================================================================================================================================================================================================================================================================+ + | Web Service | A web service deployed under Tomcat, providing HTTPS API of Manager. It is used to access Manager through the web browser. In addition, it provides the northbound access capability based on the Syslog and SNMP protocols. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | OMS | Management node of the O&M system. Generally, there are two OMS nodes that work in active/standby mode. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | OMA | Managed node in the O&M system. Generally, there are multiple OMA nodes. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Controller | The control center of Manager. It can converge information from all nodes in the cluster and display it to administrators, as well as receive from administrators, and synchronize information to all nodes in the cluster according to the operation instruction range. | + | | | + | | Control process of Manager. It implements various management actions: | + | | | + | | #. The web service delivers various management actions (such as installation, service startup and stop, and configuration modification) to Controller. | + | | #. Controller decomposes the command and delivers the action to each Node Agent, for example, starting a service involves multiple roles and instances. | + | | #. Controller is responsible for monitoring the implementation of each action. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Node Agent | Node Agent exists on each cluster node and is an enabler of Manager on a single node. 
| + | | | + | | - Node Agent represents all the components deployed on the node to interact with Controller, implementing convergence from multiple nodes of a cluster to a single node. | + | | - Node Agent enables Controller to perform all operations on the components deployed on the node. It allows Controller functions to be implemented. | + | | | + | | Node Agent sends heartbeat messages to Controller at an interval of 3 seconds. The interval cannot be configured. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | IAM | Records audit logs. Each non-query operation on the Manager UI has a related audit log. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | PMS | The performance monitoring module. It collects the performance monitoring data on each OMA and provides the query function. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | CEP | Convergence function module. For example, the used disk space of all OMAs is collected as a performance indicator. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FMS | Alarm module. It collects and queries alarms on each OMA. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | OMM Agent | Agent for performance monitoring and alarm reporting on the OMA. It collects performance monitoring data and alarm data on Agent Node. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | CAS | Unified authentication center. When a user logs in to the web service, CAS authenticates the login. The browser automatically redirects the user to the CAS through URLs. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | AOS | Permission management module. It manages the permissions of users and user groups. 
| + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ACS | User and user group management module. It manages users and user groups to which users belong. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Kerberos | LDAP is deployed in OMS and a cluster, respectively. | + | | | + | | - OMS Kerberos provides the single sign-on (SSO) and authentication between Controller and Node Agent. | + | | - Kerberos in the cluster provides the user security authentication function for components. The service name is **KrbServer**, which contains two role instances: | + | | | + | | - KerberosServer: is an authentication server that provides security authentication for MRS. | + | | - KerberosAdmin: manages processes of Kerberos users. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Ldap | LDAP is deployed in OMS and a cluster, respectively. | + | | | + | | - OMS LDAP provides data storage for user authentication. | + | | - The LDAP in the cluster functions as the backup of the OMS LDAP. The service name is **LdapServer** and the role instance is **SlapdServer**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Database | Manager database used to store logs and alarms. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | HA | HA management module that manages the active and standby OMSs. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | NTP Server | It synchronizes the system clock of each node in the cluster. 
| + | | | + | NTP Client | | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/overview/components/manager/manager_key_features.rst b/umn/source/overview/components/manager/manager_key_features.rst new file mode 100644 index 0000000..138756a --- /dev/null +++ b/umn/source/overview/components/manager/manager_key_features.rst @@ -0,0 +1,52 @@ +:original_name: mrs_08_00662.html + +.. _mrs_08_00662: + +Manager Key Features +==================== + +Key Feature: Unified Alarm Monitoring +------------------------------------- + +Manager provides the visualized and convenient alarm monitoring function. Users can quickly obtain key cluster performance indicators, evaluate cluster health status, customize performance indicator display, and convert indicators to alarms. Manager can monitor the running status of all components and report alarms in real time when faults occur. The online help on the GUI allows you to view performance counters and alarm clearance methods to quickly rectify faults. + +Key Feature: Unified User Permission Management +----------------------------------------------- + +Manager provides permission management of components in a unified manner. + +Manager introduces the concept of role and uses role-based access control (RBAC) to manage system permissions. It centrally displays and manages scattered permission functions of each component in the system and organizes the permissions of each component in the form of permission sets (roles) to form a unified system permission concept. By doing so, common users cannot obtain internal permission management details, and permissions become easy for administrators to manage, greatly facilitating permission management and improving user experience. + +Key Feature: SSO +---------------- + +Single sign-on (SSO) is provided between the Manager web UI and component web UI as well as for integration between MRS and third-party systems. + +This function centrally manages and authenticates Manager users and component users. The entire system uses LDAP to manage users and uses Kerberos for authentication. A set of Kerberos and LDAP management mechanisms are used between the OMS and components. SSO (including single sign-on and single sign-out) is implemented through CAS. With SSO, users can easily switch tasks between the Manager web UI, component web UIs, and third-party systems, without switching to another user. + +.. note:: + + - To ensure security, the CAS Server can retain a ticket-granting ticket (TGT) used by a user only for 20 minutes. + - If a user does not perform any operation on the page (including on the Manager web UI and component web UIs) within 20 minutes, the page is automatically locked. + +Key Feature: Automatic Health Check and Inspection +-------------------------------------------------- + +Manager provides users with automatic inspection on system running environments and helps users check and audit system running health by one click, ensuring correct system running and lowering system operation and maintenance costs. After viewing inspection results, you can export reports for archiving and fault analysis. + +Key Feature: Tenant Management +------------------------------ + +Manager introduces the multi-tenant concept. 
The CPU, memory, and disk resources of a cluster can be integrated into a set. The set is called a tenant. A mode involving different tenants is called multi-tenant mode. + +Manager provides the multi-tenant function, supports a level-based tenant model and allows tenants to be added and deleted dynamically, achieving resource isolation. As a result, it can dynamically manage and configure the computing resources and the storage resources of tenants. + +- The computing resources indicate tenants' Yarn task queue resources. The task queue quota can be modified, and the task queue usage status and statistics can be viewed. +- The storage resources can be stored on HDFS. You can add and delete the HDFS storage directories of tenants, and set the quotas of file quantity and the storage space of the directories. + +As a unified tenant management platform of MRS, MRS Manager allows users to create and manage tenants in clusters based on service requirements. + +- Roles, computing resources, and storage resources are automatically created when tenants are created. By default, all permissions of the new computing resources and storage resources are allocated to a tenant's roles. +- After you have modified the tenant's computing or storage resources, permissions of the tenant's roles are automatically updated. + +Manager also provides the multi-instance function so that users can use the HBase, Hive, or Spark alone in the resource control and service isolation scenario. The multi-instance function is disabled by default and can be manually enabled. diff --git a/umn/source/overview/components/mapreduce/index.rst b/umn/source/overview/components/mapreduce/index.rst new file mode 100644 index 0000000..dd6d696 --- /dev/null +++ b/umn/source/overview/components/mapreduce/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_08_0050.html + +.. _mrs_08_0050: + +MapReduce +========= + +- :ref:`MapReduce Basic Principles ` +- :ref:`Relationship Between MapReduce and Other Components ` +- :ref:`MapReduce Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + mapreduce_basic_principles + relationship_between_mapreduce_and_other_components + mapreduce_enhanced_open_source_features diff --git a/umn/source/overview/components/mapreduce/mapreduce_basic_principles.rst b/umn/source/overview/components/mapreduce/mapreduce_basic_principles.rst new file mode 100644 index 0000000..32cc027 --- /dev/null +++ b/umn/source/overview/components/mapreduce/mapreduce_basic_principles.rst @@ -0,0 +1,32 @@ +:original_name: mrs_08_00501.html + +.. _mrs_08_00501: + +MapReduce Basic Principles +========================== + +MapReduce is the core of Hadoop. As a software architecture proposed by Google, MapReduce is used for parallel computing of large-scale datasets (larger than 1 TB). The concepts "Map" and "Reduce" and their main thoughts are borrowed from functional programming language and also borrowed from the features of vector programming language. + +Current software implementation is as follows: Specify a Map function to map a series of key-value pairs into a new series of key-value pairs, and specify a Reduce function to ensure that all values in the mapped key-value pairs share the same key. + + +.. figure:: /_static/images/en-us_image_0000001296750206.png + :alt: **Figure 1** Distributed batch processing engine + + **Figure 1** Distributed batch processing engine + +MapReduce is a software framework for processing large datasets in parallel. 
The root of MapReduce is the Map and Reduce functions in functional programming. The Map function accepts a group of data and transforms it into a key-value pair list. Each element in the input domain corresponds to a key-value pair. The Reduce function accepts the list generated by the Map function, and then shrinks the key-value pair list based on the keys. MapReduce divides a task into multiple parts and allocates them to different devices for processing. In this way, the task can be finished in a distributed environment instead of a single powerful server. + +For more information, see `MapReduce Tutorial `__. + +MapReduce structure +------------------- + +As shown in :ref:`Figure 2 `, MapReduce is integrated into YARN through the Client and ApplicationMaster interfaces of YARN, and uses YARN to apply for computing resources. + +.. _mrs_08_00501__f6e8934d9ed084e648008f4be941e4ab8: + +.. figure:: /_static/images/en-us_image_0000001296590594.png + :alt: **Figure 2** Basic architecture of Apache YARN and MapReduce + + **Figure 2** Basic architecture of Apache YARN and MapReduce diff --git a/umn/source/overview/components/mapreduce/mapreduce_enhanced_open_source_features.rst b/umn/source/overview/components/mapreduce/mapreduce_enhanced_open_source_features.rst new file mode 100644 index 0000000..edb6168 --- /dev/null +++ b/umn/source/overview/components/mapreduce/mapreduce_enhanced_open_source_features.rst @@ -0,0 +1,72 @@ +:original_name: mrs_08_00503.html + +.. _mrs_08_00503: + +MapReduce Enhanced Open Source Features +======================================= + +MapReduce Enhanced Open-Source Feature: JobHistoryServer HA +----------------------------------------------------------- + +JobHistoryServer (JHS) is the server used to view historical MapReduce task information. Currently, the open source JHS supports only single-instance services. JHS HA can solve the problem that an application fails to access the MapReduce API when SPOFs occur on the JHS, which causes the application fails to be executed. This greatly improves the high availability of the MapReduce service. + + +.. figure:: /_static/images/en-us_image_0000001296590602.png + :alt: **Figure 1** Status transition of the JobHistoryServer HA active/standby switchover + + **Figure 1** Status transition of the JobHistoryServer HA active/standby switchover + +**JobHistoryServer High Availability** + +- ZooKeeper is used to implement active/standby election and switchover. +- JHS uses the floating IP address to provide services externally. +- Both the JHS single-instance and HA deployment modes are supported. +- Only one node starts the JHS process at a time point to prevent multiple JHS operations from processing the same file. +- You can perform scale-out, scale-in, instance migration, upgrade, and health check. + +Enhanced Open Source Feature: Improving MapReduce Performance by Optimizing the Merge/Sort Process in Specific Scenarios +------------------------------------------------------------------------------------------------------------------------ + +The figure below shows the workflow of a MapReduce task. + + +.. figure:: /_static/images/en-us_image_0000001349390613.png + :alt: **Figure 2** MapReduce job + + **Figure 2** MapReduce job + + +.. figure:: /_static/images/en-us_image_0000001349190317.png + :alt: **Figure 3** MapReduce job execution flow + + **Figure 3** MapReduce job execution flow + +The Reduce process is divided into three different steps: Copy, Sort (actually supposed to be called Merge), and Reduce. 
In the Copy phase, the Reducer fetches the output of Maps from NodeManagers and stores it on the Reducer, either in memory or on disk. The Shuffle (Sort and Merge) phase then begins: all the fetched map outputs are sorted, and segments from different map outputs are merged before being sent to the Reducer. When a job has a large number of maps to be processed, the shuffle process is time-consuming. For specific tasks (for example, SQL tasks such as hash join and hash aggregation), sorting is not mandatory during shuffle; however, sorting is performed by default.
+
+This feature uses the MapReduce API to automatically disable the Sort process for such tasks. When sorting is disabled, the API directly merges the fetched map output data and sends it to the Reducer. This saves a large amount of time and significantly improves the efficiency of SQL tasks.
+
+Enhanced Open Source Feature: Small Log File Problem Solved After Optimization of MR History Server
+-----------------------------------------------------------------------------------------------------
+
+After a job running on Yarn is complete, NodeManager uses LogAggregationService to collect the generated logs, send them to HDFS, and delete them from the local file system. The logs stored on HDFS are then managed by MR HistoryServer. LogAggregationService merges the local logs generated by containers into a single log file and uploads it to HDFS, which reduces the number of log files to some extent. However, in a large-scale and busy cluster, excessive log files accumulate on HDFS after long-term running.
+
+For example, if there are 20 nodes, about 18 million log files are generated within the default clean-up period (15 days). They occupy about 18 GB of NameNode memory and slow down the HDFS response.
+
+Files stored on HDFS only need to be read and deleted. Therefore, Hadoop Archives can be used to periodically archive the directory of collected log files.
+
+**Archiving Logs**
+
+The AggregatedLogArchiveService module is added to MR HistoryServer to periodically check the number of files in the log directory. When the number of files reaches the threshold, AggregatedLogArchiveService starts an archiving task to archive the log files. After archiving, it deletes the original log files to reduce the number of log files on HDFS.
+
+**Cleaning Archived Logs**
+
+Hadoop Archives does not support deleting individual files inside an archive. Therefore, the entire archive log package must be deleted during log clean-up. The AggregatedLogDeletionService module is modified to obtain the latest log generation time in an archive. If all log files in the archive meet the clean-up requirements, the archive log package can be deleted.
+
+**Browsing Archived Logs**
+
+Hadoop Archives allows URI-based access to file content in the archive log package. Therefore, if MR HistoryServer detects that the original log files do not exist during file browsing, it redirects the URI to the archive log package to access the archived log file.
+
+.. note::
+
+   - This function invokes Hadoop Archives of HDFS for log archiving. Because Hadoop Archives runs an MR application to perform archiving, an MR execution record is added each time an archiving task is executed.
+   - Log archiving is based on the log collection function, so it takes effect only when log collection is enabled.
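+
+The following minimal sketch (the archive path is hypothetical) illustrates the URI-based access mentioned above: because Hadoop resolves the **har://** scheme through the standard FileSystem API, files inside an archive log package can be listed and read like ordinary HDFS files.
+
+.. code-block::
+
+   import org.apache.hadoop.conf.Configuration
+   import org.apache.hadoop.fs.Path
+
+   // Hypothetical location of an archive created from aggregated log files.
+   val harPath = new Path("har:///tmp/logs/archived/app-logs.har")
+
+   // The har:// scheme is resolved by the FileSystem API, so archived files
+   // can be listed in the same way as regular HDFS files.
+   val fs = harPath.getFileSystem(new Configuration())
+   fs.listStatus(harPath).foreach { status =>
+     println(s"${status.getPath} ${status.getLen} bytes")
+   }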
diff --git a/umn/source/overview/components/mapreduce/relationship_between_mapreduce_and_other_components.rst b/umn/source/overview/components/mapreduce/relationship_between_mapreduce_and_other_components.rst new file mode 100644 index 0000000..01aea65 --- /dev/null +++ b/umn/source/overview/components/mapreduce/relationship_between_mapreduce_and_other_components.rst @@ -0,0 +1,17 @@ +:original_name: mrs_08_00502.html + +.. _mrs_08_00502: + +Relationship Between MapReduce and Other Components +=================================================== + +Relationship Between MapReduce and HDFS +--------------------------------------- + +- HDFS features high fault tolerance and high throughput, and can be deployed on low-cost hardware for storing data of applications with massive data sets. +- MapReduce is a programming model used for parallel computation of large data sets (larger than 1 TB). Data computed by MapReduce comes from multiple data sources, such as Local FileSystem, HDFS, and databases. Most data comes from the HDFS. The high throughput of HDFS can be used to read massive data. After being computed, data can be stored in HDFS. + +Relationship Between MapReduce and Yarn +--------------------------------------- + +MapReduce is a computing framework running on Yarn, which is used for batch processing. MRv1 is implemented based on MapReduce in Hadoop 1.0, which is composed of programming models (new and old programming APIs), running environment (JobTracker and TaskTracker), and data processing engine (MapTask and ReduceTask). This framework is still weak in scalability, fault tolerance (JobTracker SPOF), and compatibility with multiple frameworks. (Currently, only the MapReduce computing framework is supported.) MRv2 is implemented based on MapReduce in Hadoop 2.0. The source code reuses MRv1 programming models and data processing engine implementation, and the running environment is composed of ResourceManager and ApplicationMaster. ResourceManager is a brand new resource manager system, and ApplicationMaster is responsible for cutting MapReduce job data, assigning tasks, applying for resources, scheduling tasks, and tolerating faults. diff --git a/umn/source/overview/components/oozie/index.rst b/umn/source/overview/components/oozie/index.rst new file mode 100644 index 0000000..bae61d9 --- /dev/null +++ b/umn/source/overview/components/oozie/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_08_0067.html + +.. _mrs_08_0067: + +Oozie +===== + +- :ref:`Oozie Basic Principles ` +- :ref:`Oozie Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + oozie_basic_principles + oozie_enhanced_open_source_features diff --git a/umn/source/overview/components/oozie/oozie_basic_principles.rst b/umn/source/overview/components/oozie/oozie_basic_principles.rst new file mode 100644 index 0000000..cb718e6 --- /dev/null +++ b/umn/source/overview/components/oozie/oozie_basic_principles.rst @@ -0,0 +1,64 @@ +:original_name: mrs_08_00671.html + +.. _mrs_08_00671: + +Oozie Basic Principles +====================== + +Introduction to Oozie +--------------------- + +`Oozie `__ is an open-source workflow engine that is used to schedule and coordinate Hadoop jobs. + +Architecture +------------ + +The Oozie engine is a web application integrated into Tomcat by default. Oozie uses PostgreSQL databases. + +Oozie provides an Ext-based web console, through which users can view and monitor Oozie workflows. 
Oozie provides an external REST web service API for the Oozie client to control workflows (such as starting and stopping operations), and orchestrate and run Hadoop MapReduce tasks. For details, see :ref:`Figure 1 `. + +.. _mrs_08_00671__it_hd_des_000065_mmccppss_fig_01: + +.. figure:: /_static/images/en-us_image_0000001296590678.png + :alt: **Figure 1** Oozie architecture + + **Figure 1** Oozie architecture + +:ref:`Table 1 ` describes the functions of each module shown in :ref:`Figure 1 `. + +.. _mrs_08_00671__tab01: + +.. table:: **Table 1** Architecture description + + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Connection Name | Description | + +===================+====================================================================================================================================================================================================================================+ + | Console | Allows users to view and monitor Oozie workflows. | + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Client | Controls workflows, including submitting, starting, running, planting, and restoring workflows, through APIs. | + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SDK | Is short for software development kit. An SDK is a set of development tools used by software engineers to establish applications for particular software packages, software frameworks, hardware platforms, and operating systems. | + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Database | PostgreSQL database | + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | WebApp (Oozie) | Functions as the Oozie server. It can be deployed on a built-in or an external Tomcat container. Information recorded by WebApp (Oozie) including logs is stored in the PostgreSQL database. | + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tomcat | A free open-source web application server | + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Hadoop components | Underlying components, such as MapReduce and Hive, that execute the workflows orchestrated by Oozie. 
| + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Principle +--------- + +Oozie is a workflow engine server that runs MapReduce workflows. It is also a Java web application running in a Tomcat container. + +Oozie workflows are constructed using Hadoop Process Definition Language (HPDL). HPDL is an XML-defined language, similar to JBoss jBPM Process Definition Language (jPDL). An Oozie workflow consists of the Control Node and Action Node. + +- Control Node controls workflow orchestration, such as **start**, **end**, **error**, **decision**, **fork**, and **join**. + +- An Oozie workflow contains multiple Action Nodes, such as MapReduce and Java. + + All Action Nodes are deployed and run in Direct Acyclic Graph (DAG) mode. Therefore, Action Nodes run in direction. That is, the next Action Node can run only when the running of the previous Action Node ends. When one Action Node ends, the remote server calls back the Oozie interface. Then Oozie executes the next Action Node of workflow in the same manner until all Action Nodes are executed (execution failures are counted). + +Oozie workflows provide various types of Action Nodes, such as MapReduce, Hadoop distributed file system (HDFS), Secure Shell (SSH), Java, and Oozie sub-flows, to support a wide range of business requirements. diff --git a/umn/source/overview/components/oozie/oozie_enhanced_open_source_features.rst b/umn/source/overview/components/oozie/oozie_enhanced_open_source_features.rst new file mode 100644 index 0000000..bd7960f --- /dev/null +++ b/umn/source/overview/components/oozie/oozie_enhanced_open_source_features.rst @@ -0,0 +1,13 @@ +:original_name: mrs_08_00672.html + +.. _mrs_08_00672: + +Oozie Enhanced Open Source Features +=================================== + +Enhanced Open Source Feature: Improved Security +----------------------------------------------- + +Provides roles of administrator and common users to support Oozie permission management. + +Supports single sign-on and sign-out, HTTPS access, and audit logs. diff --git a/umn/source/overview/components/opentsdb.rst b/umn/source/overview/components/opentsdb.rst new file mode 100644 index 0000000..301fe70 --- /dev/null +++ b/umn/source/overview/components/opentsdb.rst @@ -0,0 +1,24 @@ +:original_name: mrs_08_0035.html + +.. _mrs_08_0035: + +OpenTSDB +======== + +OpenTSDB is a distributed, scalable time series database based on HBase. OpenTSDB is designed to collect monitoring information of a large-scale cluster and implement second-level data query, eliminating the limitations of querying and storing massive amounts of monitoring data in common databases. + +OpenTSDB consists of a Time Series Daemon (TSD) as well as a set of command line utilities. Interaction with OpenTSDB is primarily implemented by running one or more TSDs. Each TSD is independent. There is no master server and no shared state, so you can run as many TSDs as required to handle any load you throw at it. Each TSD uses HBase in a CloudTable cluster to store and retrieve time series data. The data schema is highly optimized for fast aggregations of similar time series to minimize storage space. TSD users never need to directly access the underlying storage. You can communicate with the TSD through an HTTP API. 
All communications happen on the same port (the TSD figures out the protocol of the client by looking at the first few bytes it receives). + + +.. figure:: /_static/images/en-us_image_0000001296430758.png + :alt: **Figure 1** OpenTSDB architecture + + **Figure 1** OpenTSDB architecture + +Application scenarios of OpenTSDB have the following features: + +- The collected metrics have a unique value at a time point and do not have a complex structure or relationship. +- Monitoring metrics change with time. +- Like HBase, OpenTSDB features high throughput and good scalability. + +OpenTSDB provides an HTTP based application programming interface to enable integration with external systems. Almost all OpenTSDB features are accessible via the API such as querying time series data, managing metadata, and storing data points. For details, visit https://opentsdb.net/docs/build/html/api_http/index.html. diff --git a/umn/source/overview/components/presto.rst b/umn/source/overview/components/presto.rst new file mode 100644 index 0000000..92b2d68 --- /dev/null +++ b/umn/source/overview/components/presto.rst @@ -0,0 +1,29 @@ +:original_name: mrs_08_0031.html + +.. _mrs_08_0031: + +Presto +====== + +Presto is an open source SQL query engine for running interactive analytic queries against data sources of all sizes. It applies to massive structured/semi-structured data analysis, massive multi-dimensional data aggregation/report, ETL, ad-hoc queries, and more scenarios. + +Presto allows querying data where it lives, including HDFS, Hive, HBase, Cassandra, relational databases or even proprietary data stores. A Presto query can combine different data sources to perform data analysis across the data sources. + + +.. figure:: /_static/images/en-us_image_0000001349190389.png + :alt: **Figure 1** Presto architecture + + **Figure 1** Presto architecture + +Presto runs in a cluster in distributed mode and contains one coordinator and multiple worker processes. Query requests are submitted from clients (for example, CLI) to the coordinator. The coordinator parses SQL statements, generates execution plans, and distributes the plans to multiple worker processes for execution. + +For details about Presto, visit https://prestodb.github.io/ or https://prestosql.io/. + +Multiple Presto Instances +------------------------- + +MRS supports the installation of multiple Presto instances for a large-scale cluster by default. That is, multiple Worker instances, such as Worker1, Worker2, and Worker3, are installed on a Core/Task node. Multiple Worker instances interact with the Coordinator to execute computing tasks, greatly improving node resource utilization and computing efficiency. + +Presto multi-instance applies only to the Arm architecture. Currently, a single node supports a maximum of four instances. + +For more Presto deployment information, see https://prestodb.io/docs/current/installation/deployment.html or https://trino.io/docs/current/installation/deployment.html. diff --git a/umn/source/overview/components/ranger/index.rst b/umn/source/overview/components/ranger/index.rst new file mode 100644 index 0000000..622bbd5 --- /dev/null +++ b/umn/source/overview/components/ranger/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_08_0041.html + +.. _mrs_08_0041: + +Ranger +====== + +- :ref:`Ranger Basic Principles ` +- :ref:`Relationship Between Ranger and Other Components ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + ranger_basic_principles + relationship_between_ranger_and_other_components diff --git a/umn/source/overview/components/ranger/ranger_basic_principles.rst b/umn/source/overview/components/ranger/ranger_basic_principles.rst new file mode 100644 index 0000000..beaa1d8 --- /dev/null +++ b/umn/source/overview/components/ranger/ranger_basic_principles.rst @@ -0,0 +1,52 @@ +:original_name: mrs_08_00411.html + +.. _mrs_08_00411: + +Ranger Basic Principles +======================= + +`Apache Ranger `__ offers a centralized security management framework and supports unified authorization and auditing. It manages fine grained access control over Hadoop and related components, such as Storm, HDFS, Hive, HBase, and Kafka. You can use the front-end web UI console provided by Ranger to configure policies to control users' access to these components. + +:ref:`Figure 1 ` shows the Ranger architecture. + +.. _mrs_08_00411__fig3876155913592: + +.. figure:: /_static/images/en-us_image_0000001349190301.png + :alt: **Figure 1** Ranger structure + + **Figure 1** Ranger structure + +.. table:: **Table 1** Architecture description + + +-----------------+------------------------------------------------------------------------------------------------------------------------------+ + | Connection Name | Description | + +=================+==============================================================================================================================+ + | RangerAdmin | Provides a WebUI and RESTful API to manage policies, users, and auditing. | + +-----------------+------------------------------------------------------------------------------------------------------------------------------+ + | UserSync | Periodically synchronizes user and user group information from an external system and writes the information to RangerAdmin. | + +-----------------+------------------------------------------------------------------------------------------------------------------------------+ + | TagSync | Periodically synchronizes tag information from the external Atlas service and writes the tag information to RangerAdmin. | + +-----------------+------------------------------------------------------------------------------------------------------------------------------+ + +Ranger Principles +----------------- + +- Ranger Plugins + + Ranger provides policy-based access control (PBAC) plug-ins to replace the original authentication plug-ins of the components. Ranger plug-ins are developed based on the authentication interface of the components. Users set permission policies for specified services on the Ranger web UI. Ranger plug-ins periodically update policies from the RangerAdmin and caches them in the local file of the component. When a client request needs to be authenticated, the Ranger plug-in matches the user carried in the request with the policy and then returns an accept or reject message. + +- UserSync User Synchronization + + UserSync periodically synchronizes data from LDAP/Unix to RangerAdmin. In security mode, data is synchronized from LDAP. In non-security mode, data is synchronized from Unix. By default, the incremental synchronization mode is used. In each synchronization period, UserSync updates only new or modified users and user groups. When a user or user group is deleted, UserSync does not synchronize the change to RangerAdmin. That is, the user or user group is not deleted from the RangerAdmin. 
To improve performance, UserSync does not synchronize user groups to which no user belongs to RangerAdmin. + +- Unified auditing + + Ranger plug-ins can record audit logs. Currently, audit logs can be stored in local files. + +- High reliability + + Ranger supports two RangerAdmins working in active/active mode. Two RangerAdmins provide services at the same time. If either RangerAdmin is faulty, Ranger continues to work. + +- High performance + + Ranger provides the Load-Balance capability. When a user accesses Ranger WebUI using a browser, the Load-Balance automatically selects the RangerAdmin with the lightest load to provide services. diff --git a/umn/source/overview/components/ranger/relationship_between_ranger_and_other_components.rst b/umn/source/overview/components/ranger/relationship_between_ranger_and_other_components.rst new file mode 100644 index 0000000..cf218bb --- /dev/null +++ b/umn/source/overview/components/ranger/relationship_between_ranger_and_other_components.rst @@ -0,0 +1,14 @@ +:original_name: mrs_08_004102.html + +.. _mrs_08_004102: + +Relationship Between Ranger and Other Components +================================================ + +Ranger provides PABC-based authentication plug-ins for components to run on their servers. Ranger currently supports authentication for the following components like HDFS, YARN, Hive, HBase, Kafka, Storm, and Spark2x. More components will be supported in the future. + + +.. figure:: /_static/images/en-us_image_0000001349309953.png + :alt: **Figure 1** Relationship Between Ranger and Other Components + + **Figure 1** Relationship Between Ranger and Other Components diff --git a/umn/source/overview/components/spark/basic_principles_of_spark.rst b/umn/source/overview/components/spark/basic_principles_of_spark.rst new file mode 100644 index 0000000..fd6b363 --- /dev/null +++ b/umn/source/overview/components/spark/basic_principles_of_spark.rst @@ -0,0 +1,514 @@ +:original_name: mrs_08_00081.html + +.. _mrs_08_00081: + +Basic Principles of Spark +========================= + +.. note:: + + The Spark component applies to versions earlier than MRS 3.x. + +Description +----------- + +`Spark `__ is an open source parallel data processing framework. It helps you easily develop unified big data applications and perform offline processing, stream processing, and interactive analysis on data. + +Spark provides a framework featuring fast computing, write, and interactive query. Spark has obvious advantages over Hadoop in terms of performance. Spark uses the in-memory computing mode to avoid I/O bottlenecks in scenarios where multiple tasks in a MapReduce workflow process the same dataset. Spark is implemented by using Scala programming language. Scala enables distributed datasets to be processed in a method that is the same as that of processing local data. In addition to interactive data analysis, Spark supports interactive data mining. Spark adopts in-memory computing, which facilitates iterative computing. By coincidence, iterative computing of the same data is a general problem facing data mining. In addition, Spark can run in Yarn clusters where Hadoop 2.0 is installed. The reason why Spark cannot only retain various features like MapReduce fault tolerance, data localization, and scalability but also ensure high performance and avoid busy disk I/Os is that a memory abstraction structure called Resilient Distributed Dataset (RDD) is created for Spark. 
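+
+The following is a minimal sketch (the input path and the computation itself are only illustrative) of how the in-memory RDD abstraction described above supports iterative computation: the dataset is read once, kept in memory with **cache()**, and then reused by every iteration instead of being re-read from disk.
+
+.. code-block::
+
+   import org.apache.spark.{SparkConf, SparkContext}
+
+   val sc = new SparkContext(new SparkConf().setAppName("rdd-cache-sketch"))
+
+   // Read the dataset once and keep it in memory for reuse across iterations.
+   val points = sc.textFile("hdfs:///tmp/example/points.txt")
+     .map(_.split(",").map(_.toDouble))
+     .cache()
+
+   // Each iteration reuses the cached RDD instead of re-reading it from HDFS.
+   for (i <- 1 to 10) {
+     val total = points.map(_.sum).reduce(_ + _)
+     println(s"iteration $i: $total")
+   }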
+ +Original distributed memory abstraction, for example, key-value store and databases, supports small-granularity update of variable status. This requires backup of data or log updates to ensure fault tolerance. Consequently, a large amount of I/O consumption is brought about to data-intensive workflows. For the RDD, it has only one set of restricted APIs and only supports large-granularity update, for example, map and join. In this way, Spark only needs to record the transformation operation logs generated during data establishment to ensure fault tolerance without recording a complete dataset. This data transformation link record is a source for tracing a data set. Generally, parallel applications apply the same computing process for a large dataset. Therefore, the limit to the mentioned large-granularity update is not large. As described in Spark theses, the RDD can function as multiple different computing frameworks, for example, programming models of MapReduce and Pregel. In addition, Spark allows you to explicitly make a data transformation process be persistent on hard disks. Data localization is implemented by allowing you to control data partitions based on the key value of each record. (An obvious advantage of this method is that two copies of data to be associated will be hashed in the same mode.) If memory usage exceeds the physical limit, Spark writes relatively large partitions into hard disks, thereby ensuring scalability. + +Spark has the following features: + +- Fast: The data processing speed of Spark is 10 to 100 times higher than that of MapReduce. +- Easy-to-use: Java, Scala, and Python can be used to simply and quickly compile parallel applications for processing massive amounts of data. Spark provides over 80 operators to help you compile parallel applications. +- Universal: Spark provides many tools, for example, `Spark SQL `__ and `Spark Streaming `__. These tools can be combined flexibly in an application. + +- Integration with Hadoop: Spark can directly run in a Hadoop cluster and read existing Hadoop data. + +The Spark component of MRS has the following advantages: + +- The Spark Streaming component of MRS supports real-time data processing rather than triggering as scheduled. +- The Spark component of MRS provides Structured Streaming and allows you to build streaming applications using the Dataset API. Spark supports exactly-once semantics and inner and outer joins for streams. +- The Spark component of MRS uses **pandas_udf** to replace the original user-defined functions (UDFs) in PySpark to process data, which reduces the processing duration by 60% to 90% (affected by specific operations). +- The Spark component of MRS also supports graph data processing and allows modeling using graphs during graph computing. +- Spark SQL of MRS is compatible with some Hive syntax (based on the 64 SQL statements of the Hive-Test-benchmark test set) and standard SQL syntax (based on the 99 SQL statements of the TPC-DS test set). + +For details about Spark architecture and principles, visit https://spark.apache.org/docs/3.1.1/quick-start.html. + +Architecture +------------ + +:ref:`Figure 1 ` describes the Spark architecture and :ref:`Table 1 ` lists the Spark modules. + +.. _mrs_08_00081__f1193616e3b7f46f7a350672c9d3dec9d: + +.. figure:: /_static/images/en-us_image_0000001296750270.png + :alt: **Figure 1** Spark architecture + + **Figure 1** Spark architecture + +.. _mrs_08_00081__t386bd3ecfd8a42f28d5c5a8f64ca341c: + +.. 
table:: **Table 1** Basic concepts + + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Description | + +=================+====================================================================================================================================================================================================================================================================+ + | Cluster Manager | Cluster manager manages resources in the cluster. Spark supports multiple cluster managers, including Mesos, Yarn, and the Standalone cluster manager that is delivered with Spark. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Application | Spark application. It consists of one Driver Program and multiple executors. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Deploy Mode | Deployment in cluster or client mode. In cluster mode, the driver runs on a node inside the cluster. In client mode, the driver runs on the client (outside the cluster). | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Driver Program | The main process of the Spark application. It runs the **main()** function of an application and creates SparkContext. It is used for parsing applications, generating stages, and scheduling tasks to executors. Usually, SparkContext represents Driver Program. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Executor | A process started on a Work Node. It is used to execute tasks, and manage and process the data used in applications. A Spark application usually contains multiple executors. Each executor receives commands from the driver and executes one or multiple tasks. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Worker Node | A node that starts and manages executors and resources in a cluster. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Job | A job consists of multiple concurrent tasks. 
One action operator (for example, a collect operator) maps to one job. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Stage | Each job consists of multiple stages. Each stage is a task set, which is separated by Directed Acyclic Graph (DAG). | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Task | A task carries the computation unit of the service logics. It is the minimum working unit that can be executed on the Spark platform. An application can be divided into multiple tasks based on the execution plan and computation amount. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Spark Application Running Principle +----------------------------------- + +:ref:`Figure 2 ` shows the Spark application running architecture. The running process is as follows: + +#. An application is running in the cluster as a collection of processes. Driver coordinates the running of the application. +#. To run an application, Driver connects to the cluster manager (such as Standalone, Mesos, and Yarn) to apply for the executor resources, and start ExecutorBackend. The cluster manager schedules resources between different applications. Driver schedules DAGs, divides stages, and generates tasks for the application at the same time. +#. Then, Spark sends the codes of the application (the codes transferred to `SparkContext `__, which is defined by JAR or Python) to an executor. +#. After all tasks are finished, the running of the user application is stopped. + +.. _mrs_08_00081__f871db8b05ff94680980627313c214b86: + +.. figure:: /_static/images/en-us_image_0000001296590654.png + :alt: **Figure 2** Spark application running architecture + + **Figure 2** Spark application running architecture + +:ref:`Figure 3 ` shows the Master and Worker modes adopted by Spark. A user submits an application on the Spark client, and then the scheduler divides a job into multiple tasks and sends the tasks to each Worker for execution. Each Worker reports the computation results to Driver (Master), and then the Driver aggregates and returns the results to the client. + +.. _mrs_08_00081__f4f5e5a0249ce48cdac64737ea635d8da: + +.. figure:: /_static/images/en-us_image_0000001296270838.png + :alt: **Figure 3** Spark Master-Worker mode + + **Figure 3** Spark Master-Worker mode + +Note the following about the architecture: + +- Applications are isolated from each other. + + Each application has an independent executor process, and each executor starts multiple threads to execute tasks in parallel. Whether in terms of scheduling or task running on executors. Each driver independently schedules its own tasks. Different application tasks run on different JVMs, that is, different executors. + +- Different Spark applications do not share data, unless data is stored in the external storage system such as HDFS. 
+ +- You are advised to deploy the Driver program in a location that is close to the Worker node because the Driver program schedules tasks in the cluster. For example, deploy the Driver program on the network where the Worker node is located. + +Spark on YARN can be deployed in two modes: + +- In Yarn-cluster mode, the Spark driver runs inside an ApplicationMaster process which is managed by Yarn in the cluster. After the ApplicationMaster is started, the client can exit without interrupting service running. +- In Yarn-client mode, the driver is started in the client process, and the ApplicationMaster process is used only to apply for resources from the Yarn cluster. + +Spark Streaming Principle +------------------------- + +Spark Streaming is a real-time computing framework built on the Spark, which expands the capability for processing massive streaming data. Currently, Spark supports the following data processing methods: + +- Direct Streaming + + In Direct Streaming approach, Direct API is used to process data. Take Kafka Direct API as an example. Direct API provides offset location that each batch range will read from, which is much simpler than starting a receiver to continuously receive data from Kafka and written data to write-ahead logs (WALs). Then, each batch job is running and the corresponding offset data is ready in Kafka. These offset information can be securely stored in the checkpoint file and read by applications that failed to start. + + + .. figure:: /_static/images/en-us_image_0000001349390673.png + :alt: **Figure 4** Data transmission through Direct Kafka API + + **Figure 4** Data transmission through Direct Kafka API + + After the failure, Spark Streaming can read data from Kafka again and process the data segment. The processing result is the same no matter Spark Streaming fails or not, because the semantic is processed only once. + + Direct API does not need to use the WAL and Receivers, and ensures that each Kafka record is received only once, which is more efficient. In this way, the Spark Streaming and Kafka can be well integrated, making streaming channels be featured with high fault-tolerance, high efficiency, and ease-of-use. Therefore, you are advised to use Direct Streaming to process data. + +- Receiver + + When a Spark Streaming application starts (that is, when the driver starts), the related StreamingContext (the basis of all streaming functions) uses SparkContext to start the receiver to become a long-term running task. These receivers receive and save streaming data to the Spark memory for processing. :ref:`Figure 5 ` shows the data transfer lifecycle. + + .. _mrs_08_00081__f9c8691e22ba04d57bcc3c758ff0138f3: + + .. figure:: /_static/images/en-us_image_0000001296270842.png + :alt: **Figure 5** Data transfer lifecycle + + **Figure 5** Data transfer lifecycle + + #. Receive data (blue arrow). + + Receiver divides a data stream into a series of blocks and stores them in the executor memory. In addition, after WAL is enabled, it writes data to the WAL of the fault-tolerant file system. + + #. Notify the driver (green arrow). + + The metadata in the received block is sent to StreamingContext in the driver. The metadata includes: + + - Block reference ID used to locate the data position in the Executor memory. + - Block data offset information in logs (if the WAL function is enabled). + + #. Process data (red arrow). + + For each batch of data, StreamingContext uses block information to generate resilient distributed datasets (RDDs) and jobs. 
StreamingContext executes jobs by running tasks to process blocks in the executor memory. + + #. Periodically set checkpoints (orange arrows). + + For fault tolerance, StreamingContext periodically sets checkpoints and saves them to external file systems. + +**Fault Tolerance** + +Spark and its RDD allow seamless processing of failures of any Worker node in the cluster. Spark Streaming is built on top of Spark. Therefore, the Worker node of Spark Streaming also has the same fault tolerance capability. However, Spark Streaming needs to run properly in case of long-time running. Therefore, Spark must be able to recover from faults through the driver process (main process that coordinates all Workers). This poses challenges to the Spark driver fault-tolerance because the Spark driver may be any user application implemented in any computation mode. However, Spark Streaming has internal computation architecture. That is, it periodically executes the same Spark computation in each batch data. Such architecture allows it to periodically store checkpoints to reliable storage space and recover them upon the restart of Driver. + +For source data such as files, the Driver recovery mechanism can ensure zero data loss because all data is stored in a fault-tolerant file system such as HDFS. However, for other data sources such as Kafka and Flume, some received data is cached only in memory and may be lost before being processed. This is caused by the distribution operation mode of Spark applications. When the driver process fails, all executors running in the Cluster Manager, together with all data in the memory, are terminated. To avoid such data loss, the WAL function is added to Spark Streaming. + +WAL is often used in databases and file systems to ensure persistence of any data operation. That is, first record an operation to a persistent log and perform this operation on data. If the operation fails, the system is recovered by reading the log and re-applying the preset operation. The following describes how to use WAL to ensure persistence of received data: + +Receiver is used to receive data from data sources such as Kafka. As a long-time running task in Executor, Receiver receives data, and also confirms received data if supported by data sources. Received data is stored in the Executor memory, and Driver delivers a task to Executor for processing. + +After WAL is enabled, all received data is stored to log files in the fault-tolerant file system. Therefore, the received data does not lose even if Spark Streaming fails. Besides, receiver checks correctness of received data only after the data is pre-written into logs. Data that is cached but not stored can be sent again by data sources after the driver restarts. These two mechanisms ensure zero data loss. That is, all data is recovered from logs or re-sent by data sources. + +To enable the WAL function, perform the following operations: + +- Set **streamingContext.checkpoint** to configure the checkpoint directory, which is an HDFS file path used to store streaming checkpoints and WALs. +- Set **spark.streaming.receiver.writeAheadLog.enable** of SparkConf to **true** (the default value is **false**). + +After WAL is enabled, all receivers have the advantage of recovering from reliable received data. You are advised to disable the multi-replica mechanism because the fault-tolerant file system of WAL may also replicate the data. + +.. note:: + + The data receiving throughput is lowered after WAL is enabled. 
All data is written into the fault-tolerant file system. As a result, the write throughput of the file system and the network bandwidth for data replication may become the potential bottleneck. To solve this problem, you are advised to create more receivers to increase the degree of data receiving parallelism or use better hardware to improve the throughput of the fault-tolerant file system. + +**Recovery Process** + +When a failed driver is restarted, restart it as follows: + + +.. figure:: /_static/images/en-us_image_0000001296590658.png + :alt: **Figure 6** Computing recovery process + + **Figure 6** Computing recovery process + +#. Recover computing. (Orange arrow) + + Use checkpoint information to restart Driver, reconstruct SparkContext and restart Receiver. + +#. Recover metadata block. (Green arrow) + + This operation ensures that all necessary metadata blocks are recovered to continue the subsequent computing recovery. + +#. Relaunch unfinished jobs. (Red arrow) + + Recovered metadata is used to generate RDDs and corresponding jobs for interrupted batch processing due to failures. + +#. Read block data saved in logs. (Blue arrow) + + Block data is directly read from WALs during execution of the preceding jobs, and therefore all essential data reliably stored in logs is recovered. + +#. Resend unconfirmed data. (Purple arrow) + + Data that is cached but not stored to logs upon failures is re-sent by data sources, because the receiver does not confirm the data. + +Therefore, by using WALs and reliable Receiver, Spark Streaming can avoid input data loss caused by Driver failures. + +SparkSQL and DataSet Principle +------------------------------ + +**SparkSQL** + + +.. figure:: /_static/images/en-us_image_0000001296430802.png + :alt: **Figure 7** SparkSQL and DataSet + + **Figure 7** SparkSQL and DataSet + +Spark SQL is a module for processing structured data. In Spark application, SQL statements or DataSet APIs can be seamlessly used for querying structured data. + +Spark SQL and DataSet also provide a universal method for accessing multiple data sources such as Hive, CSV, Parquet, ORC, JSON, and JDBC. These data sources also allow data interaction. Spark SQL reuses the Hive frontend processing logic and metadata processing module. With the Spark SQL, you can directly query existing Hive data. + +In addition, Spark SQL also provides API, CLI, and JDBC APIs, allowing diverse accesses to the client. + +**Spark SQL Native DDL/DML** + +In Spark 1.5, lots of Data Definition Language (DDL)/Data Manipulation Language (DML) commands are pushed down to and run on the Hive, causing coupling with the Hive and inflexibility such as unexpected error reports and results. + +Spark 3.1.1 realizes command localization and replaces the Hive with Spark SQL Native DDL/DML to run DDL/DML commands. Additionally, the decoupling from the Hive is realized and commands can be customized. + +**DataSet** + +A DataSet is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. + +The DataFrame is a structured and distributed dataset consisting of multiple columns. The DataFrame is equal to a table in the relationship database or the DataFrame in the R/Python. The DataFrame is the most basic concept in the Spark SQL, which can be created by using multiple methods, such as the structured dataset, Hive table, external database or RDD. 
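+
+As a brief illustration of the multi-source access described above (the paths are placeholders, and an existing SparkSession named **spark** is assumed), the following sketch reads JSON data, queries it with SQL, and writes the result back as Parquet:
+
+.. code-block::
+
+   // Read a JSON data source into a DataFrame and register it as a temporary view.
+   val logs = spark.read.json("hdfs:///tmp/example/logs.json")
+   logs.createOrReplaceTempView("logs")
+
+   // Query the view with SQL and write the result to a Parquet data source.
+   spark.sql("SELECT level, count(*) AS cnt FROM logs GROUP BY level")
+     .write.mode("overwrite").parquet("hdfs:///tmp/example/log_stats")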
+ +Operations available on DataSets are divided into transformations and actions. + +- A transformation operation can generate a new DataSet, + + for example, **map**, **filter**, **select**, and **aggregate (groupBy)**. + +- An action operation can trigger computation and return results, + + for example, **count**, **show**, or write data to the file system. + +You can use either of the following methods to create a DataSet: + +- The most common way is by pointing Spark to some files on storage systems, using the **read** function available on a SparkSession. + + .. code-block:: + + val people = spark.read.parquet("...").as[Person] // Scala + DataSet people = spark.read().parquet("...").as(Encoders.bean(Person.class));//Java + +- You can also create a DataSet using the transformation operation available on an existing one. + + For example, apply the map operation on an existing DataSet to create a DataSet: + + .. code-block:: + + val names = people.map(_.name) // In Scala: names is Dataset. + Dataset names = people.map((Person p) -> p.name, Encoders.STRING)); // Java + +**CLI and JDBCServer** + +In addition to programming APIs, Spark SQL also provides the CLI/JDBC APIs. + +- Both **spark-shell** and **spark-sql** scripts can provide the CLI for debugging. +- JDBCServer provides JDBC APIs. External systems can directly send JDBC requests to calculate and parse structured data. + +SparkSession Principle +---------------------- + +SparkSession is a unified API for Spark programming and can be regarded as a unified entry for reading data. SparkSession provides a single entry point to perform many operations that were previously scattered across multiple classes, and also provides accessor methods to these older classes to maximize compatibility. + +A SparkSession can be created using a builder pattern. The builder will automatically reuse the existing SparkSession if there is a SparkSession; or create a SparkSession if it does not exist. During I/O transactions, the configuration item settings in the builder are automatically synchronized to Spark and Hadoop. + +.. code-block:: + + import org.apache.spark.sql.SparkSession + val sparkSession = SparkSession.builder + .master("local") + .appName("my-spark-app") + .config("spark.some.config.option", "config-value") + .getOrCreate() + +- SparkSession can be used to execute SQL queries on data and return results as DataFrame. + + .. code-block:: + + sparkSession.sql("select * from person").show + +- SparkSession can be used to set configuration items during running. These configuration items can be replaced with variables in SQL statements. + + .. code-block:: + + sparkSession.conf.set("spark.some.config", "abcd") + sparkSession.conf.get("spark.some.config") + sparkSession.sql("select ${spark.some.config}") + +- SparkSession also includes a "catalog" method that contains methods to work with Metastore (data catalog). After this method is used, a dataset is returned, which can be run using the same Dataset API. + + .. code-block:: + + val tables = sparkSession.catalog.listTables() + val columns = sparkSession.catalog.listColumns("myTable") + +- Underlying SparkContext can be accessed by SparkContext API of SparkSession. + + .. code-block:: + + val sparkContext = sparkSession.sparkContext + +Structured Streaming Principle +------------------------------ + +Structured Streaming is a stream processing engine built on the Spark SQL engine. 
You can use the Dataset/DataFrame API in Scala, Java, Python, or R to express streaming aggregations, event-time windows, and stream-stream joins. If streaming data is incrementally and continuously produced, Spark SQL will continue to process the data and synchronize the result to the result set. In addition, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpoints and WALs. + +The core of Structured Streaming is to take streaming data as an incremental database table. Similar to the data block processing model, the streaming data processing model applies query operations on a static database table to streaming computing, and Spark uses standard SQL statements for query, to obtain data from the incremental and unbounded table. + + +.. figure:: /_static/images/en-us_image_0000001349390669.png + :alt: **Figure 8** Unbounded table of Structured Streaming + + **Figure 8** Unbounded table of Structured Streaming + +Each query operation will generate a result table. At each trigger interval, updated data will be synchronized to the result table. Whenever the result table is updated, the updated result will be written into an external storage system. + + +.. figure:: /_static/images/en-us_image_0000001296750274.png + :alt: **Figure 9** Structured Streaming data processing model + + **Figure 9** Structured Streaming data processing model + +Storage modes of Structured Streaming at the output phase are as follows: + +- Complete Mode: The updated result sets are written into the external storage system. The write operation is performed by a connector of the external storage system. +- Append Mode: If an interval is triggered, only added data in the result table will be written into an external system. This is applicable only on the queries where existing rows in the result table are not expected to change. +- Update Mode: If an interval is triggered, only updated data in the result table will be written into an external system, which is the difference between the Complete Mode and Update Mode. + +Basic Concepts +-------------- + +- **RDD** + + Resilient Distributed Dataset (RDD) is a core concept of Spark. It indicates a read-only and partitioned distributed dataset. Partial or all data of this dataset can be cached in the memory and reused between computations. + + **RDD Creation** + + - An RDD can be created from the input of HDFS or other storage systems that are compatible with Hadoop. + - A new RDD can be converted from a parent RDD. + - An RDD can be converted from a collection of datasets through encoding. + + **RDD Storage** + + - You can select different storage levels to store an RDD for reuse. (There are 11 storage levels to store an RDD.) + - By default, the RDD is stored in the memory. When the memory is insufficient, the RDD overflows to the disk. + +- **RDD Dependency** + + The RDD dependency includes the narrow dependency and wide dependency. + + + .. figure:: /_static/images/en-us_image_0000001296430806.png + :alt: **Figure 10** RDD dependency + + **Figure 10** RDD dependency + + - **Narrow dependency**: Each partition of the parent RDD is used by at most one partition of the child RDD. + - **Wide dependency**: Partitions of the child RDD depend on all partitions of the parent RDD. + + The narrow dependency facilitates the optimization. 
Logically, each RDD operator is a fork/join (the join is not the join operator mentioned above but the barrier used to synchronize multiple concurrent tasks); fork the RDD to each partition, and then perform the computation. After the computation, join the results, and then perform the fork/join operation on the next RDD operator. It is uneconomical to directly translate the RDD into physical implementation. The first is that every RDD (even intermediate result) needs to be physicalized into memory or storage, which is time-consuming and occupies much space. The second is that as a global barrier, the join operation is very expensive and the entire join process will be slowed down by the slowest node. If the partitions of the child RDD narrowly depend on that of the parent RDD, the two fork/join processes can be combined to implement classic fusion optimization. If the relationship in the continuous operator sequence is narrow dependency, multiple fork/join processes can be combined to reduce a large number of global barriers and eliminate the physicalization of many RDD intermediate results, which greatly improves the performance. This is called pipeline optimization in Spark. + +- **Transformation and Action (RDD Operations)** + + Operations on RDD include transformation (the return value is an RDD) and action (the return value is not an RDD). :ref:`Figure 11 ` shows the RDD operation process. The transformation is lazy, which indicates that the transformation from one RDD to another RDD is not immediately executed. Spark only records the transformation but does not execute it immediately. The real computation is started only when the action is started. The action returns results or writes the RDD data into the storage system. The action is the driving force for Spark to start the computation. + + .. _mrs_08_00081__f9dd728605ad34d6dbbb494f1a2dac9e8: + + .. figure:: /_static/images/en-us_image_0000001349110505.png + :alt: **Figure 11** RDD operation + + **Figure 11** RDD operation + + The data and operation model of RDD are quite different from those of Scala. + + .. code-block:: + + val file = sc.textFile("hdfs://...") + val errors = file.filter(_.contains("ERROR")) + errors.cache() + errors.count() + + #. The textFile operator reads log files from the HDFS and returns files (as an RDD). + #. The filter operator filters rows with **ERROR** and assigns them to errors (a new RDD). The filter operator is a transformation. + #. The cache operator caches errors for future use. + #. The count operator returns the number of rows of errors. The count operator is an action. + + **Transformation includes the following types:** + + - The RDD elements are regarded as simple elements. + + The input and output has the one-to-one relationship, and the partition structure of the result RDD remains unchanged, for example, map. + + The input and output has the one-to-many relationship, and the partition structure of the result RDD remains unchanged, for example, flatMap (one element becomes a sequence containing multiple elements after map and then flattens to multiple elements). + + The input and output has the one-to-one relationship, but the partition structure of the result RDD changes, for example, union (two RDDs integrates to one RDD, and the number of partitions becomes the sum of the number of partitions of two RDDs) and coalesce (partitions are reduced). 
+ + Operators of some elements are selected from the input, such as filter, distinct (duplicate elements are deleted), subtract (elements only exist in this RDD are retained), and sample (samples are taken). + + - The RDD elements are regarded as key-value pairs. + + Perform the one-to-one calculation on the single RDD, such as mapValues (the partition mode of the source RDD is retained, which is different from map). + + Sort the single RDD, such as sort and partitionBy (partitioning with consistency, which is important to the local optimization). + + Restructure and reduce the single RDD based on key, such as groupByKey and reduceByKey. + + Join and restructure two RDDs based on the key, such as join and cogroup. + + .. note:: + + The later three operations involving sorting are called shuffle operations. + + **Action includes the following types:** + + - Generate scalar configuration items, such as **count** (the number of elements in the returned RDD), **reduce**, **fold/aggregate** (the number of scalar configuration items that are returned), and **take** (the number of elements before the return). + - Generate the Scala collection, such as **collect** (import all elements in the RDD to the Scala collection) and **lookup** (look up all values corresponds to the key). + - Write data to the storage, such as **saveAsTextFile** (which corresponds to the preceding **textFile**). + - Check points, such as the **checkpoint** operator. When Lineage is quite long (which occurs frequently in graphics computation), it takes a long period of time to execute the whole sequence again when a fault occurs. In this case, checkpoint is used as the check point to write the current data to stable storage. + +- **Shuffle** + + Shuffle is a specific phase in the MapReduce framework, which is located between the Map phase and the Reduce phase. If the output results of Map are to be used by Reduce, the output results must be hashed based on a key and distributed to each Reducer. This process is called Shuffle. Shuffle involves the read and write of the disk and the transmission of the network, so that the performance of Shuffle directly affects the operation efficiency of the entire program. + + The figure below shows the entire process of the MapReduce algorithm. + + + .. figure:: /_static/images/en-us_image_0000001349309957.png + :alt: **Figure 12** Algorithm process + + **Figure 12** Algorithm process + + Shuffle is a bridge to connect data. The following describes the implementation of shuffle in Spark. + + Shuffle divides a job of Spark into multiple stages. The former stages contain one or more ShuffleMapTasks, and the last stage contains one or more ResultTasks. + +- **Spark Application Structure** + + The Spark application structure includes the initialized SparkContext and the main program. + + - Initialized SparkContext: constructs the operating environment of the Spark Application. + + Constructs the SparkContext object. The following is an example: + + .. code-block:: + + new SparkContext(master, appName, [SparkHome], [jars]) + + Parameter description: + + **master**: indicates the link string. The link modes include local, Yarn-cluster, and Yarn-client. + + **appName**: indicates the application name. + + **SparkHome**: indicates the directory where Spark is installed in the cluster. + + **jars**: indicates the code and dependency package of an application. + + - Main program: processes data. 
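+
+  For reference, the SparkContext construction described above might look as follows in a Scala application. This is a minimal sketch with placeholder values; the master, application name, Spark home, and JAR path must be replaced with values for your environment.
+
+  .. code-block::
+
+     import org.apache.spark.SparkContext
+
+     // Placeholder values: "local" master, example application name,
+     // Spark installation directory, and application JAR path.
+     val sc = new SparkContext(
+       "local",                        // master
+       "MyApp",                        // appName
+       System.getenv("SPARK_HOME"),    // SparkHome
+       Seq("/opt/client/myapp.jar"))   // jars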
+ + For details about how to submit an application, visit https://spark.apache.org/docs/3.1.1/submitting-applications.html. + +- **Spark Shell Commands** + + The basic Spark shell commands support the submission of Spark applications. The Spark shell commands are as follows: + + .. code-block:: + + ./bin/spark-submit \ + --class \ + --master \ + ... # other options + \ + [application-arguments] + + Parameter description: + + **--class**: indicates the name of the class of a Spark application. + + **--master**: indicates the master to which the Spark application links, such as Yarn-client and Yarn-cluster. + + **application-jar**: indicates the path of the JAR file of the Spark application. + + **application-arguments**: indicates the parameter required to submit the Spark application. This parameter can be left blank. + +- **Spark JobHistory Server** + + The Spark web UI is used to monitor the details in each phase of the Spark framework of a running or historical Spark job and provide the log display, which helps users to develop, configure, and optimize the job in more fine-grained units. diff --git a/umn/source/overview/components/spark/index.rst b/umn/source/overview/components/spark/index.rst new file mode 100644 index 0000000..4063040 --- /dev/null +++ b/umn/source/overview/components/spark/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_08_0008.html + +.. _mrs_08_0008: + +Spark +===== + +- :ref:`Basic Principles of Spark ` +- :ref:`Spark HA Solution ` +- :ref:`Relationship Among Spark, HDFS, and Yarn ` +- :ref:`Spark Enhanced Open Source Feature: Optimized SQL Query of Cross-Source Data ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + basic_principles_of_spark + spark_ha_solution + relationship_among_spark,_hdfs,_and_yarn + spark_enhanced_open_source_feature_optimized_sql_query_of_cross-source_data diff --git a/umn/source/overview/components/spark/relationship_among_spark,_hdfs,_and_yarn.rst b/umn/source/overview/components/spark/relationship_among_spark,_hdfs,_and_yarn.rst new file mode 100644 index 0000000..88e8300 --- /dev/null +++ b/umn/source/overview/components/spark/relationship_among_spark,_hdfs,_and_yarn.rst @@ -0,0 +1,111 @@ +:original_name: mrs_08_00083.html + +.. _mrs_08_00083: + +Relationship Among Spark, HDFS, and Yarn +======================================== + +Relationship Between Spark and HDFS +----------------------------------- + +Data computed by Spark comes from multiple data sources, such as local files and HDFS. Most data computed by Spark comes from the HDFS. The HDFS can read data in large scale for parallel computing. After being computed, data can be stored in the HDFS. + +Spark involves Driver and Executor. Driver schedules tasks and Executor runs tasks. + +:ref:`Figure 1 ` shows the process of reading a file. + +.. _mrs_08_00083__fa4b797ba877b48a9887616253937f2ff: + +.. figure:: /_static/images/en-us_image_0000001296270798.png + :alt: **Figure 1** File reading process + + **Figure 1** File reading process + +The file reading process is as follows: + +#. Driver interconnects with the HDFS to obtain the information of File A. +#. The HDFS returns the detailed block information about this file. +#. Driver sets a parallel degree based on the block data amount, and creates multiple tasks to read the blocks of this file. +#. Executor runs the tasks and reads the detailed blocks as part of the Resilient Distributed Dataset (RDD). + +:ref:`Figure 2 ` shows the process of writing data to a file. + +.. _mrs_08_00083__fd0c7d400d05a40c1b12b1a5f9921b09a: + +.. 
figure:: /_static/images/en-us_image_0000001349390629.png
+   :alt: **Figure 2** File writing process
+
+   **Figure 2** File writing process
+
+The file writing process is as follows:
+
+#. .. _mrs_08_00083__la0d49754431847d9ba121414f074589b:
+
+   Driver creates a directory where the file is to be written.
+
+#. Based on the RDD distribution, the number of tasks related to data writing is computed, and these tasks are sent to Executor.
+
+#. Executor runs these tasks and writes the RDD data to the directory created in :ref:`1 `.
+
+Relationship Between Spark and Yarn
+-----------------------------------
+
+Spark computing and scheduling can be implemented in Yarn mode. Spark uses the computing resources provided by Yarn clusters and runs tasks in a distributed way. Spark on Yarn has two modes: Yarn-cluster and Yarn-client.
+
+- Yarn-cluster mode
+
+  :ref:`Figure 3 ` shows the running framework of Spark on Yarn-cluster.
+
+  .. _mrs_08_00083__ffe758bc524de483aa66394ea1cb6cb62:
+
+  .. figure:: /_static/images/en-us_image_0000001296590618.png
+     :alt: **Figure 3** Spark on Yarn-cluster operation framework
+
+     **Figure 3** Spark on Yarn-cluster operation framework
+
+  Spark on Yarn-cluster implementation process:
+
+  #. The client generates the application information and sends the information to ResourceManager.
+
+  #. ResourceManager allocates the first container (ApplicationMaster) to the Spark application and starts the driver in the container.
+
+  #. ApplicationMaster applies for resources from ResourceManager to run containers.
+
+     ResourceManager allocates the container to ApplicationMaster, which communicates with the NodeManager and starts the executor in the obtained container. After the executor is started, it registers with the driver and applies for tasks.
+
+  #. The driver allocates tasks to the executor.
+
+  #. The executor runs tasks and reports the operating status to the driver.
+
+- Yarn-client mode
+
+  :ref:`Figure 4 ` shows the running framework of Spark on Yarn-client.
+
+  .. _mrs_08_00083__f6c6ae283e91d48c98c7d7a2a59379989:
+
+  .. figure:: /_static/images/en-us_image_0000001296430766.png
+     :alt: **Figure 4** Spark on Yarn-client operation framework
+
+     **Figure 4** Spark on Yarn-client operation framework
+
+  Spark on Yarn-client implementation process:
+
+  .. note::
+
+     In Yarn-client mode, the Driver is deployed and started on the client. In this mode, clients of earlier versions are incompatible. You are advised to use the Yarn-cluster mode.
+
+  #. The client sends a Spark application request to ResourceManager, and ResourceManager returns the results, including information such as the application ID and the maximum and minimum available resources. The client packages all the information required to start ApplicationMaster and sends it to ResourceManager.
+
+  #. After receiving the request, ResourceManager finds a proper node for ApplicationMaster and starts it on this node. ApplicationMaster is a role in Yarn, and the corresponding process name in Spark is ExecutorLauncher.
+
+  #. Based on the resource requirements of each task, ApplicationMaster applies to ResourceManager for a series of containers to run the tasks.
+
+  #. After receiving the newly allocated container list from ResourceManager, ApplicationMaster sends information to the related NodeManagers to start the containers.
+ + ResourceManager allocates the containers to ApplicationMaster, which communicates with the related NodeManagers, and starts the executors in the obtained containers. After the executors are started, it registers with drivers and applies for tasks. + + .. note:: + + Running containers are not suspended and resources are not released. + + #. The drivers allocate tasks to the executors. The executor executes tasks and reports the operating status to the driver. diff --git a/umn/source/overview/components/spark/spark_enhanced_open_source_feature_optimized_sql_query_of_cross-source_data.rst b/umn/source/overview/components/spark/spark_enhanced_open_source_feature_optimized_sql_query_of_cross-source_data.rst new file mode 100644 index 0000000..e15b66b --- /dev/null +++ b/umn/source/overview/components/spark/spark_enhanced_open_source_feature_optimized_sql_query_of_cross-source_data.rst @@ -0,0 +1,91 @@ +:original_name: mrs_08_00084.html + +.. _mrs_08_00084: + +Spark Enhanced Open Source Feature: Optimized SQL Query of Cross-Source Data +============================================================================ + +Scenario +-------- + +Enterprises usually store massive data, such as from various databases and warehouses, for management and information collection. However, diversified data sources, hybrid dataset structures, and scattered data storage lower query efficiency. + +The open source Spark only supports simple filter pushdown during querying of multi-source data. The SQL engine performance is deteriorated due of a large amount of unnecessary data transmission. The pushdown function is enhanced, so that **aggregate**, complex **projection**, and complex **predicate** can be pushed to data sources, reducing unnecessary data transmission and improving query performance. + +Only the JDBC data source supports pushdown of query operations, such as **aggregate**, **projection**, **predicate**, **aggregate over inner join**, and **aggregate over union all**. All pushdown operations can be enabled based on your requirements. + +.. table:: **Table 1** Enhanced query of cross-source query + + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Before Enhancement | After Enhancement | + +===========================+====================================================================================================================================+===============================================================================================================================================================================================================================================================+ + | aggregate | The pushdown of **aggregate** is not supported. | - Aggregation functions including **sum**, **avg**, **max**, **min**, and **count** are supported. | + | | | | + | | | Example: select count(``*``) from table | + | | | | + | | | - Internal expressions of aggregation functions are supported. | + | | | | + | | | Example: select sum(a+b) from table | + | | | | + | | | - Calculation of aggregation functions is supported. 
Example: select avg(a) + max(b) from table | + | | | | + | | | - Pushdown of **having** is supported. | + | | | | + | | | Example: select sum(a) from table where a>0 group by b having sum(a)>10 | + | | | | + | | | - Pushdown of some functions is supported. | + | | | | + | | | Pushdown of lines in mathematics, time, and string functions, such as **abs()**, **month()**, and **length()** are supported. In addition to the preceding built-in functions, you can run the **SET** command to add functions supported by data sources. | + | | | | + | | | Example: select sum(abs(a)) from table | + | | | | + | | | - Pushdown of **limit** and **order by** after **aggregate** is supported. However, the pushdown is not supported in Oracle, because Oracle does not support **limit**. | + | | | | + | | | Example: select sum(a) from table where a>0 group by b order by sum(a) limit 5 | + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | projection | Only pushdown of simple **projection** is supported. Example: select a, b from table | - Complex expressions can be pushed down. | + | | | | + | | | Example: select (a+b)*c from table | + | | | | + | | | - Some functions can be pushed down. For details, see the description below the table. | + | | | | + | | | Example: select length(a)+abs(b) from table | + | | | | + | | | - Pushdown of **limit** and **order by** after **projection** is supported. | + | | | | + | | | Example: select a, b+c from table order by a limit 3 | + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | predicate | Only simple filtering with the column name on the left of the operator and values on the right is supported. Example: | - Complex expression pushdown is supported. | + | | | | + | | select \* from table where a>0 or b in ("aaa", "bbb") | Example: select \* from table where a+b>c*d or a/c in (1, 2, 3) | + | | | | + | | | - Some functions can be pushed down. For details, see the description below the table. | + | | | | + | | | Example: select \* from table where length(a)>5 | + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | aggregate over inner join | Related data from the two tables must be loaded to Spark. The join operation must be performed before the **aggregate** operation. | The following functions are supported: | + | | | | + | | | - Aggregation functions including **sum**, **avg**, **max**, **min**, and **count** are supported. 
| + | | | - All **aggregate** operations can be performed in a same table. The **group by** operations can be performed on one or two tables and only inner join is supported. | + | | | | + | | | The following scenarios are not supported: | + | | | | + | | | - **aggregate** cannot be pushed down from both the left- and right-join tables. | + | | | - **aggregate** contains operations, for example, sum(a+b). | + | | | - **aggregate** operations, for example, sum(a)+min(b). | + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | aggregate over union all | Related data from the two tables must be loaded to Spark. **union** must be performed before **aggregate**. | Supported scenarios: | + | | | | + | | | Aggregation functions including **sum**, **avg**, **max**, **min**, and **count** are supported. | + | | | | + | | | Unsupported scenarios: | + | | | | + | | | - **aggregate** contains operations, for example, sum(a+b). | + | | | - **aggregate** operations, for example, sum(a)+min(b). | + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Precautions +----------- + +- If external data source is Hive, query operation cannot be performed on foreign tables created by Spark. +- Only MySQL and MPPDB data sources are supported. diff --git a/umn/source/overview/components/spark/spark_ha_solution.rst b/umn/source/overview/components/spark/spark_ha_solution.rst new file mode 100644 index 0000000..5220cd3 --- /dev/null +++ b/umn/source/overview/components/spark/spark_ha_solution.rst @@ -0,0 +1,200 @@ +:original_name: mrs_08_00082.html + +.. _mrs_08_00082: + +Spark HA Solution +================= + +Spark Multi-Active Instance HA Principles and Implementation Solution +--------------------------------------------------------------------- + +Based on existing JDBCServer in the community, multi-active-instance mode is used to achieve HA. In this mode, multiple JDBCServers coexist in the cluster and the client can randomly connect any JDBCServer to perform service operations. When one or multiple JDBCServers stop working, a client can connect to another normal JDBCServer. + +Compared with active/standby HA mode, multi-active instance mode has following advantages: + +- In active/standby HA, when the active/standby switchover occurs, the unavailable period cannot be controlled by JDBCServer, but it depends on Yarn service resources. +- In Spark, the Thrift JDBC similar to HiveServer2 provides services and users access services through Beeline and JDBC API. Therefore, the processing capability of the JDBCServer cluster depends on the single-point capability of the primary server, and the scalability is insufficient. + +The multi-active instance HA mode not only can prevent service interruption caused by switchover, but also enables cluster scale-out to improve high concurrency. 
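+
+For reference, an application typically reaches the multi-active JDBCServer cluster through the Hive JDBC driver. The following is a minimal sketch for normal (non-security) mode, assuming the Hive JDBC driver is on the classpath; the ZooKeeper addresses are placeholders, and the exact URL formats are described in URL Connection below.
+
+.. code-block::
+
+   import java.sql.DriverManager
+
+   // Placeholder ZooKeeper quorum; a JDBCServer instance registered under the
+   // sparkthriftserver2x namespace is selected at random for the connection.
+   val url = "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;" +
+     "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;"
+
+   val connection = DriverManager.getConnection(url)
+   val statement = connection.createStatement()
+   val resultSet = statement.executeQuery("SELECT 1")
+   while (resultSet.next()) println(resultSet.getInt(1))
+   connection.close()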
+ +- **Implementation** + + The following figure shows the basic principle of multi-active instance HA of Spark JDBCServer. + + + .. figure:: /_static/images/en-us_image_0000001349110489.png + :alt: **Figure 1** Spark JDBCServer HA + + **Figure 1** Spark JDBCServer HA + +#. When a JDBCServer is started, it registers with ZooKeeper by writing node information in a specified directory. Node information includes the instance IP address, port number, version, and serial number. +#. To connect to JDBCServer, the client must specify the namespace, which is the directory of JDBCServer instances in ZooKeeper. During the connection, a JDBCServer instance is randomly selected from the specified namespace. +#. After the connection succeeds, the client sends SQL statements to JDBCServer. +#. JDBCServer executes received SQL statements and returns results to the client. + +If multi-active instance HA of Spark JDBCServer is enabled, all JDBCServer instances are independent and equivalent. When one JDBCServer instance is interrupted during upgrade, other JDBCServer instances can accept the connection request from the client. + +The rules below must be followed in the multi-active instance HA of Spark JDBCServer. + +- If a JDBCServer instance exits abnormally, no other instance will take over the sessions and services running on the abnormal instance. +- When the JDBCServer process is stopped, corresponding nodes are deleted from ZooKeeper. +- The client randomly selects the server, which may result in uneven session allocation caused by random distribution of policy results, and finally result in load imbalance of instances. +- After the instance enters the maintenance mode (in which no new connection requests from clients are accepted), services running on the instance may fail when the decommissioning times out. + +- **URL Connection** + + - Multi-active instance mode + + In multi-active instance mode, the client reads content from the ZooKeeper node and connects to JDBCServer. The connection strings are list below. + + - Security mode: + + If Kinit authentication is enabled, the JDBCURL is as follows: + + .. code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark/hadoop.@; + + .. note:: + + - In the above JDBCURL, **:** indicates the ZooKeeper URL. Use commas (,) to separate multiple URLs, + + Example: 192.168.81.37:2181,192.168.195.232:2181,192.168.169.84:2181. + + - **sparkthriftserver2x** indicates the ZooKeeper directory, where a random JDBCServer instance is connected to the client. + + For example, when you use Beeline client to connect JDBCServer, run the following command: + + **sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:,:,:*\ **/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark/hadoop.**\ **\ **@**\ **\ **;"** + + If Keytab authentication is enabled, the JDBCURL is as follows: + + .. code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark/hadoop.@;user.principal=;user.keytab= + + In the above URL, ** indicates the principal of the Kerberos user, for example, **test@**\ **; ** indicates the Keytab file path corresponding to **, for example, **/opt/auth/test/user.keytab**. + + - Common mode: + + .. 
code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x; + + For example, when you use Beeline client, in normal mode, for connection, run the following command: + + **sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:,:,:*\ **/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;"** + + - Non-multi-active instance mode + + In this mode, a client connects to a specified JDBCServer node. Compared with multi-active instance mode, the connection string in this mode does not contain **serviceDiscoveryMode** and **zooKeeperNamespace** parameters about ZooKeeper. + + For example, when you use Beeline client, in security mode, to connect JDBCServer in non-multi-active instance mode, run the following command: + + **sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:*\ **/;user.principal=spark/hadoop.**\ *@*\ **;saslQop=auth-conf;auth=KERBEROS;principal=spark/hadoop.**\ *@*\ **;"** + + .. note:: + + - In the above command, **:** indicates the URL of the specified JDBCServer node. + - **CLIENT_HOME** indicates the client path. + + Except the connection method, other operations of JDBCServer API in the two modes are the same. Spark JDBCServer is another implementation of HiveServer2 in Hive. For details about how to use Spark JDBCServer, see https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients. + +Spark Multi-Tenant HA +--------------------- + +In the JDBCServer multi-active instance solution, JDBCServer uses the Yarn-client mode, but there is only one Yarn resource queue available. To solve this resource limitation problem, the multi-tenant mode is introduced. + +In multi-tenant mode, JDBCServers are bound with tenants. Each tenant corresponds to one or more JDBCServers, and a JDBCServer provides services for only one tenant. Different tenants can be configured with different Yarn queues to implement resource isolation. In addition, JDBCServer can be dynamically started as required to avoid resource waste. + +- **Implementation** + + :ref:`Figure 2 ` shows the HA solution of the multi-tenant mode. + + .. _mrs_08_00082__fd976abf162d04390bb64dc2ab6d2d226: + + .. figure:: /_static/images/en-us_image_0000001349309945.png + :alt: **Figure 2** Multi-tenant mode of Spark JDBCServer + + **Figure 2** Multi-tenant mode of Spark JDBCServer + + #. When ProxyServer is started, it registers with ZooKeeper by writing node information in a specified directory. Node information includes the instance IP address, port number, version, and serial number. + + .. note:: + + In multi-tenant mode, the JDBCServer instance refers to the ProxyServer (JDBCServer proxy). + + #. To connect to ProxyServer, the client must specify a namespace, which is the directory of the ProxyServer instance where you want to access ZooKeeper. When the client connects to the ProxyServer, a random instance under the namespace is selected for connection. For details about the URL, see :ref:`URL Connection Overview `. + #. After the client successfully connects to the ProxyServer, which first checks whether the JDBCServer of a tenant exists. If yes, Beeline connects the JDBCServer. If no, a new JDBCServer is started in Yarn-cluster mode. After the startup of JDBCServer, ProxyServer obtains the IP address of the JDBCServer and establishes the connection between Beeline and JDBCServer. + #. The client sends SQL statements to ProxyServer, which forwards statements to the connected JDBCServer. 
JDBCServer returns the results to ProxyServer, which then returns the results to the client. + + In the multi-active instance HA mode, all instances are independent and equivalent. If one instance is interrupted during upgrade, other instances can accept the connection request from the client. + +- .. _mrs_08_00082__li4440192917386: + + **URL Connection Overview** + + - Multi-tenant mode + + In multi-tenant mode, the client reads content from the ZooKeeper node and connects to ProxyServer. The connection strings are list below. + + - Security mode: + + If Kinit authentication is enabled, the client URL is as follows: + + .. code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark/hadoop.@; + + .. note:: + + - In the above URL, **:** indicates the ZooKeeper URL. Use commas (,) to separate multiple URLs, + + Example: **192.168.81.37:2181,192.168.195.232:2181,192.168.169.84:2181**. + + - **sparkthriftserver2x** indicates the ZooKeeper directory, where a random JDBCServer instance is connected to the client. + + For example, when you use Beeline client for connection, run the following command: + + **sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:,:,:*\ **/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark/hadoop.**\ **\ **@**\ **\ **;"** + + If Keytab authentication is enabled, the URL is as follows: + + .. code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark/hadoop.@;user.principal=;user.keytab= + + In the above URL, ** indicates the principal of the Kerberos user, for example, **test@**\ **; ** indicates the Keytab file path corresponding to **, for example, **/opt/auth/test/user.keytab**. + + - Common mode: + + .. code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x; + + For example, run the following command when you use Beeline client for connection in normal mode: + + **sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:,:,:*\ **/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;"** + + - Non-multi-tenant mode + + In non-multi-tenant mode, a client connects to a specified JDBCServer node. Compared with multi-tenant instance mode, the connection string in this mode does not contain **serviceDiscoveryMode** and **zooKeeperNamespace** parameters about ZooKeeper. + + For example, when you use Beeline client to connect JDBCServer in non-multi-tenant instance mode, run the following command: + + **sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:*\ **/;user.principal=spark/hadoop.**\ *@*\ **;saslQop=auth-conf;auth=KERBEROS;principal=spark/hadoop.**\ *@*\ **;"** + + .. note:: + + - In the above command, **:** indicates the URL of the specified JDBCServer node. + - **CLIENT_HOME** indicates the client path. + + Except the connection method, other operations of JDBCServer API in multi-tenant mode and non-multi-tenant mode are the same. Spark JDBCServer is another implementation of HiveServer2 in Hive. For details about how to use Spark JDBCServer, go to the official Hive website at https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients. + + **Specifying a Tenant** + + Generally, the client submitted by a user connects to the default JDBCServer of the tenant to which the user belongs. 
If you want to connect the client to the JDBCServer of a specified tenant, add the **--hiveconf mapreduce.job.queuename** parameter. + + If you use Beeline client for connection, run the following command (**aaa** is the tenant name): + + **beeline --hiveconf mapreduce.job.queuename=aaa -u 'jdbc:hive2://192.168.39.30:2181,192.168.40.210:2181,192.168.215.97:2181;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark/hadoop.\ \ @\ ;'** diff --git a/umn/source/overview/components/spark2x/basic_principles_of_spark2x.rst b/umn/source/overview/components/spark2x/basic_principles_of_spark2x.rst new file mode 100644 index 0000000..2e18430 --- /dev/null +++ b/umn/source/overview/components/spark2x/basic_principles_of_spark2x.rst @@ -0,0 +1,509 @@ +:original_name: mrs_08_007101.html + +.. _mrs_08_007101: + +Basic Principles of Spark2x +=========================== + +.. note:: + + The Spark2x component applies to MRS 3.x and later versions. + +Description +----------- + +`Spark `__ is a memory-based distributed computing framework. In iterative computation scenarios, the computing capability of Spark is 10 to 100 times higher than MapReduce, because data is stored in memory when being processed. Spark can use HDFS as the underlying storage system, enabling users to quickly switch to Spark from MapReduce. Spark provides one-stop data analysis capabilities, such as the streaming processing in small batches, offline batch processing, SQL query, and data mining. Users can seamlessly use these functions in a same application. For details about the new open-source features of Spark2x, see :ref:`Spark2x Open Source New Features `. + +Features of Spark are as follows: + +- Improves the data processing capability through distributed memory computing and directed acyclic graph (DAG) execution engine. The delivered performance is 10 to 100 times higher than that of MapReduce. +- Supports multiple development languages (Scala/Java/Python) and dozens of highly abstract operators to facilitate the construction of distributed data processing applications. +- Builds data processing stacks using `SQL `__, `Streaming `__, MLlib, and GraphX to provide one-stop data processing capabilities. +- Fits into the Hadoop ecosystem, allowing Spark applications to run on Standalone, Mesos, or Yarn, enabling access of multiple data sources such as HDFS, HBase, and Hive, and supporting smooth migration of the MapReduce application to Spark. + +Architecture +------------ + +:ref:`Figure 1 ` describes the Spark architecture and :ref:`Table 1 ` lists the Spark modules. + +.. _mrs_08_007101__f5fc385f92b0042f5ae12de6c4ee1d5af: + +.. figure:: /_static/images/en-us_image_0000001296430814.png + :alt: **Figure 1** Spark architecture + + **Figure 1** Spark architecture + +.. _mrs_08_007101__t91e81509880f400c9830f64cf706a81d: + +.. 
table:: **Table 1** Basic concepts + + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Description | + +=================+====================================================================================================================================================================================================================================================================+ + | Cluster Manager | Cluster manager manages resources in the cluster. Spark supports multiple cluster managers, including Mesos, Yarn, and the Standalone cluster manager that is delivered with Spark. By default, Spark clusters adopt the Yarn cluster manager. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Application | Spark application. It consists of one Driver Program and multiple executors. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Deploy Mode | Deployment in cluster or client mode. In cluster mode, the driver runs on a node inside the cluster. In client mode, the driver runs on the client (outside the cluster). | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Driver Program | The main process of the Spark application. It runs the **main()** function of an application and creates SparkContext. It is used for parsing applications, generating stages, and scheduling tasks to executors. Usually, SparkContext represents Driver Program. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Executor | A process started on a Work Node. It is used to execute tasks, and manage and process the data used in applications. A Spark application usually contains multiple executors. Each executor receives commands from the driver and executes one or multiple tasks. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Worker Node | A node that starts and manages executors and resources in a cluster. 
| + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Job | A job consists of multiple concurrent tasks. One action operator (for example, a collect operator) maps to one job. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Stage | Each job consists of multiple stages. Each stage is a task set, which is separated by Directed Acyclic Graph (DAG). | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Task | A task carries the computation unit of the service logics. It is the minimum working unit that can be executed on the Spark platform. An application can be divided into multiple tasks based on the execution plan and computation amount. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Spark Principle +--------------- + +:ref:`Figure 2 ` describes the application running architecture of Spark. + +#. An application is running in the cluster as a collection of processes. Driver coordinates the running of the application. +#. To run an application, Driver connects to the cluster manager (such as Standalone, Mesos, and Yarn) to apply for the executor resources, and start ExecutorBackend. The cluster manager schedules resources between different applications. Driver schedules DAGs, divides stages, and generates tasks for the application at the same time. +#. Then, Spark sends the codes of the application (the codes transferred to `SparkContext `__, which is defined by JAR or Python) to an executor. +#. After all tasks are finished, the running of the user application is stopped. + +.. _mrs_08_007101__f5f9e60b5ec204699b5c44c357b11ba9f: + +.. figure:: /_static/images/en-us_image_0000001296590670.png + :alt: **Figure 2** Spark application running architecture + + **Figure 2** Spark application running architecture + +Spark uses Master and Worker modes, as shown in :ref:`Figure 3 `. A user submits an application on the Spark client, and then the scheduler divides a job into multiple tasks and sends the tasks to each Worker for execution. Each Worker reports the computation results to Driver (Master), and then the Driver aggregates and returns the results to the client. + +.. _mrs_08_007101__fbc0ce41be7ab4095a4ea2f020e192c09: + +.. figure:: /_static/images/en-us_image_0000001349190385.png + :alt: **Figure 3** Spark Master-Worker mode + + **Figure 3** Spark Master-Worker mode + +Note the following about the architecture: + +- Applications are isolated from each other. + + Each application has an independent executor process, and each executor starts multiple threads to execute tasks in parallel. 
Each driver schedules its own tasks, and tasks of different applications run in different JVMs, that is, in different executors.
+
+- Different Spark applications do not share data, unless the data is stored in an external storage system such as HDFS.
+
+- You are advised to deploy the Driver program close to the Worker nodes because the Driver program schedules tasks in the cluster. For example, deploy the Driver program on the network where the Worker nodes are located.
+
+Spark on YARN can be deployed in two modes:
+
+- In Yarn-cluster mode, the Spark driver runs inside an ApplicationMaster process that is managed by Yarn in the cluster. After the ApplicationMaster is started, the client can exit without interrupting the service.
+- In Yarn-client mode, the driver runs in the client process, and the ApplicationMaster process is used only to request resources from Yarn.
+
+Spark Streaming Principle
+-------------------------
+
+Spark Streaming is a real-time computing framework built on top of Spark, which extends the capability for processing massive streaming data. Spark supports two data processing approaches: Direct Streaming and Receiver.
+
+**Direct Streaming computing process**
+
+In the Direct Streaming approach, the Direct API is used to process data. Take the Kafka Direct API as an example. The Direct API provides the offset range that each batch reads, which is much simpler than starting a receiver to continuously receive data from Kafka and write it to write-ahead logs (WALs). Each batch job then runs, and the corresponding offset data is already available in Kafka. This offset information can be reliably stored in the checkpoint file and read by the application when it recovers from a failure.
+
+
+.. figure:: /_static/images/en-us_image_0000001296430810.png
+   :alt: **Figure 4** Data transmission through Direct Kafka API
+
+   **Figure 4** Data transmission through Direct Kafka API
+
+After a failure, Spark Streaming can read the data from Kafka again and process the same data segment. The processing result is the same whether or not Spark Streaming fails, because exactly-once semantics are guaranteed.
+
+The Direct API does not need WALs or receivers, and it ensures that each Kafka record is received only once, which is more efficient. In this way, Spark Streaming integrates well with Kafka, making the streaming channel highly fault-tolerant, efficient, and easy to use. Therefore, you are advised to use Direct Streaming to process data.
+
+**Receiver computing process**
+
+When a Spark Streaming application starts (that is, when the driver starts), the related StreamingContext (the basis of all streaming functions) uses SparkContext to start receivers as long-running tasks. These receivers receive streaming data and save it to Spark memory for processing. :ref:`Figure 5 ` shows the data transfer lifecycle.
+
+.. _mrs_08_007101__f9c8691e22ba04d57bcc3c758ff0138f3:
+
+.. figure:: /_static/images/en-us_image_0000001296270846.png
+   :alt: **Figure 5** Data transfer lifecycle
+
+   **Figure 5** Data transfer lifecycle
+
+#. Receive data (blue arrow).
+
+   The receiver divides the data stream into a series of blocks and stores them in executor memory. In addition, if WAL is enabled, the receiver also writes the data to a WAL in a fault-tolerant file system.
+
+#. Notify the driver (green arrow).
+
+   The metadata of the received blocks is sent to StreamingContext in the driver.
The metadata includes: + + - Block reference ID used to locate the data position in the Executor memory. + - Block data offset information in logs (if the WAL function is enabled). + +#. Process data (red arrow). + + For each batch of data, StreamingContext uses block information to generate resilient distributed datasets (RDDs) and jobs. StreamingContext executes jobs by running tasks to process blocks in the executor memory. + +#. Periodically set checkpoints (orange arrows). + +#. For fault tolerance, StreamingContext periodically sets checkpoints and saves them to external file systems. + +**Fault Tolerance** + +Spark and its RDD allow seamless processing of failures of any Worker node in the cluster. Spark Streaming is built on top of Spark. Therefore, the Worker node of Spark Streaming also has the same fault tolerance capability. However, Spark Streaming needs to run properly in case of long-time running. Therefore, Spark must be able to recover from faults through the driver process (main process that coordinates all Workers). This poses challenges to the Spark driver fault-tolerance because the Spark driver may be any user application implemented in any computation mode. However, Spark Streaming has internal computation architecture. That is, it periodically executes the same Spark computation in each batch data. Such architecture allows it to periodically store checkpoints to reliable storage space and recover them upon the restart of Driver. + +For source data such as files, the Driver recovery mechanism can ensure zero data loss because all data is stored in a fault-tolerant file system such as HDFS. However, for other data sources such as Kafka and Flume, some received data is cached only in memory and may be lost before being processed. This is caused by the distribution operation mode of Spark applications. When the driver process fails, all executors running in the Cluster Manager, together with all data in the memory, are terminated. To avoid such data loss, the WAL function is added to Spark Streaming. + +WAL is often used in databases and file systems to ensure persistence of any data operation. That is, first record an operation to a persistent log and perform this operation on data. If the operation fails, the system is recovered by reading the log and re-applying the preset operation. The following describes how to use WAL to ensure persistence of received data: + +Receiver is used to receive data from data sources such as Kafka. As a long-time running task in Executor, Receiver receives data, and also confirms received data if supported by data sources. Received data is stored in the Executor memory, and Driver delivers a task to Executor for processing. + +After WAL is enabled, all received data is stored to log files in the fault-tolerant file system. Therefore, the received data does not lose even if Spark Streaming fails. Besides, receiver checks correctness of received data only after the data is pre-written into logs. Data that is cached but not stored can be sent again by data sources after the driver restarts. These two mechanisms ensure zero data loss. That is, all data is recovered from logs or re-sent by data sources. + +To enable the WAL function, perform the following operations: + +- Set **streamingContext.checkpoint** (path-to-directory) to configure the checkpoint directory, which is an HDFS file path used to store streaming checkpoints and WALs. 
+- Set **spark.streaming.receiver.writeAheadLog.enable** of SparkConf to **true** (the default value is **false**). + +After WAL is enabled, all receivers have the advantage of recovering from reliable received data. You are advised to disable the multi-replica mechanism because the fault-tolerant file system of WAL may also replicate the data. + +.. note:: + + The data receiving throughput is lowered after WAL is enabled. All data is written into the fault-tolerant file system. As a result, the write throughput of the file system and the network bandwidth for data replication may become the potential bottleneck. To solve this problem, you are advised to create more receivers to increase the degree of data receiving parallelism or use better hardware to improve the throughput of the fault-tolerant file system. + +**Recovery Process** + +When a failed driver is restarted, restart it as follows: + + +.. figure:: /_static/images/en-us_image_0000001349110509.png + :alt: **Figure 6** Computing recovery process + + **Figure 6** Computing recovery process + +#. Recover computing. (Orange arrow) + + Use checkpoint information to restart Driver, reconstruct SparkContext and restart Receiver. + +#. Recover metadata block. (Green arrow) + + This operation ensures that all necessary metadata blocks are recovered to continue the subsequent computing recovery. + +#. Relaunch unfinished jobs. (Red arrow) + + Recovered metadata is used to generate RDDs and corresponding jobs for interrupted batch processing due to failures. + +#. Read block data saved in logs. (Blue arrow) + + Block data is directly read from WALs during execution of the preceding jobs, and therefore all essential data reliably stored in logs is recovered. + +#. Resend unconfirmed data. (Purple arrow) + + Data that is cached but not stored to logs upon failures is re-sent by data sources, because the receiver does not confirm the data. + +Therefore, by using WALs and reliable Receiver, Spark Streaming can avoid input data loss caused by Driver failures. + +.. _mrs_08_007101__s92e24e65bbb14e90a11ac77bda16b394: + +SparkSQL and DataSet Principle +------------------------------ + +**SparkSQL** + + +.. figure:: /_static/images/en-us_image_0000001349190381.png + :alt: **Figure 7** SparkSQL and DataSet + + **Figure 7** SparkSQL and DataSet + +Spark SQL is a module for processing structured data. In Spark application, SQL statements or DataSet APIs can be seamlessly used for querying structured data. + +Spark SQL and DataSet also provide a universal method for accessing multiple data sources such as Hive, CSV, Parquet, ORC, JSON, and JDBC. These data sources also allow data interaction. Spark SQL reuses the Hive frontend processing logic and metadata processing module. With the Spark SQL, you can directly query existing Hive data. + +In addition, Spark SQL also provides API, CLI, and JDBC APIs, allowing diverse accesses to the client. + +**Spark SQL Native DDL/DML** + +In Spark 1.5, lots of Data Definition Language (DDL)/Data Manipulation Language (DML) commands are pushed down to and run on the Hive, causing coupling with the Hive and inflexibility such as unexpected error reports and results. + +Spark2x realizes command localization and replaces the Hive with Spark SQL Native DDL/DML to run DDL/DML commands. Additionally, the decoupling from the Hive is realized and commands can be customized. 
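+
+For illustration, such statements can be issued directly through a SparkSession (assuming **spark** is an existing SparkSession); the table name and values below are examples only:
+
+.. code-block::
+
+   // DDL and DML statements are parsed and executed by Spark SQL itself
+   // instead of being delegated to Hive.
+   spark.sql("CREATE TABLE IF NOT EXISTS people (name STRING, age INT)")
+   spark.sql("INSERT INTO people VALUES ('Alice', 30)")
+   spark.sql("SELECT name, age FROM people WHERE age > 18").show()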
+ +**DataSet** + +A DataSet is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. + +The DataFrame is a structured and distributed dataset consisting of multiple columns. The DataFrame is equal to a table in the relationship database or the DataFrame in the R/Python. The DataFrame is the most basic concept in the Spark SQL, which can be created by using multiple methods, such as the structured dataset, Hive table, external database or RDD. + +Operations available on DataSets are divided into transformations and actions. + +- A transformation operation can generate a new DataSet, + + for example, **map**, **filter**, **select**, and **aggregate (groupBy)**. + +- An action operation can trigger computation and return results, + + for example, **count**, **show**, or write data to the file system. + +You can use either of the following methods to create a DataSet: + +- The most common way is by pointing Spark to some files on storage systems, using the **read** function available on a SparkSession. + + .. code-block:: + + val people = spark.read.parquet("...").as[Person] // Scala + + .. code-block:: + + DataSet people = spark.read().parquet("...").as(Encoders.bean(Person.class));//Java + +- You can also create a DataSet using the transformation operation available on an existing one. For example, apply the map operation on an existing DataSet to create a DataSet: + + .. code-block:: + + val names = people.map(_.name) // In Scala: names is Dataset. + + .. code-block:: + + Dataset names = people.map((Person p) -> p.name, Encoders.STRING)); // Java + +**CLI and JDBCServer** + +In addition to programming APIs, Spark SQL also provides the CLI/JDBC APIs. + +- Both **spark-shell** and **spark-sql** scripts can provide the CLI for debugging. +- JDBCServer provides JDBC APIs. External systems can directly send JDBC requests to calculate and parse structured data. + +.. _mrs_08_007101__s0ca0926d38ac4e2c9ce59d0bb4286a4e: + +SparkSession Principle +---------------------- + +SparkSession is a unified API in Spark2x and can be regarded as a unified entry for reading data. SparkSession provides a single entry point to perform many operations that were previously scattered across multiple classes, and also provides accessor methods to these older classes to maximize compatibility. + +A SparkSession can be created using a builder pattern. The builder will automatically reuse the existing SparkSession if there is a SparkSession; or create a SparkSession if it does not exist. During I/O transactions, the configuration item settings in the builder are automatically synchronized to Spark and Hadoop. + +.. code-block:: + + import org.apache.spark.sql.SparkSession + val sparkSession = SparkSession.builder + .master("local") + .appName("my-spark-app") + .config("spark.some.config.option", "config-value") + .getOrCreate() + +- SparkSession can be used to execute SQL queries on data and return results as DataFrame. + + .. code-block:: + + sparkSession.sql("select * from person").show + +- SparkSession can be used to set configuration items during running. These configuration items can be replaced with variables in SQL statements. + + .. 
code-block:: + + sparkSession.conf.set("spark.some.config", "abcd") + sparkSession.conf.get("spark.some.config") + sparkSession.sql("select ${spark.some.config}") + +- SparkSession also includes a "catalog" method that contains methods to work with Metastore (data catalog). After this method is used, a dataset is returned, which can be run using the same Dataset API. + + .. code-block:: + + val tables = sparkSession.catalog.listTables() + val columns = sparkSession.catalog.listColumns("myTable") + +- Underlying SparkContext can be accessed by SparkContext API of SparkSession. + + .. code-block:: + + val sparkContext = sparkSession.sparkContext + +.. _mrs_08_007101__s5a65faea60814b8f9286a52205188420: + +Structured Streaming Principle +------------------------------ + +Structured Streaming is a stream processing engine built on the Spark SQL engine. You can use the Dataset/DataFrame API in Scala, Java, Python, or R to express streaming aggregations, event-time windows, and stream-stream joins. If streaming data is incrementally and continuously produced, Spark SQL will continue to process the data and synchronize the result to the result set. In addition, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpoints and WALs. + +The core of Structured Streaming is to take streaming data as an incremental database table. Similar to the data block processing model, the streaming data processing model applies query operations on a static database table to streaming computing, and Spark uses standard SQL statements for query, to obtain data from the incremental and unbounded table. + + +.. figure:: /_static/images/en-us_image_0000001349390677.png + :alt: **Figure 8** Unbounded table of Structured Streaming + + **Figure 8** Unbounded table of Structured Streaming + +Each query operation will generate a result table. At each trigger interval, updated data will be synchronized to the result table. Whenever the result table is updated, the updated result will be written into an external storage system. + + +.. figure:: /_static/images/en-us_image_0000001296750282.png + :alt: **Figure 9** Structured Streaming data processing model + + **Figure 9** Structured Streaming data processing model + +Storage modes of Structured Streaming at the output phase are as follows: + +- Complete Mode: The updated result sets are written into the external storage system. The write operation is performed by a connector of the external storage system. +- Append Mode: If an interval is triggered, only added data in the result table will be written into an external system. This is applicable only on the queries where existing rows in the result table are not expected to change. +- Update Mode: If an interval is triggered, only updated data in the result table will be written into an external system, which is the difference between the Complete Mode and Update Mode. + +Concepts +-------- + +- **RDD** + + Resilient Distributed Dataset (RDD) is a core concept of Spark. It indicates a read-only and partitioned distributed dataset. Partial or all data of this dataset can be cached in the memory and reused between computations. + + **RDD Creation** + + - An RDD can be created from the input of HDFS or other storage systems that are compatible with Hadoop. + - A new RDD can be converted from a parent RDD. + - An RDD can be converted from a collection of datasets through encoding. + + **RDD Storage** + + - You can select different storage levels to store an RDD for reuse. 
(There are 11 storage levels to store an RDD.) + - By default, the RDD is stored in the memory. When the memory is insufficient, the RDD overflows to the disk. + +- **RDD Dependency** + + The RDD dependency includes the narrow dependency and wide dependency. + + + .. figure:: /_static/images/en-us_image_0000001296590666.png + :alt: **Figure 10** RDD dependency + + **Figure 10** RDD dependency + + - **Narrow dependency**: Each partition of the parent RDD is used by at most one partition of the child RDD. + - **Wide dependency**: Partitions of the child RDD depend on all partitions of the parent RDD. + + The narrow dependency facilitates the optimization. Logically, each RDD operator is a fork/join (the join is not the join operator mentioned above but the barrier used to synchronize multiple concurrent tasks); fork the RDD to each partition, and then perform the computation. After the computation, join the results, and then perform the fork/join operation on the next RDD operator. It is uneconomical to directly translate the RDD into physical implementation. The first is that every RDD (even intermediate result) needs to be physicalized into memory or storage, which is time-consuming and occupies much space. The second is that as a global barrier, the join operation is very expensive and the entire join process will be slowed down by the slowest node. If the partitions of the child RDD narrowly depend on that of the parent RDD, the two fork/join processes can be combined to implement classic fusion optimization. If the relationship in the continuous operator sequence is narrow dependency, multiple fork/join processes can be combined to reduce a large number of global barriers and eliminate the physicalization of many RDD intermediate results, which greatly improves the performance. This is called pipeline optimization in Spark. + +- **Transformation and Action (RDD Operations)** + + Operations on RDD include transformation (the return value is an RDD) and action (the return value is not an RDD). :ref:`Figure 11 ` shows the RDD operation process. The transformation is lazy, which indicates that the transformation from one RDD to another RDD is not immediately executed. Spark only records the transformation but does not execute it immediately. The real computation is started only when the action is started. The action returns results or writes the RDD data into the storage system. The action is the driving force for Spark to start the computation. + + .. _mrs_08_007101__f9dd728605ad34d6dbbb494f1a2dac9e8: + + .. figure:: /_static/images/en-us_image_0000001296270850.png + :alt: **Figure 11** RDD operation + + **Figure 11** RDD operation + + The data and operation model of RDD are quite different from those of Scala. + + .. code-block:: + + val file = sc.textFile("hdfs://...") + val errors = file.filter(_.contains("ERROR")) + errors.cache() + errors.count() + + #. The textFile operator reads log files from the HDFS and returns files (as an RDD). + #. The filter operator filters rows with **ERROR** and assigns them to errors (a new RDD). The filter operator is a transformation. + #. The cache operator caches errors for future use. + #. The count operator returns the number of rows of errors. The count operator is an action. + + **Transformation includes the following types:** + + - The RDD elements are regarded as simple elements. + + The input and output has the one-to-one relationship, and the partition structure of the result RDD remains unchanged, for example, map. 
+ + The input and output have a one-to-many relationship, and the partition structure of the result RDD remains unchanged, for example, flatMap (each element is mapped to a sequence of elements, which is then flattened into multiple elements). + + The input and output have a one-to-one relationship, but the partition structure of the result RDD changes, for example, union (two RDDs are integrated into one RDD, and the number of partitions becomes the sum of the numbers of partitions of the two RDDs) and coalesce (the number of partitions is reduced). + + Operators that select some elements from the input, such as filter, distinct (duplicate elements are removed), subtract (only elements that exist in this RDD but not in the other RDD are retained), and sample (samples are taken). + + - The RDD elements are regarded as key-value pairs. + + Perform one-to-one calculation on a single RDD, such as mapValues (which retains the partitioning of the source RDD, unlike map). + + Sort a single RDD, such as sort and partitionBy (consistent partitioning, which is important for local optimization). + + Restructure and reduce a single RDD based on the key, such as groupByKey and reduceByKey. + + Join and restructure two RDDs based on the key, such as join and cogroup. + + .. note:: + + The latter three operations, which involve sorting, are called shuffle operations. + + **Action includes the following types:** + + - Generate scalar values, such as **count** (return the number of elements in the RDD), **reduce** and **fold/aggregate** (return a scalar value aggregated from the elements), and **take** (return the first N elements). + - Generate a Scala collection, such as **collect** (collect all elements in the RDD into a Scala collection) and **lookup** (look up all values corresponding to the key). + - Write data to the storage, such as **saveAsTextFile** (which corresponds to the preceding **textFile**). + - Checkpoint, such as the **checkpoint** operator. When the lineage is quite long (which occurs frequently in graph computation), it takes a long period of time to execute the whole sequence again when a fault occurs. In this case, checkpoint is used as the check point to write the current data to stable storage. + +- **Shuffle** + + Shuffle is a specific phase in the MapReduce framework, which is located between the Map phase and the Reduce phase. If the output results of Map are to be used by Reduce, the output results must be hashed based on a key and distributed to each Reducer. This process is called Shuffle. Shuffle involves the read and write of the disk and the transmission of the network, so the performance of Shuffle directly affects the operation efficiency of the entire program. + + The figure below shows the entire process of the MapReduce algorithm. + + + .. figure:: /_static/images/en-us_image_0000001349110513.png + :alt: **Figure 12** Algorithm process + + **Figure 12** Algorithm process + + Shuffle is a bridge connecting data between stages. The following describes the implementation of shuffle in Spark. + + Shuffle divides a job of Spark into multiple stages. The former stages contain one or more ShuffleMapTasks, and the last stage contains one or more ResultTasks. + +- **Spark Application Structure** + + The Spark application structure includes the initialized SparkContext and the main program. + + - Initialized SparkContext: constructs the operating environment of the Spark Application. + + Constructs the SparkContext object. The following is an example: + + ..
code-block:: + + new SparkContext(master, appName, [SparkHome], [jars]) + + Parameter description: + + **master**: indicates the connection string. The connection modes include local, Yarn-cluster, and Yarn-client. + + **appName**: indicates the application name. + + **SparkHome**: indicates the directory where Spark is installed in the cluster. + + **jars**: indicates the code and dependency packages of an application. + + - Main program: processes data. + + For details about how to submit an application, visit https://spark.apache.org/docs/3.1.1/submitting-applications.html. + +- **Spark Shell Commands** + + The basic Spark shell commands support the submission of Spark applications. The **spark-submit** command is as follows: + + .. code-block:: + + ./bin/spark-submit \ + --class <main-class> \ + --master <master-url> \ + ... # other options + <application-jar> \ + [application-arguments] + + Parameter description: + + **--class**: indicates the name of the main class of a Spark application. + + **--master**: indicates the master to which the Spark application links, such as Yarn-client and Yarn-cluster. + + **application-jar**: indicates the path of the JAR file of the Spark application. + + **application-arguments**: indicates the parameters passed to the main class of the Spark application. This parameter can be left blank. + +- **Spark JobHistory Server** + + The Spark web UI is used to monitor the details of each phase of a running or historical Spark job and to display logs, which helps users develop, configure, and optimize the job at a finer granularity. diff --git a/umn/source/overview/components/spark2x/index.rst b/umn/source/overview/components/spark2x/index.rst new file mode 100644 index 0000000..cf19d8e --- /dev/null +++ b/umn/source/overview/components/spark2x/index.rst @@ -0,0 +1,22 @@ +:original_name: mrs_08_0071.html + +.. _mrs_08_0071: + +Spark2x +======= + +- :ref:`Basic Principles of Spark2x ` +- :ref:`Spark2x HA Solution ` +- :ref:`Relationship Between Spark2x and Other Components ` +- :ref:`Spark2x Open Source New Features ` +- :ref:`Spark2x Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + basic_principles_of_spark2x + spark2x_ha_solution/index + relationship_between_spark2x_and_other_components + spark2x_open_source_new_features + spark2x_enhanced_open_source_features/index diff --git a/umn/source/overview/components/spark2x/relationship_between_spark2x_and_other_components.rst b/umn/source/overview/components/spark2x/relationship_between_spark2x_and_other_components.rst new file mode 100644 index 0000000..18b5509 --- /dev/null +++ b/umn/source/overview/components/spark2x/relationship_between_spark2x_and_other_components.rst @@ -0,0 +1,109 @@ +:original_name: mrs_08_007105.html + +.. _mrs_08_007105: + +Relationship Between Spark2x and Other Components +================================================= + +Relationship Between Spark and HDFS +----------------------------------- + +Data computed by Spark comes from multiple data sources, such as local files and HDFS. Most data comes from HDFS, which can read data in large scale for parallel computing. After being computed, data can be stored in HDFS. + +Spark involves Driver and Executor. Driver schedules tasks and Executor runs tasks. + +:ref:`Figure 1 ` describes the file reading process. + +.. _mrs_08_007105__f685f281350dc4fd3b98d846f3e41143d: + +..
figure:: /_static/images/en-us_image_0000001349110517.png + :alt: **Figure 1** File reading process + + **Figure 1** File reading process + +The file reading process is as follows: + +#. Driver interconnects with HDFS to obtain the information of File A. +#. The HDFS returns the detailed block information about this file. +#. Driver sets a parallel degree based on the block data amount, and creates multiple tasks to read the blocks of this file. +#. Executor runs the tasks and reads the detailed blocks as part of the Resilient Distributed Dataset (RDD). + +:ref:`Figure 2 ` describes the file writing process. + +.. _mrs_08_007105__f18154e3da87f42c5976570cd262f7ec1: + +.. figure:: /_static/images/en-us_image_0000001349309973.png + :alt: **Figure 2** File writing process + + **Figure 2** File writing process + +The file writing process is as follows: + +#. .. _mrs_08_007105__l4c0b15c64e104bfb9f02ae2489ed59ce: + + Driver creates a directory where the file is to be written. + +#. Based on the RDD distribution status, the number of tasks related to data writing is computed, and these tasks are sent to Executor. + +#. Executor runs these tasks, and writes the RDD data to the directory created in :ref:`1 `. + +Relationship with Yarn +---------------------- + +The Spark computing and scheduling can be implemented using Yarn mode. Spark enjoys the computing resources provided by Yarn clusters and runs tasks in a distributed way. Spark on Yarn has two modes: Yarn-cluster and Yarn-client. + +- Yarn-cluster mode + + :ref:`Figure 3 ` describes the operation framework. + + .. _mrs_08_007105__f0c19aec9e28a4f2fbdd56db1a318f641: + + .. figure:: /_static/images/en-us_image_0000001296270854.png + :alt: **Figure 3** Spark on Yarn-cluster operation framework + + **Figure 3** Spark on Yarn-cluster operation framework + + Spark on Yarn-cluster implementation process: + + #. The client generates the application information, and then sends the information to ResourceManager. + + #. ResourceManager allocates the first container (ApplicationMaster) to SparkApplication and starts the driver on the container. + + #. ApplicationMaster applies for resources from ResourceManager to run the container. + + ResourceManager allocates the containers to ApplicationMaster, which communicates with the related NodeManagers and starts the executor in the obtained container. After the executor is started, it registers with drivers and applies for tasks. + + #. Drivers allocate tasks to the executors. + + #. Executors run tasks and report the operating status to Drivers. + +- Yarn-client mode + + :ref:`Figure 4 ` describes the operation framework. + + .. _mrs_08_007105__f627f617b8382474ba78da7751a10219a: + + .. figure:: /_static/images/en-us_image_0000001349390685.png + :alt: **Figure 4** Spark on Yarn-client operation framework + + **Figure 4** Spark on Yarn-client operation framework + + Spark on Yarn-client implementation process: + + .. note:: + + In Yarn-client mode, the Driver is deployed and started on the client. In Yarn-client mode, the client of an earlier version is incompatible. The Yarn-cluster mode is recommended. + + #. The client sends the Spark application request to ResourceManager, and packages all information required to start ApplicationMaster and sends the information to ResourceManager. ResourceManager then returns the results to the client. The results include information such as ApplicationId, and the upper limit as well as lower limit of available resources. 
After receiving the request, ResourceManager finds a proper node for ApplicationMaster and starts it on this node. ApplicationMaster is a role in Yarn, and the process name in Spark is ExecutorLauncher. + + #. Based on the resource requirements of each task, ApplicationMaster can apply for a series of containers to run tasks from ResourceManager. + + #. After receiving the newly allocated container list (from ResourceManager), ApplicationMaster sends information to the related NodeManagers to start the containers. + + ResourceManager allocates the containers to ApplicationMaster, which communicates with the related NodeManagers and starts the executor in the obtained container. After the executor is started, it registers with drivers and applies for tasks. + + .. note:: + + Running Containers will not be suspended to release resources. + + #. Drivers allocate tasks to the executors. Executors run tasks and report the operating status to Drivers. diff --git a/umn/source/overview/components/spark2x/spark2x_enhanced_open_source_features/carbondata_overview.rst b/umn/source/overview/components/spark2x/spark2x_enhanced_open_source_features/carbondata_overview.rst new file mode 100644 index 0000000..3090ca9 --- /dev/null +++ b/umn/source/overview/components/spark2x/spark2x_enhanced_open_source_features/carbondata_overview.rst @@ -0,0 +1,56 @@ +:original_name: mrs_08_007108.html + +.. _mrs_08_007108: + +CarbonData Overview +=================== + +CarbonData is a new Apache Hadoop native data-store format. CarbonData allows faster interactive queries over PetaBytes of data using advanced columnar storage, index, compression, and encoding techniques to improve computing efficiency. In addition, CarbonData is also a high-performance analysis engine that integrates data sources with Spark. + + +.. figure:: /_static/images/en-us_image_0000001296750278.png + :alt: **Figure 1** Basic architecture of CarbonData + + **Figure 1** Basic architecture of CarbonData + +The purpose of using CarbonData is to provide quick response to ad hoc queries of big data. Essentially, CarbonData is an Online Analytical Processing (OLAP) engine, which stores data by using tables similar to those in Relational Database Management System (RDBMS). You can import more than 10 TB data to tables created in CarbonData format, and CarbonData automatically organizes and stores data using the compressed multi-dimensional indexes. After data is loaded to CarbonData, CarbonData responds to ad hoc queries in seconds. + +CarbonData integrates data sources into the Spark ecosystem and you can query and analyze the data using Spark SQL. You can also use the third-party tool JDBCServer provided by Spark to connect to SparkSQL. + +Topology of CarbonData +---------------------- + +CarbonData runs as a data source inside Spark. Therefore, CarbonData does not start any additional processes on nodes in clusters. CarbonData engine runs inside the Spark executor. + + +.. figure:: /_static/images/en-us_image_0000001296590662.png + :alt: **Figure 2** Topology of CarbonData + + **Figure 2** Topology of CarbonData + +Data stored in CarbonData Table is divided into several CarbonData data files. Each time when data is queried, CarbonData Engine reads and filters data sets. CarbonData Engine runs as a part of the Spark Executor process and is responsible for handling a subset of data file blocks. + +Table data is stored in HDFS. Nodes in the same Spark cluster can be used as HDFS data nodes. 
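+
+The following is a minimal usage sketch of this workflow: creating a CarbonData table, bulk-loading data, and running an ad hoc query through Spark SQL. The table name, schema, and HDFS path are hypothetical, and the exact DDL keywords may vary with the CarbonData version in use.
+
+.. code-block::
+
+   import org.apache.spark.sql.SparkSession;
+
+   public class CarbonDataQuickStart {
+       public static void main(String[] args) {
+           // Assumes an MRS Spark2x environment in which CarbonData is already integrated with Spark.
+           SparkSession spark = SparkSession.builder()
+                   .appName("CarbonDataQuickStart")
+                   .enableHiveSupport()
+                   .getOrCreate();
+
+           // Create a CarbonData table (hypothetical schema).
+           spark.sql("CREATE TABLE IF NOT EXISTS sales (id INT, country STRING, amount DOUBLE) "
+                   + "STORED AS carbondata");
+
+           // Bulk-load historical data from a CSV file in HDFS (hypothetical path).
+           spark.sql("LOAD DATA INPATH 'hdfs:///tmp/sales.csv' INTO TABLE sales");
+
+           // Ad hoc query; filtering and aggregation are served by the CarbonData engine inside the executors.
+           spark.sql("SELECT country, SUM(amount) FROM sales WHERE amount > 100 GROUP BY country").show();
+
+           spark.stop();
+       }
+   }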
+ +CarbonData Features +------------------- + +- SQL: CarbonData is compatible with Spark SQL and supports SQL query operations performed on Spark SQL. +- Simple table dataset definition: CarbonData allows you to define and create datasets by using user-friendly Data Definition Language (DDL) statements. CarbonData DDL is flexible and easy to use, and can define complex tables. +- Easy data management: CarbonData provides various data management functions for data loading and maintenance. CarbonData supports bulk loading of historical data and incremental loading of new data. Loaded data can be deleted based on load time and a specific loading operation can be undone. +- CarbonData file format is a columnar store in HDFS. This format has many new column-based file storage features, such as table splitting and data compression. CarbonData has the following characteristics: + + - Stores data along with index: Significantly accelerates query performance and reduces I/O scans and CPU resources when there are filters in the query. The CarbonData index consists of multiple levels of indices. A processing framework can leverage this index to reduce the number of tasks that need to be scheduled and processed, and it can also perform skip scans at a finer granularity (called blocklet) in task-side scanning instead of scanning the whole file. + - Operable encoded data: By supporting efficient compression and global encoding schemes, CarbonData can query compressed/encoded data directly. The data is converted only just before the results are returned to the users, which is called late materialization. + - Supports various use cases with one single data format: for example, interactive OLAP-style queries, sequential access (big scan), and random access (narrow scan). + +Key Technologies and Advantages of CarbonData +--------------------------------------------- + +- Quick query response: CarbonData features high-performance query. The query speed of CarbonData is 10 times that of Spark SQL. It uses dedicated data formats and applies multiple index technologies, global dictionary encoding, and multiple push-down optimizations, providing quick response to TB-level data queries. +- Efficient data compression: CarbonData compresses data by combining lightweight and heavyweight compression algorithms. This saves 60% to 80% of data storage space and significantly reduces hardware storage costs. + +CarbonData Index Cache Server +----------------------------- + +To relieve the pressure and problems that the increasing data volume brings to the driver, an independent index cache server is introduced to separate the index from the Spark application side of CarbonData queries. All index content is managed by the index cache server. Spark applications obtain required index data in RPC mode. In this way, a large amount of memory on the service side is released, so that services are not affected by the cluster scale and the performance and functions are not affected. diff --git a/umn/source/overview/components/spark2x/spark2x_enhanced_open_source_features/index.rst b/umn/source/overview/components/spark2x/spark2x_enhanced_open_source_features/index.rst new file mode 100644 index 0000000..b0340bd --- /dev/null +++ b/umn/source/overview/components/spark2x/spark2x_enhanced_open_source_features/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_08_007107.html + +.. _mrs_08_007107: + +Spark2x Enhanced Open Source Features +===================================== + +- :ref:`CarbonData Overview ` +- :ref:`Optimizing SQL Query of Data of Multiple Sources ` + +..
toctree:: + :maxdepth: 1 + :hidden: + + carbondata_overview + optimizing_sql_query_of_data_of_multiple_sources diff --git a/umn/source/overview/components/spark2x/spark2x_enhanced_open_source_features/optimizing_sql_query_of_data_of_multiple_sources.rst b/umn/source/overview/components/spark2x/spark2x_enhanced_open_source_features/optimizing_sql_query_of_data_of_multiple_sources.rst new file mode 100644 index 0000000..f9ab7ba --- /dev/null +++ b/umn/source/overview/components/spark2x/spark2x_enhanced_open_source_features/optimizing_sql_query_of_data_of_multiple_sources.rst @@ -0,0 +1,91 @@ +:original_name: mrs_08_007109.html + +.. _mrs_08_007109: + +Optimizing SQL Query of Data of Multiple Sources +================================================ + +Scenario +-------- + +Enterprises usually store massive data, such as from various databases and warehouses, for management and information collection. However, diversified data sources, hybrid dataset structures, and scattered data storage lower query efficiency. + +The open source Spark only supports simple filter pushdown during querying of multi-source data. The SQL engine performance is deteriorated due of a large amount of unnecessary data transmission. The pushdown function is enhanced, so that **aggregate**, complex **projection**, and complex **predicate** can be pushed to data sources, reducing unnecessary data transmission and improving query performance. + +Only the JDBC data source supports pushdown of query operations, such as **aggregate**, **projection**, **predicate**, **aggregate over inner join**, and **aggregate over union all**. All pushdown operations can be enabled based on your requirements. + +.. table:: **Table 1** Enhanced query of cross-source query + + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Before Enhancement | After Enhancement | + +===========================+====================================================================================================================================+===============================================================================================================================================================================================================================================================+ + | aggregate | The pushdown of **aggregate** is not supported. | - Aggregation functions including **sum**, **avg**, **max**, **min**, and **count** are supported. | + | | | | + | | | Example: select count(``*``) from table | + | | | | + | | | - Internal expressions of aggregation functions are supported. | + | | | | + | | | Example: select sum(a+b) from table | + | | | | + | | | - Calculation of aggregation functions is supported. Example: select avg(a) + max(b) from table | + | | | | + | | | - Pushdown of **having** is supported. | + | | | | + | | | Example: select sum(a) from table where a>0 group by b having sum(a)>10 | + | | | | + | | | - Pushdown of some functions is supported. | + | | | | + | | | Pushdown of lines in mathematics, time, and string functions, such as **abs()**, **month()**, and **length()** are supported. 
In addition to the preceding built-in functions, you can run the **SET** command to add functions supported by data sources. | + | | | | + | | | Example: select sum(abs(a)) from table | + | | | | + | | | - Pushdown of **limit** and **order by** after **aggregate** is supported. However, the pushdown is not supported in Oracle, because Oracle does not support **limit**. | + | | | | + | | | Example: select sum(a) from table where a>0 group by b order by sum(a) limit 5 | + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | projection | Only pushdown of simple **projection** is supported. Example: select a, b from table | - Complex expressions can be pushed down. | + | | | | + | | | Example: select (a+b)*c from table | + | | | | + | | | - Some functions can be pushed down. For details, see the description below the table. | + | | | | + | | | Example: select length(a)+abs(b) from table | + | | | | + | | | - Pushdown of **limit** and **order by** after **projection** is supported. | + | | | | + | | | Example: select a, b+c from table order by a limit 3 | + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | predicate | Only simple filtering with the column name on the left of the operator and values on the right is supported. Example: | - Complex expression pushdown is supported. | + | | | | + | | select \* from table where a>0 or b in ("aaa", "bbb") | Example: select \* from table where a+b>c*d or a/c in (1, 2, 3) | + | | | | + | | | - Some functions can be pushed down. For details, see the description below the table. | + | | | | + | | | Example: select \* from table where length(a)>5 | + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | aggregate over inner join | Related data from the two tables must be loaded to Spark. The join operation must be performed before the **aggregate** operation. | The following functions are supported: | + | | | | + | | | - Aggregation functions including **sum**, **avg**, **max**, **min**, and **count** are supported. | + | | | - All **aggregate** operations can be performed in a same table. The **group by** operations can be performed on one or two tables and only inner join is supported. | + | | | | + | | | The following scenarios are not supported: | + | | | | + | | | - **aggregate** cannot be pushed down from both the left- and right-join tables. | + | | | - **aggregate** contains operations, for example, sum(a+b). 
| + | | | - **aggregate** operations, for example, sum(a)+min(b). | + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | aggregate over union all | Related data from the two tables must be loaded to Spark. **union** must be performed before **aggregate**. | Supported scenarios: | + | | | | + | | | Aggregation functions including **sum**, **avg**, **max**, **min**, and **count** are supported. | + | | | | + | | | Unsupported scenarios: | + | | | | + | | | - **aggregate** contains operations, for example, sum(a+b). | + | | | - **aggregate** operations, for example, sum(a)+min(b). | + +---------------------------+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Precautions +----------- + +- If external data source is Hive, query operation cannot be performed on foreign tables created by Spark. +- Only MySQL and MPPDB data sources are supported. diff --git a/umn/source/overview/components/spark2x/spark2x_ha_solution/index.rst b/umn/source/overview/components/spark2x/spark2x_ha_solution/index.rst new file mode 100644 index 0000000..b993b3b --- /dev/null +++ b/umn/source/overview/components/spark2x/spark2x_ha_solution/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_08_007102.html + +.. _mrs_08_007102: + +Spark2x HA Solution +=================== + +- :ref:`Spark2x Multi-active Instance ` +- :ref:`Spark2x Multi-tenant ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + spark2x_multi-active_instance + spark2x_multi-tenant diff --git a/umn/source/overview/components/spark2x/spark2x_ha_solution/spark2x_multi-active_instance.rst b/umn/source/overview/components/spark2x/spark2x_ha_solution/spark2x_multi-active_instance.rst new file mode 100644 index 0000000..24d379a --- /dev/null +++ b/umn/source/overview/components/spark2x/spark2x_ha_solution/spark2x_multi-active_instance.rst @@ -0,0 +1,115 @@ +:original_name: mrs_08_007103.html + +.. _mrs_08_007103: + +Spark2x Multi-active Instance +============================= + +Background +---------- + +Based on existing JDBCServers in the community, multi-active-instance HA is used to achieve the high availability. In this mode, multiple JDBCServers coexist in the cluster and the client can randomly connect any JDBCServer to perform service operations. When one or multiple JDBCServers stop working, a client can connect to another normal JDBCServer. + +Compared with active/standby HA, multi-active instance HA eliminates the following restrictions: + +- In active/standby HA, when the active/standby switchover occurs, the unavailable period cannot be controlled by JDBCServer, but determined by Yarn service resources. +- In Spark, the Thrift JDBC similar to HiveServer2 provides services and users access services through Beeline and JDBC API. 
Therefore, the processing capability of the JDBCServer cluster depends on the single-point capability of the primary server, and the scalability is insufficient. + +Multi-active instance HA not only prevents service interruption caused by switchover, but also enables cluster scale-out to secure high concurrency. + +Implementation +-------------- + +The following figure shows the basic principle of multi-active instance HA of Spark JDBCServer. + + +.. figure:: /_static/images/en-us_image_0000001349309965.png + :alt: **Figure 1** Spark JDBCServer HA + + **Figure 1** Spark JDBCServer HA + +#. After JDBCServer is started, it registers with ZooKeeper by writing node information in a specified directory. Node information includes the JDBCServer instance IP, port number, version, and serial number (information of different nodes is separated by commas). + + An example is provided as follows: + + .. code-block:: + + [serverUri=192.168.169.84:22550 + ;version=8.1.0.1;sequence=0000001244,serverUri=192.168.195.232:22550 ;version=8.1.0.1;sequence=0000001242,serverUri=192.168.81.37:22550 ;version=8.1.0.1;sequence=0000001243] + +#. To connect to JDBCServer, the client must specify the namespace, which is the directory of JDBCServer instances in ZooKeeper. During the connection, a JDBCServer instance is randomly selected from the specified namespace. For details about URL, see :ref:`URL Connection `. + +#. After the connection succeeds, the client sends SQL statements to JDBCServer. + +#. JDBCServer executes received SQL statements and sends results back to the client. + +In multi-active instance HA mode, all JDBCServer instances are independent and equivalent. When one instance is interrupted during upgrade, other JDBCServer instances can accept the connection request from the client. + +Following rules must be followed in the multi-active instance HA of Spark JDBCServer: + +- If a JDBCServer instance exits abnormally, no other instance will take over the sessions and services running on this abnormal instance. +- When the JDBCServer process is stopped, corresponding nodes are deleted from ZooKeeper. +- The client randomly selects the server, which may result in uneven session allocation, and finally result in imbalance of instance load. +- After the instance enters the maintenance mode (in which no new connection request from the client is accepted), services still running on the instance may fail when the decommissioning times out. + +.. _mrs_08_007103__s85fcf528f3e0410bb62ba828660aa83d: + +URL Connection +-------------- + +**Multi-active instance mode** + +In multi-active instance mode, the client reads content from the ZooKeeper node and connects to JDBCServer. The connection strings are as follows: + +- Security mode: + + - If Kinit authentication is enabled, the JDBCURL is as follows: + + .. code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.@; + + .. note:: + + - **:** indicates the ZooKeeper URL. Use commas (,) to separate multiple URLs, + + For example, **192.168.81.37:2181,192.168.195.232:2181,192.168.169.84:2181**. + + - **sparkthriftserver2x** indicates the directory in ZooKeeper, where a random JDBCServer instance is connected to the client. 
+ + For example, when you use Beeline client for connection in security mode, run the following command: + + **sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:,:,:*\ **/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.**\ **\ **@**\ **\ **;"** + + - If Keytab authentication is enabled, the JDBCURL is as follows: + + .. code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.@;user.principal=;user.keytab= + + ** indicates the principal of Kerberos user, for example, **test@**\ **. ** indicates the Keytab file path corresponding to **, for example, **/opt/auth/test/user.keytab**. + +- Common mode: + + .. code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x; + + For example, when you use Beeline client for connection in common mode, run the following command: + + **sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:,:,:*\ **/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;"** + +**Non-multi-active instance mode** + +In non-multi-active instance mode, a client connects to a specified JDBCServer node. Compared with multi-active instance mode, the connection string in non-multi-active instance mode does not contain **serviceDiscoveryMode** and **zooKeeperNamespace** parameters about ZooKeeper. + +For example, when you use Beeline client to connect JDBCServer in non-multi-active instance mode, run the following command: + +**sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:*\ **/;user.principal=spark2x/hadoop.**\ *@*\ **;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.**\ *@*\ **;"** + +.. note:: + + - **:** indicates the URL of the specified JDBCServer node. + - **CLIENT_HOME** indicates the client path. + +Except the connection method, operations of JDBCServer API in multi-active instance mode and non-multi-active instance mode are the same. Spark JDBCServer is another implementation of HiveServer2 in Hive. For details about other operations, see official website of Hive at https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients. diff --git a/umn/source/overview/components/spark2x/spark2x_ha_solution/spark2x_multi-tenant.rst b/umn/source/overview/components/spark2x/spark2x_ha_solution/spark2x_multi-tenant.rst new file mode 100644 index 0000000..1c67db3 --- /dev/null +++ b/umn/source/overview/components/spark2x/spark2x_ha_solution/spark2x_multi-tenant.rst @@ -0,0 +1,118 @@ +:original_name: mrs_08_007104.html + +.. _mrs_08_007104: + +Spark2x Multi-tenant +==================== + +Background +---------- + +In the JDBCServer multi-active instance mode, JDBCServer implements the Yarn-client mode but only one Yarn resource queue is available. To solve the resource limitation problem, the multi-tenant mode is introduced. + +In multi-tenant mode, JDBCServers are bound with tenants. Each tenant corresponds to one or more JDBCServers, and a JDBCServer provides services for only one tenant. Different tenants can be configured with different Yarn queues to implement resource isolation. In addition, JDBCServer can be dynamically started as required to avoid resource waste. + +Implementation +-------------- + +:ref:`Figure 1 ` shows the HA solution of the multi-tenant mode. + +.. _mrs_08_007104__fd976abf162d04390bb64dc2ab6d2d226: + +.. 
figure:: /_static/images/en-us_image_0000001349309933.png + :alt: **Figure 1** Multi-tenant mode of Spark JDBCServer + + **Figure 1** Multi-tenant mode of Spark JDBCServer + +#. When ProxyServer is started, it registers with ZooKeeper by writing node information in a specified directory. Node information includes the instance IP, port number, version, and serial number (information of different nodes is separated by commas). + + .. note:: + + In multi-tenant mode, the JDBCServer instance on MRS page indicates ProxyServer, the JDBCServer agent. + + An example is provided as follows: + + .. code-block:: + + serverUri=192.168.169.84:22550 + ;version=8.1.0.1;sequence=0000001244,serverUri=192.168.195.232:22550 + ;version=8.1.0.1;sequence=0000001242,serverUri=192.168.81.37:22550 + ;version=8.1.0.1;sequence=0000001243, + +#. To connect to ProxyServer, the client must specify a namespace, which is the directory of the ProxyServer instance that you want to access in ZooKeeper. When the client connects to ProxyServer, an instance under Namespace is randomly selected for connection. For details about the URL, see :ref:`URL Connection `. + +#. After the client successfully connects to ProxyServer, ProxyServer checks whether the JDBCServer of a tenant exists. If yes, Beeline connects the JDBCServer. If no, a new JDBCServer is started in Yarn-cluster mode. After the startup of JDBCServer, ProxyServer obtains the IP address of the JDBCServer and establishes the connection between Beeline and JDBCServer. + +#. The client sends SQL statements to ProxyServer, which then forwards statements to the connected JDBCServer. JDBCServer returns the results to ProxyServer, which then returns the results to the client. + +In multi-tenant HA mode, all ProxyServer instances are independent and equivalent. If one instance is interrupted during upgrade, other instances can accept the connection request from the client. + +.. _mrs_08_007104__s554f2fdef78b47bda50abc0b84bbbd5a: + +URL Connection +-------------- + +**Multi-tenant mode** + +In multi-tenant mode, the client reads content from the ZooKeeper node and connects to ProxyServer. The connection strings are as follows: + +- Security mode: + + - If Kinit authentication is enabled, the client URL is as follows: + + .. code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.@; + + .. note:: + + - **:** indicates the ZooKeeper URL. Use commas (,) to separate multiple URLs, + + For example, **192.168.81.37:2181,192.168.195.232:2181,192.168.169.84:2181**. + + - **sparkthriftserver2x** indicates the ZooKeeper directory, where a random JDBCServer instance is connected to the client. + + For example, when you use Beeline client for connection in security mode, run the following command: + + **sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:,:,:*\ **/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.**\ **\ **@**\ **\ **;"** + + - If Keytab authentication is enabled, the URL is as follows: + + .. code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.@;user.principal=;user.keytab= + + ** indicates the principal of Kerberos user, for example, **test@**\ **. ** indicates the Keytab file path corresponding to **, for example, **/opt/auth/test/user.keytab**. 
+ +- Common mode: + + .. code-block:: + + jdbc:hive2://:,:,:/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x; + + For example, when you use Beeline client for connection in common mode, run the following command: + + **sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:,:,:*\ **/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;"** + +**Non-multi-tenant mode** + +In non-multi-tenant mode, a client connects to a specified JDBCServer node. Compared with multi-active instance mode, the connection string in non-multi-active instance mode does not contain **serviceDiscoveryMode** and **zooKeeperNamespace** parameters about ZooKeeper. + +For example, when you use Beeline client to connect JDBCServer in non-multi-tenant instance mode, run the following command: + +**sh** *CLIENT_HOME*\ **/spark/bin/beeline -u "jdbc:hive2://**\ *:*\ **/;user.principal=spark/hadoop.**\ *@*\ **;saslQop=auth-conf;auth=KERBEROS;principal=spark/hadoop.**\ *@*\ **;"** + +.. note:: + + - **:** indicates the URL of the specified JDBCServer node. + - **CLIENT_HOME** indicates the client path. + +Except the connection method, other operations of JDBCServer API in multi-tenant mode and non-multi-tenant mode are the same. Spark JDBCServer is another implementation of HiveServer2 in Hive. For details about other operations, see official website of Hive at https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients. + +**Specifying a Tenant** + +Generally, the client submitted by a user connects to the default JDBCServer of the tenant to which the user belongs. If you want to connect the client to the JDBCServer of a specified tenant, add the **--hiveconf mapreduce.job.queuename** parameter. + +Command for connecting Beeline is as follows (**aaa** indicates the tenant name): + +**beeline --hiveconf mapreduce.job.queuename=aaa -u 'jdbc:hive2://192.168.39.30:2181,192.168.40.210:2181,192.168.215.97:2181;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.\ \ @\ ;'** diff --git a/umn/source/overview/components/spark2x/spark2x_open_source_new_features.rst b/umn/source/overview/components/spark2x/spark2x_open_source_new_features.rst new file mode 100644 index 0000000..bed4eb0 --- /dev/null +++ b/umn/source/overview/components/spark2x/spark2x_open_source_new_features.rst @@ -0,0 +1,20 @@ +:original_name: mrs_08_007106.html + +.. _mrs_08_007106: + +Spark2x Open Source New Features +================================ + +Purpose +------- + +Compared with Spark 1.5, Spark2\ *x* has some new open-source features. The specific features or concepts are as follows: + +- DataSet: For details, see :ref:`SparkSQL and DataSet Principle `. +- Spark SQL Native DDL/DML: For details, see :ref:`SparkSQL and DataSet Principle `. +- SparkSession: For details, see :ref:`SparkSession Principle `. +- Structured Streaming: For details, see :ref:`Structured Streaming Principle `. +- Optimizing Small Files +- Optimizing the Aggregate Algorithm +- Optimizing Datasource Tables +- Merging CBO diff --git a/umn/source/overview/components/storm/index.rst b/umn/source/overview/components/storm/index.rst new file mode 100644 index 0000000..87d7138 --- /dev/null +++ b/umn/source/overview/components/storm/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_08_0014.html + +.. 
_mrs_08_0014: + +Storm +===== + +- :ref:`Storm Basic Principles ` +- :ref:`Relationship Between Storm and Other Components ` +- :ref:`Storm Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + storm_basic_principles + relationship_between_storm_and_other_components + storm_enhanced_open_source_features diff --git a/umn/source/overview/components/storm/relationship_between_storm_and_other_components.rst b/umn/source/overview/components/storm/relationship_between_storm_and_other_components.rst new file mode 100644 index 0000000..0df2995 --- /dev/null +++ b/umn/source/overview/components/storm/relationship_between_storm_and_other_components.rst @@ -0,0 +1,31 @@ +:original_name: mrs_08_00142.html + +.. _mrs_08_00142: + +Relationship Between Storm and Other Components +=============================================== + +Storm provides a real-time distributed computing framework. It can obtain real-time messages from data sources (such as Kafka and TCP connection), perform high-throughput and low-latency real-time computing on a real-time platform, and export results to message queues or implement data persistence. :ref:`Figure 1 ` shows the relationship between Storm and other components. + +.. _mrs_08_00142__fig_re: + +.. figure:: /_static/images/en-us_image_0000001349190329.png + :alt: **Figure 1** Relationship with other components + + **Figure 1** Relationship with other components + +Relationship between Storm and Streaming +---------------------------------------- + +Both Storm and Streaming use the open source Apache Storm kernel. However, the kernel version used by Storm is 1.2.1 whereas that used by Streaming is 0.10.0. Streaming is used to inherit transition services in upgrade scenarios. For example, if Streaming has been deployed in an earlier version and services are running, Streaming can still be used after the upgrade. Storm is recommended in a new cluster. + +Storm 1.2.1 has the following new features: + +- **Distributed cache**: Provides external resources (configurations) required for sharing and updating the topology using CLI tools. You do not need to re-package and re-deploy the topology. +- **Native Streaming Window API**: Provides window-based APIs. +- **Resource scheduler**: Added the resource scheduler plug-in. When defining a topology, you can specify the maximum resources available and assign resource quotas to users, thus to manage topology resources of the users. +- **State management**: Provides the Bolt API with the checkpoint mechanism. When an event fails, Storm automatically manages the Bolt status and restore the event. +- **Message sampling and debugging**: On the Storm UI, you can enable or disable topology- or component-level debugging to output stream messages to specified logs based on the sampling ratio. +- **Worker dynamic analysis**: On the Storm UI, you can collect jstack and heap logs of the Worker process and restart the Worker process. +- **Dynamic adjustment of topology logs**: You can dynamically change the running topology logs on the CLI or Storm UI. +- **Improved performance**: Compared with earlier versions, the performance of Storm is greatly improved. Although the topology performance is closely related to the use case scenario and dependency on external services, the performance is three times higher in most scenarios. 
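+
+As a sketch of how such a source-to-computation pipeline is wired together, the following self-contained example uses the core Storm Java API to connect one spout to one bolt. The spout below simply emits words and stands in for a real source such as a Kafka topic or TCP connection; the class and component names are illustrative only.
+
+.. code-block::
+
+   import java.util.Map;
+   import org.apache.storm.Config;
+   import org.apache.storm.LocalCluster;
+   import org.apache.storm.spout.SpoutOutputCollector;
+   import org.apache.storm.task.TopologyContext;
+   import org.apache.storm.topology.BasicOutputCollector;
+   import org.apache.storm.topology.OutputFieldsDeclarer;
+   import org.apache.storm.topology.TopologyBuilder;
+   import org.apache.storm.topology.base.BaseBasicBolt;
+   import org.apache.storm.topology.base.BaseRichSpout;
+   import org.apache.storm.tuple.Fields;
+   import org.apache.storm.tuple.Tuple;
+   import org.apache.storm.tuple.Values;
+   import org.apache.storm.utils.Utils;
+
+   public class WordTopology {
+       // Spout: emits one word per second; stands in for a real source such as Kafka.
+       public static class WordSpout extends BaseRichSpout {
+           private SpoutOutputCollector collector;
+           private final String[] words = {"storm", "stream", "tuple", "bolt"};
+           private int index = 0;
+
+           public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
+               this.collector = collector;
+           }
+
+           public void nextTuple() {
+               Utils.sleep(1000);
+               collector.emit(new Values(words[index++ % words.length]));
+           }
+
+           public void declareOutputFields(OutputFieldsDeclarer declarer) {
+               declarer.declare(new Fields("word"));
+           }
+       }
+
+       // Bolt: a trivial transformation that appends "!" to each word it receives.
+       public static class ExclaimBolt extends BaseBasicBolt {
+           public void execute(Tuple input, BasicOutputCollector collector) {
+               collector.emit(new Values(input.getStringByField("word") + "!"));
+           }
+
+           public void declareOutputFields(OutputFieldsDeclarer declarer) {
+               declarer.declare(new Fields("word"));
+           }
+       }
+
+       public static void main(String[] args) throws Exception {
+           TopologyBuilder builder = new TopologyBuilder();
+           builder.setSpout("word-spout", new WordSpout(), 1);
+           builder.setBolt("exclaim-bolt", new ExclaimBolt(), 2).shuffleGrouping("word-spout");
+
+           // Run locally for a short time; on a real cluster, use StormSubmitter.submitTopology(...) instead.
+           LocalCluster cluster = new LocalCluster();
+           cluster.submitTopology("word-topology", new Config(), builder.createTopology());
+           Utils.sleep(30000);
+           cluster.shutdown();
+       }
+   }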
diff --git a/umn/source/overview/components/storm/storm_basic_principles.rst b/umn/source/overview/components/storm/storm_basic_principles.rst new file mode 100644 index 0000000..f76ddcd --- /dev/null +++ b/umn/source/overview/components/storm/storm_basic_principles.rst @@ -0,0 +1,133 @@ +:original_name: mrs_08_00141.html + +.. _mrs_08_00141: + +Storm Basic Principles +====================== + +Apache Storm is a distributed, reliable, and fault-tolerant real-time stream data processing system. In Storm, a graph-shaped data structure called topology needs to be designed first for real-time computing. The topology will be submitted to a cluster. Then a master node in the cluster distributes codes and assigns tasks to worker nodes. A topology contains two roles: spout and bolt. A spout sends messages and sends data streams in tuples. A bolt converts the data streams and performs computing and filtering operations. The bolt can randomly send data to other bolts. Tuples sent by a spout are unchangeable arrays and map to fixed key-value pairs. + + +.. figure:: /_static/images/en-us_image_0000001296750294.png + :alt: **Figure 1** System architecture of Storm + + **Figure 1** System architecture of Storm + +Service processing logic is encapsulated in the topology of Storm. A topology is a set of spout (data sources) and bolt (logical processing) components that are connected using Stream Groupings in DAG mode. All components (spout and bolt) in a topology are working in parallel. In a topology, you can specify the parallelism for each node. Then, Storm allocates tasks in the cluster for computing to improve system processing capabilities. + + +.. figure:: /_static/images/en-us_image_0000001296270862.png + :alt: **Figure 2** Topology + + **Figure 2** Topology + +Storm is applicable to real-time analysis, continuous computing, and distributed extract, transform, and load (ETL). It has the following advantages: + +- Wide applications +- High scalability +- Zero data loss +- High fault tolerance +- Easy to construct and control +- Multi-language support + +Storm is a computing platform and provides Continuous Query Language (CQL) in the service layer to facilitate service implementation. CQL has the following features: + +- Easy to use: The CQL syntax is similar to the SQL syntax. Users who have basic knowledge of SQL can easily learn CQL and use it to develop services. +- Rich functions: In addition to basic expressions provided by SQL, CQL provides functions, such as windows, filtering, and concurrency setting, for stream processing. +- Easy to scale: CQL provides an extension API to support increasingly complex service scenarios. Users can customize the input, output, serialization, and deserialization to meet specific service requirements. +- Easy to debug: CQL provides detailed explanation of error codes, facilitating users to rectify faults. + +For details about Storm architecture and principles, see https://storm.apache.org/. + +Principle +--------- + +- **Basic Concepts** + + .. 
table:: **Table 1** Concepts + + +------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Concept | Description | + +==================+============================================================================================================================================================================================================================================================================================================================================================================================================+ + | Tuple | A tuple is an invariable key-value pair used to transfer data. Tuples are created and processed in distributed manner. | + +------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Stream | A stream is an unbounded sequence of tuples. | + +------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Topology | A topology is a real-time application running on the Storm platform. It is a Directed Acyclic Graph (DAG) composed of components. A topology can concurrently run on multiple machines. Each machine runs a part of the DAG. A topology is similar to a MapReduce job. The difference is that the topology is a resident program. Once started, the topology cannot stop unless it is manually terminated. | + +------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Spout | A spout is the source of tuples. For example, a spout may read data from a message queue, database, file system, or TCP connection and converts them as tuples, which are processed by the next component. 
| + +------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Bolt | In a Topology, a bolt is a component that receives data and executes specific logic, such as filtering or converting tuples, joining or aggregating streams, and performing statistics and result persistence. | + +------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Worker | A Worker is a physical processing in running state in a Topology. Each Worker is a JVM process. Each Topology may be executed by multiple Workers. Each Worker executes a logic subset of the Topology. | + +------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Task | A task is a spout or bolt thread of a Worker. | + +------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Stream groupings | A stream grouping specifies the tuple dispatching policies. It instructs the subsequent bolt how to receive tuples. The supported policies include Shuffle Grouping, Fields Grouping, All Grouping, Global Grouping, Non Grouping, and Directed Grouping. | + +------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + :ref:`Figure 3 ` shows a Topology (DAG) consisting of a Spout and Bolt. In the figure, a rectangle indicates a Spout or Bolt, the node in each rectangle indicate tasks, and the lines between tasks indicate streams. + + .. _mrs_08_00141__fig_topo: + + .. figure:: /_static/images/en-us_image_0000001349110525.png + :alt: **Figure 3** Topology + + **Figure 3** Topology + +- **Reliability** + + Storm provides three levels of data reliability: + + - At Most Once: The processed data may be lost, but it cannot be processed repeatedly. This reliability level offers the highest throughput. + - At Least Once: Data may be processed repeatedly to ensure reliable data transmission. 
If a response is not received within the specified time, the Spout resends the data to Bolts for processing. This reliability level may slightly affect system performance. + - Exactly Once: Data is successfully transmitted without loss or redundancy processing. This reliability level delivers the poorest performance. + + Select the reliability level based on service requirements. For example, for the services requiring high data reliability, use Exactly Once to ensure that data is processed only once. For the services insensitive to data loss, use other levels to improve system performance. + +- **Fault Tolerance** + + Storm is a fault-tolerant system that offers high availability. :ref:`Table 2 ` describes the fault tolerance of the Storm components. + + .. _mrs_08_00141__table_04: + + .. table:: **Table 2** Fault tolerance + + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scenario | Description | + +===================+==================================================================================================================================================================================================================================================================+ + | Nimbus failed | Nimbus is fail-fast and stateless. If the active Nimbus is faulty, the standby Nimbus takes over services immediately, and provide external services. | + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Supervisor failed | Supervisor is a background daemon of Workers. It is fail-fast and stateless. If a Supervisor is faulty, the Workers running on the node are not affected but cannot receive new tasks. The OMS can detect the fault of the Supervisor and restart the processes. | + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Worker failed | If a Worker is faulty, the Supervisor on the Worker will restart it again. If the restart fails for multiple times, Nimbus reassigns tasks to other nodes. | + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Node failed | If a node is faulty, all the tasks being processed by the node time out and Nimbus will assign the tasks to another node for processing. 
| + +-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Open Source Features +-------------------- + +- Distributed real-time computing + + In a Storm cluster, each machine supports the running of multiple work processes and each work process can create multiple threads. Each thread can execute multiple tasks. A task indicates concurrent data processing. + +- High fault tolerance + + During message processing, if a node or a process is faulty, the message processing unit can be redeployed. + +- Reliable messages + + Data processing methods including At-Least Once, At-Most Once, and Exactly Once are supported. + +- Security mechanism + + Storm provides Kerberos-based authentication and pluggable authorization mechanisms, supports SSL Storm UI and Log Viewer UI, and supports security integration with other big data platform components (such as ZooKeeper and HDFS). + +- Flexible topology defining and deployment + + The Flux framework is used to define and deploy service topologies. If the service DAG is changed, users only need to modify YAML domain specific language (DSL), but do not need to recompile or package service code. + +- Integration with external components + + Storm supports integration with multiple external components such as Kafka, HDFS, HBase, Redis, and JDBC/RDBMS, implementing services that involve multiple data sources. diff --git a/umn/source/overview/components/storm/storm_enhanced_open_source_features.rst b/umn/source/overview/components/storm/storm_enhanced_open_source_features.rst new file mode 100644 index 0000000..a47e5d0 --- /dev/null +++ b/umn/source/overview/components/storm/storm_enhanced_open_source_features.rst @@ -0,0 +1,14 @@ +:original_name: mrs_08_00143.html + +.. _mrs_08_00143: + +Storm Enhanced Open Source Features +=================================== + +- CQL + + Continuous Query Language (CQL) is an SQL-like language used for real-time stream processing. Compared with SQL, CQL has introduced the concept of (time-sequencing) window, which allows data to be stored and processed in the memory. The CQL output is the computing results of data streams at specific time. The use of CQL accelerates service development, enables tasks to be easily submitted to the Storm platform for real-time processing, facilitates output of results, and allows tasks to be terminated at the appropriate time. + +- High Availability + + Nimbus HA ensures continuous service processing such as adding topologies and management even if one Nimbus is faulty, improving cluster availability. diff --git a/umn/source/overview/components/tez.rst b/umn/source/overview/components/tez.rst new file mode 100644 index 0000000..a8de303 --- /dev/null +++ b/umn/source/overview/components/tez.rst @@ -0,0 +1,30 @@ +:original_name: mrs_08_0030.html + +.. _mrs_08_0030: + +Tez +=== + +Tez is Apache's latest open source computing framework that supports Directed Acyclic Graph (DAG) jobs. It can convert multiple dependent jobs into one job, greatly improving the performance of DAG jobs. If projects like Hive and `Pig `__ use Tez instead of MapReduce as the backbone of data processing, response time will be significantly reduced. Tez is built on YARN and can run MapReduce jobs without any modification. + +MRS uses Tez as the default execution engine of Hive. 
Tez remarkably surpasses the original MapReduce computing engine in terms of execution efficiency. + +For details about Tez, see https://tez.apache.org/. + +Relationship Between Tez and MapReduce +-------------------------------------- + +Tez uses a DAG to organize MapReduce tasks. In the DAG, a vertex represents a processing stage, and an edge represents the data transfer between stages. The core idea is to further split Map tasks and Reduce tasks. A Map task is split into Input-Processor-Sort-Merge-Output subtasks, and a Reduce task is split into Input-Shuffle-Sort-Merge-Process-Output subtasks. Tez flexibly regroups several small tasks to form a large DAG job. + + +.. figure:: /_static/images/en-us_image_0000001349390665.png + :alt: **Figure 1** Processes for submitting tasks using Hive on MapReduce and Hive on Tez + + **Figure 1** Processes for submitting tasks using Hive on MapReduce and Hive on Tez + +A Hive on MapReduce task contains multiple MapReduce tasks. Each task stores its intermediate results in HDFS, and the reducer in the previous step provides data for the mapper in the next step. A Hive on Tez task can complete the same processing in a single task, and HDFS does not need to be accessed between tasks. + +Relationship Between Tez and Yarn +--------------------------------- + +Tez is a computing framework running on Yarn. Its runtime environment consists of the Yarn ResourceManager and an ApplicationMaster. ResourceManager manages and allocates cluster resources, and the ApplicationMaster is responsible for splitting job data, assigning tasks, applying for resources, scheduling tasks, and tolerating faults. In addition, TezUI depends on the TimelineServer provided by Yarn to display the running process of Tez tasks. diff --git a/umn/source/overview/components/yarn/index.rst b/umn/source/overview/components/yarn/index.rst new file mode 100644 index 0000000..06b3986 --- /dev/null +++ b/umn/source/overview/components/yarn/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_08_0051.html + +.. _mrs_08_0051: + +Yarn +==== + +- :ref:`Yarn Basic Principles ` +- :ref:`Yarn HA Solution ` +- :ref:`Relationship Between YARN and Other Components ` +- :ref:`Yarn Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + yarn_basic_principles + yarn_ha_solution + relationship_between_yarn_and_other_components + yarn_enhanced_open_source_features diff --git a/umn/source/overview/components/yarn/relationship_between_yarn_and_other_components.rst b/umn/source/overview/components/yarn/relationship_between_yarn_and_other_components.rst new file mode 100644 index 0000000..73efb88 --- /dev/null +++ b/umn/source/overview/components/yarn/relationship_between_yarn_and_other_components.rst @@ -0,0 +1,94 @@ +:original_name: mrs_08_00513.html + +.. _mrs_08_00513: + +Relationship Between YARN and Other Components +============================================== + +Relationship Between YARN and Spark +----------------------------------- + +Spark computing and scheduling can be implemented in YARN mode. Spark uses the compute resources provided by YARN clusters and runs tasks in a distributed way. Spark on YARN has two modes: YARN-cluster and YARN-client. + +- YARN Cluster mode + + :ref:`Figure 1 ` describes the operation framework. + + .. _mrs_08_00513__f4a05be79f46946c490c811dadbff1ab5: + + ..
figure:: /_static/images/en-us_image_0000001349110449.png + :alt: **Figure 1** Spark on YARN-cluster operation framework + + **Figure 1** Spark on YARN-cluster operation framework + + Spark on YARN-cluster implementation process: + + #. The client generates the application information, and then sends the information to ResourceManager. + + #. ResourceManager allocates the first container (ApplicationMaster) to SparkApplication and starts the driver on the container. + + #. ApplicationMaster applies for resources from ResourceManager to run the container. + + ResourceManager allocates the containers to ApplicationMaster, which communicates with the related NodeManagers and starts the executor in the obtained container. After the executor is started, it registers with drivers and applies for tasks. + + #. Drivers allocate tasks to the executors. + + #. Executors run tasks and report the operating status to Drivers. + +- YARN Client mode + + :ref:`Figure 2 ` describes the operation framework. + + .. _mrs_08_00513__f262883214bcd497c9b530206e64b0395: + + .. figure:: /_static/images/en-us_image_0000001296750218.png + :alt: **Figure 2** Spark on YARN-client operation framework + + **Figure 2** Spark on YARN-client operation framework + + Spark on YARN-client implementation process: + + .. note:: + + In YARN-client mode, the driver is deployed and started on the client. In YARN-client mode, the client of an earlier version is incompatible. You are advised to use the YARN-cluster mode. + + #. The client sends the Spark application request to ResourceManager, then ResourceManager returns the results. The results include information such as Application ID and the maximum and minimum available resources. The client packages all information required to start ApplicationMaster, and sends the information to ResourceManager. + + #. After receiving the request, ResourceManager finds a proper node for ApplicationMaster and starts it on this node. ApplicationMaster is a role in YARN, and the process name in Spark is ExecutorLauncher. + + #. Based on the resource requirements of each task, ApplicationMaster can apply for a series of containers to run tasks from ResourceManager. + + #. After receiving the newly allocated container list (from ResourceManager), ApplicationMaster sends information to the related NodeManagers to start the containers. + + ResourceManager allocates the containers to ApplicationMaster, which communicates with the related NodeManagers and starts the executor in the obtained container. After the executor is started, it registers with drivers and applies for tasks. + + .. note:: + + Running containers are not suspended and resources are not released. + + #. Drivers allocate tasks to the executors. Executors run tasks and report the operating status to Drivers. + +Relationship Between YARN and MapReduce +--------------------------------------- + +MapReduce is a computing framework running on YARN, which is used for batch processing. MRv1 is implemented based on MapReduce in Hadoop 1.0, which is composed of programming models (new and old programming APIs), running environment (JobTracker and TaskTracker), and data processing engine (MapTask and ReduceTask). This framework is still weak in scalability, fault tolerance (JobTracker SPOF), and compatibility with multiple frameworks. (Currently, only the MapReduce computing framework is supported.) MRv2 is implemented based on MapReduce in Hadoop 2.0. 
The source code reuses MRv1 programming models and data processing engine implementation, and the running environment is composed of ResourceManager and ApplicationMaster. ResourceManager is a brand new resource manager system, and ApplicationMaster is responsible for cutting MapReduce job data, assigning tasks, applying for resources, scheduling tasks, and tolerating faults. + +Relationship Between YARN and ZooKeeper +--------------------------------------- + +:ref:`Figure 3 ` shows the relationship between ZooKeeper and YARN. + +.. _mrs_08_00513__fdf7cc662abc147c6b9ea58bc752662cd: + +.. figure:: /_static/images/en-us_image_0000001349309905.png + :alt: **Figure 3** Relationship Between ZooKeeper and YARN + + **Figure 3** Relationship Between ZooKeeper and YARN + +#. When the system is started, ResourceManager attempts to write state information to ZooKeeper. ResourceManager that first writes state information to ZooKeeper is selected as the active ResourceManager, and others are standby ResourceManagers. The standby ResourceManagers periodically monitor active ResourceManager election information in ZooKeeper. +#. The active ResourceManager creates the **Statestore** directory in ZooKeeper to store application information. If the active ResourceManager is faulty, the standby ResourceManager obtains application information from the **Statestore** directory and restores the data. + +Relationship Between YARN and Tez +--------------------------------- + +The Hive on Tez job information requires the TimeLine Server capability of YARN so that Hive tasks can display the current and historical status of applications, facilitating storage and retrieval. diff --git a/umn/source/overview/components/yarn/yarn_basic_principles.rst b/umn/source/overview/components/yarn/yarn_basic_principles.rst new file mode 100644 index 0000000..1368543 --- /dev/null +++ b/umn/source/overview/components/yarn/yarn_basic_principles.rst @@ -0,0 +1,96 @@ +:original_name: mrs_08_00511.html + +.. _mrs_08_00511: + +Yarn Basic Principles +===================== + +The Apache open source community introduces the unified resource management framework `Yarn `__ to share Hadoop clusters, improve their scalability and reliability, and eliminate a performance bottleneck of JobTracker in the early MapReduce framework. + +The fundamental idea of Yarn is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). + +.. note:: + + An application is either a single job in the classical sense of MapReduce jobs or a Directed Acyclic Graph (DAG) of jobs. + +Architecture +------------ + +ResourceManager is the essence of the layered structure of Yarn. This entity controls an entire cluster and manages the allocation of applications to underlying compute resources. The ResourceManager carefully allocates various resources (compute, memory, bandwidth, and so on) to underlying NodeManagers (Yarn's per-node agents). The ResourceManager also works with ApplicationMasters to allocate resources, and works with the NodeManagers to start and monitor their underlying applications. In this context, the ApplicationMaster has taken some of the role of the prior TaskTracker, and the ResourceManager has taken the role of the JobTracker. + +ApplicationMaster manages each instance of an application running in Yarn. 
The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManagers to monitor container execution and resource usage (CPU and memory resource allocation). + +The NodeManager manages each node in a Yarn cluster. The NodeManager provides per-node services in a cluster, from overseeing the management of a container over its lifecycle to monitoring resources and tracking the health of its nodes. MRv1 manages execution of the Map and Reduce tasks through slots, whereas the NodeManager manages abstract containers, which represent per-node resources available for a particular application. + +.. _mrs_08_00511__fig54968318273: + +.. figure:: /_static/images/en-us_image_0000001296750230.png + :alt: **Figure 1** Architecture + + **Figure 1** Architecture + +:ref:`Table 1 ` describes the components shown in :ref:`Figure 1 `. + +.. _mrs_08_00511__table8760153384813: + +.. table:: **Table 1** Architecture description + + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | Description | + +=======================+==========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Client | Client of a Yarn application. You can submit a task to ResourceManager and query the operating status of an application using the client. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ResourceManager(RM) | RM centrally manages and allocates all resources in the cluster. It receives resource reporting information from each node (NodeManager) and allocates resources to applications on the basis of the collected resources according a specified policy. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | NodeManager(NM) | NM is the agent on each node of Yarn. 
It manages the computing node in Hadoop cluster, establishes communication with ResourceManger, monitors the lifecycle of containers, monitors the usage of resources such as memory and CPU of each container, traces node health status, and manages logs and auxiliary services used by different applications. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ApplicationMaster(AM) | AM (App Mstr in the figure above) is responsible for all tasks through the lifcycle of in an application. The tasks include the following: Negotiate with an RM scheduler to obtain a resource; further allocate the obtained resources to internal tasks (secondary allocation of resources); communicate with the NM to start or stop tasks; monitor the running status of all tasks; and apply for resources for tasks again to restart the tasks when the tasks fail to be executed. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Container | A resource abstraction in Yarn. It encapsulates multi-dimensional resources (including only memory and CPU) on a certain node. When ApplicationMaster applies for resources from ResourceManager, the ResourceManager returns resources to the ApplicationMaster in a container. Yarn allocates one container for each task and the task can only use the resources encapsulated in the container. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +In Yarn, resource schedulers organize resources through hierarchical queues. This ensures that resources are allocated and shared among queues, thereby improving the usage of cluster resources. The core resource allocation model of Superior Scheduler is the same as that of Capacity Scheduler, as shown in the following figure. + +A scheduler maintains queue information. You can submit applications to one or more queues. During each NM heartbeat, the scheduler selects a queue according to a specific scheduling rule, selects an application in the queue, and then allocates resources to the application. If resources fail to be allocated to the application due to the limit of some parameters, the scheduler will select another application. 
After the selection, the scheduler processes the resource request of this application. The scheduler gives priority to requests for local resources, then to resources on the same rack, and finally to resources from any machine. + + +.. figure:: /_static/images/en-us_image_0000001296590614.png + :alt: **Figure 2** Resource allocation model + + **Figure 2** Resource allocation model + +Principle +--------- + +The new Hadoop MapReduce framework is named MRv2 or Yarn. Yarn consists of ResourceManager, ApplicationMaster, and NodeManager. + +- ResourceManager is a global resource manager that manages and allocates resources in the system. ResourceManager consists of the Scheduler and the Applications Manager. + + - The Scheduler allocates system resources to all running applications based on restrictions such as capacity and queue (for example, it allocates a certain amount of resources to a queue and executes a specific number of jobs). It allocates resources based on the demand of applications, with the container as the resource allocation unit. Functioning as a dynamic resource allocation unit, a Container encapsulates memory, CPU, disk, and network resources, thereby limiting the resources consumed by each task. In addition, the Scheduler is a pluggable component. You can design new schedulers as required. Yarn provides multiple directly available schedulers, such as Fair Scheduler and Capacity Scheduler. + - The Applications Manager manages all applications in the system. It is responsible for submitting applications, negotiating with the Scheduler about resources, enabling and monitoring ApplicationMaster, and restarting ApplicationMaster upon startup failure. + +- NodeManager is the resource and task manager of each node. On one hand, NodeManager periodically reports the resource usage of the local node and the running status of each Container to ResourceManager. On the other hand, NodeManager receives and processes requests from ApplicationMaster for starting or stopping Containers. +- ApplicationMaster is responsible for all tasks throughout the lifecycle of an application. These tasks include the following: + + - Negotiate with the RM scheduler to obtain resources. + + - Assign the obtained resources to internal components (secondary allocation of resources). + + - Communicate with NodeManager to start or stop tasks. + + - Monitor the running status of all tasks, and apply for resources again to restart tasks that fail to run. + +Capacity Scheduler Principle +---------------------------- + +Capacity Scheduler is a multi-user scheduler. It allocates resources by queue and sets the minimum/maximum resources that can be used by each queue. In addition, an upper limit of resource usage is set for each user to prevent resource abuse. Remaining resources of a queue can be temporarily shared with other queues. + +Capacity Scheduler supports multiple queues. It configures a certain amount of resources for each queue and adopts the first-in-first-out (FIFO) scheduling policy. To prevent one user's applications from exclusively using the resources in a queue, Capacity Scheduler limits the amount of resources that jobs submitted by a single user can use. During scheduling, Capacity Scheduler first calculates the amount of resources required by each queue and selects the queue that requires the least resources. Then, it allocates resources based on job priority and submission time, as well as the limits on resources and memory.
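As a minimal illustration of how a client application targets one of these queues, the sketch below sets the standard **mapreduce.job.queuename** property before submitting a MapReduce job. The queue name **analytics** and the HDFS paths are placeholders, and the mapper/reducer setup is omitted so that the identity defaults apply; this is a sketch, not a complete application:

.. code-block:: java

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

   public class QueueSubmitSketch {
       public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();
           // Standard MapReduce property naming the target scheduler queue;
           // "analytics" is a hypothetical leaf queue defined by the administrator.
           conf.set("mapreduce.job.queuename", "analytics");
           Job job = Job.getInstance(conf, "count-on-analytics-queue");
           job.setJarByClass(QueueSubmitSketch.class);
           // Input and output paths are placeholders; real mapper/reducer classes
           // would normally be set here as well.
           FileInputFormat.addInputPath(job, new Path("/tmp/queue-demo/in"));
           FileOutputFormat.setOutputPath(job, new Path("/tmp/queue-demo/out"));
           System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
   }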
Capacity Scheduler supports the following features: + +- Guaranteed capacity: As the administrator, you can set the lower and upper limits of resource usage for each queue. All applications submitted to this queue share the resources. +- High flexibility: Temporarily, the remaining resources of a queue can be shared with other queues. However, such resources must be released in case of new application submission to the queue. Such flexible resource allocation helps notably improve resource usage. +- Multi-tenancy: Multiple users can share a cluster, and multiple applications can run concurrently. To avoid exclusive resource usage by a single application, user, or queue, the administrator can add multiple constraints (for example, limit on concurrent tasks of a single application). +- Assured protection: An ACL list is provided for each queue to strictly limit user access. You can specify the users who can view your application status or control the applications. Additionally, the administrator can specify a queue administrator and a cluster system administrator. +- Dynamic update of configuration files: Administrators can dynamically modify configuration parameters to manage clusters online. + +Each queue in Capacity Scheduler can limit the resource usage. However, the resource usage of a queue determines its priority when resources are allocated to queues, indicating that queues with smaller capacity are competitive. If the throughput of a cluster is big, delay scheduling enables an application to give up cross-machine or cross-rack scheduling, and to request local scheduling. diff --git a/umn/source/overview/components/yarn/yarn_enhanced_open_source_features.rst b/umn/source/overview/components/yarn/yarn_enhanced_open_source_features.rst new file mode 100644 index 0000000..49c4b09 --- /dev/null +++ b/umn/source/overview/components/yarn/yarn_enhanced_open_source_features.rst @@ -0,0 +1,205 @@ +:original_name: mrs_08_00514.html + +.. _mrs_08_00514: + +Yarn Enhanced Open Source Features +================================== + +Priority-based task scheduling +------------------------------ + +In the native Yarn resource scheduling mechanism, if the whole Hadoop cluster resources are occupied by those MapReduce jobs submitted earlier, jobs submitted later will be kept in pending state until all running jobs are executed and resources are released. + +The MRS cluster provides the task priority scheduling mechanism. With this feature, you can define jobs of different priorities. Jobs of high priority can preempt resources released from jobs of low priority though the high-priority jobs are submitted later. The low-priority jobs that are not started will be suspended unless those jobs of high priority are completed and resources are released, then they can properly be started. + +This feature enables services to control computing jobs more flexibly, thereby achieving higher cluster resource utilization. + +.. note:: + + Container reuse is in conflict with task priority scheduling. If container reuse is enabled, resources are being occupied, and task priority scheduling does not take effect. + +Yarn Permission Control +----------------------- + +The permission mechanism of Hadoop Yarn is implemented through ACLs. The following describes how to grant different permission control to different users: + +- Admin ACL + + An O&M administrator is specified for the Yarn cluster. The Admin ACL is determined by **yarn.admin.acl**. 
The cluster O&M administrator can access the ResourceManager web UI and operate NodeManager nodes, queues, and NodeLabel, **but cannot submit tasks**. + +- Queue ACL + + To facilitate user management in the cluster, users or user groups are divided into several queues to which each user and user group belongs. Each queue contains permissions to submit and manage applications (for example, terminate any application). + +Open source functions: + +Currently, Yarn supports the following roles for users: + +- Cluster O&M administrator +- Queue administrator +- Common user + +However, the APIs (such as the web UI, REST API, and Java API) provided by Yarn do not support role-specific permission control. Therefore, all users have the permission to access the application and cluster information, which does not meet the isolation requirements in the multi-tenant scenario. + +This is an enhanced function. + +In security mode, permission management is enhanced for the APIs such as web UI, REST API, and Java API provided by Yarn. Permission control can be performed based on user roles. + +Role-based permissions are as follows: + +- Cluster O&M administrator: performs management operations in the Yarn cluster, such as accessing the ResourceManager web UI, refreshing queues, setting NodeLabel, and performing active/standby switchover. +- Queue administrator: has the permission to modify and view queues managed by the Yarn cluster. +- Common user: has the permission to modify and view self-submitted applications in the Yarn cluster. + +Superior Scheduler Principle (Self-developed) +--------------------------------------------- + +Superior Scheduler is a scheduling engine designed for the Hadoop Yarn distributed resource management system. It is a high-performance and enterprise-level scheduler designed for converged resource pools and multi-tenant service requirements. + +Superior Scheduler achieves all functions of open source schedulers, Fair Scheduler, and Capacity Scheduler. Compared with the open source schedulers, Superior Scheduler is enhanced in the enterprise multi-tenant resource scheduling policy, resource isolation and sharing among users in a tenant, scheduling performance, system resource usage, and cluster scalability. Superior Scheduler is designed to replace open source schedulers. + +Similar to open source Fair Scheduler and Capacity Scheduler, Superior Scheduler follows the Yarn scheduler plugin API to interact with Yarn ResourceManager to offer resource scheduling functionalities. :ref:`Figure 1 ` shows the overall system diagram. + +.. _mrs_08_00514__f6e2f095c1ef043d0ba716e26731052b3: + +.. figure:: /_static/images/en-us_image_0000001296430794.jpg + :alt: **Figure 1** Internal architecture of Superior Scheduler + + **Figure 1** Internal architecture of Superior Scheduler + +In :ref:`Figure 1 `, Superior Scheduler consists of the following modules: + +- Superior Scheduler Engine is a high performance scheduler engine with rich scheduling policies. + +- Superior Yarn Scheduler Plugin functions as a bridge between Yarn ResourceManager and Superior Scheduler Engine and interacts with Yarn ResourceManager. + + The scheduling principle of open source schedulers is that resources match jobs based on the heartbeats of computing nodes. Specifically, each computing node periodically sends heartbeat messages to ResourceManager of Yarn to notify the node status and starts the scheduler to assign jobs to the node itself. In this scheduling mechanism, the scheduling period depends on the heartbeat. 
If the cluster scale increases, bottleneck on system scalability and scheduling performance may occur. In addition, because resources match jobs, the scheduling accuracy of an open source scheduler is limited. For example, data affinity is random and the system does not support load-based scheduling policies. The scheduler may not make the best choice due to lack of the global resource view when selecting jobs. + + Superior Scheduler adopts multiple scheduling mechanisms. There are dedicated scheduling threads in Superior Scheduler, separating heartbeats with scheduling and preventing system heartbeat storms. Additionally, Superior Scheduler matches jobs with resources, providing each scheduled job with a global resource view and increasing the scheduling accuracy. Compared with the open source scheduler, Superior Scheduler excels in system throughput, resource usage, and data affinity. + + +.. figure:: /_static/images/en-us_image_0000001349110493.png + :alt: **Figure 2** Comparison of Superior Scheduler with open source schedulers + + **Figure 2** Comparison of Superior Scheduler with open source schedulers + +Apart from the enhanced system throughput and utilization, Superior Scheduler provides following major scheduling features: + +- Multiple resource pools + + Multiple resource pools help logically divide cluster resources and share them among multiple tenants or queues. The division of resource pools supports heterogeneous resources. Resource pools can be divided exactly according to requirements on the application resource isolation. You can configure further policies for different queues in a pool. + +- Multi-tenant scheduling (**reserve**, **min**, **share**, and **max**) in each resource pool + + Superior Scheduler provides flexible hierarchical multi-tenant scheduling policy. Different policies can be configured for different tenants or queues that can access different resource pools. The following figure lists supported policies: + + .. table:: **Table 1** Policy description + + +---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | Description | + +=========+=================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | reserve | This policy is used to reserve resources for a tenant. 
Even though tenant has no jobs available, other tenant cannot use the reserved resource. The value can be a percentage or an absolute value. If both the percentage and absolute value are configured, the percentage is automatically calculated into an absolute value, and the larger value is used. The default **reserve** value is **0**. Compared with the method of specifying a dedicated resource pool and hosts, the **reserve** policy provides a flexible floating reservation function. In addition, because no specific hosts are specified, the data affinity for calculation is improved and the impact by the faulty hosts is avoided. | + +---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | min | This policy allows preemption of minimum resources. Other tenants can use these resources, but the current tenant has the priority to use them. The value can be a percentage or an absolute value. If both the percentage and absolute value are configured, the percentage is automatically calculated into an absolute value, and the larger value is used. The default value is **0**. | + +---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | share | This policy is used for shared resources that cannot be preempted. To use these resources, the current tenant needs to wait for other tenants to complete jobs and release resources. The value can be a percentage or an absolute value. | + +---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | max | This policy is used for the maximum resources that can be utilized. The tenant cannot obtain more resources than the allowed maximum value. 
The value can be a percentage or an absolute value. If both the percentage and absolute value are configured, the percentage is automatically calculated into an absolute value, and the larger value is used. By default value, there is no restriction on resources. | + +---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + :ref:`Figure 3 ` shows the tenant resource allocation policy. + + .. _mrs_08_00514__fe57392ff64944567bd0cb78bbc2aaeeb: + + .. figure:: /_static/images/en-us_image_0000001296590646.jpg + :alt: **Figure 3** Resource scheduling policies + + **Figure 3** Resource scheduling policies + + .. note:: + + In the above figure, **Total** indicates the total number of resources, not the scheduling policy. + + Compared with open source schedulers, Superior Scheduler supports both percentage and absolute value of tenants for allocating resources, flexibly addressing resource scheduling requirements of enterprise-level tenants. For example, resources can be allocated according to the absolute value of level-1 tenants, avoiding impact caused by changes of cluster scale. However, resources can be allocated according to the allocation percentage of sub-tenants, improving resource usages in the level-1 tenant. + +- Heterogeneous and multi-dimensional resource scheduling + + Superior Scheduler supports following functions except CPU and memory scheduling: + + - `Node labels `__ can be used to identify multi-dimensional attributes of nodes such as **GPU_ENABLED** and **SSD_ENBALED**, and can be scheduled based on these labels. + - Resource pools can be used to group resources of the same type and allocate them to specific tenants or queues. + +- Fair scheduling of multiple users in a tenant + + In a leaf tenant, multiple users can use the same queue to submit jobs. Compared with the open source schedulers, Superior Scheduler supports configuring flexible resource sharing policy among different users in a same tenant. For example, VIP users can be configured with higher resource access weight. + +- Data locality aware scheduling + + Superior Scheduler adopts the job-to-node scheduling policy. That is, Superior Scheduler attempts to schedule specified jobs between available nodes so that the selected node is suitable for the specified jobs. By doing so, the scheduler will have an overall view of the cluster and data. Localization is ensured if there is an opportunity to place tasks closer to the data. The open source scheduler uses the node-to-job scheduling policy to match the appropriate jobs to a given node. + +- Dynamic resource reservation during container scheduling + + In a heterogeneous and diversified computing environment, some containers need more resources or multiple resources. For example, Spark job may require large memory. 
When such containers compete with containers requiring fewer resources, containers requiring more resources may not obtain sufficient resources within a reasonable period. Open source schedulers allocate resources to jobs, which may cause unreasonable resource reservation for these jobs. This mechanism leads to the waste of overall system resources. Superior Scheduler differs from open source schedulers in following aspects: + + - Requirement-based matching: Superior Scheduler schedules jobs to nodes and selects appropriate nodes to reserve resources to improve the startup time of containers and avoid waste. + - Tenant rebalancing: When the reservation logic is enabled, the open source schedulers do not comply with the configured sharing policy. Superior Scheduler uses different methods. In each scheduling period, Superior Scheduler traverses all tenants and attempts to balance resources based on the multi-tenant policy. In addition, Superior Scheduler attempts to meet all policies (**reserve**, **min**, and **share**) to release reserved resources and direct available resources to other containers that should obtain resources under different tenants. + +- Dynamic queue status control (**Open**/**Closed**/**Active**/**Inactive**) + + Multiple queue statuses are supported, helping administrators operate and maintain multiple tenants. + + - Open status (**Open/Closed**): If the status is **Open** by default, applications submitted to the queue are accepted. If the status is **Closed**, no application is accepted. + - Active status (**Active/Inactive**): If the status is **Active** by default, resources can be scheduled and allocated to applications in the tenant. Resources will not be scheduled to queues in **Inactive** status. + +- Application pending reason + + If the application is not started, provide the job pending reasons. + +:ref:`Table 2 ` describes the comparison result of Superior Scheduler and Yarn open source schedulers. + +.. _mrs_08_00514__t233dcb46b3d246b38e43771ca2810229: + +.. table:: **Table 2** Comparative analysis + + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scheduling | Yarn Open Source Scheduler | Superior Scheduler | + +===============================================+==============================================================================================================================================================================================================================================================+================================================================================================================================================+ + | Multi-tenant scheduling | In homogeneous clusters, either Capacity Scheduler or Fair Scheduler can be selected and the cluster does not support Fair Scheduler. Capacity Scheduler supports the scheduling by percentage and Fair Scheduler supports the scheduling by absolute value. | - Supports heterogeneous clusters and multiple resource pools. | + | | | - Supports **reservation** to ensure direct access to resources. 
| + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data locality aware scheduling | The node-to-job scheduling policy reduces the success rate of data localization and potentially affects application execution performance. | The **job-to-node scheduling policy** can aware data location more accurately, and the job hit rate of data localization scheduling is higher. | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ + | Balanced scheduling based on load of hosts | Not supported | **Balanced scheduling can be achieved when Superior Scheduler considers the host load and resource allocation during scheduling.** | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ + | Fair scheduling of multiple users in a tenant | Not supported | Supports keywords **default** and **others**. | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ + | Job waiting reason | Not supported | Job waiting reasons illustrate why a job needs to wait. | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ + +In conclusion, Superior Scheduler is a high-performance scheduler with various scheduling policies and is better than Capacity Scheduler in terms of functionality, performance, resource usage, and scalability. + +CPU Hard Isolation +------------------ + +Yarn cannot strictly control the CPU resources used by each container. When the CPU subsystem is used, a container may occupy excessive resources. Therefore, CPUset is used to control resource allocation. 
+ +To solve this problem, the CPU resources are allocated to each container based on the ratio of virtual cores (vCores) to physical cores. If a container requires an entire physical core, it gets one. If a container needs only part of a physical core, several containers may share the same physical core. The following figure shows an example of the CPU quota, where the ratio of vCores to physical cores is 2:1. + + +.. figure:: /_static/images/en-us_image_0000001296750262.png + :alt: **Figure 4** CPU quota + + **Figure 4** CPU quota + +Enhanced Open Source Feature: Optimizing Restart Performance +------------------------------------------------------------ + +Generally, a recovered ResourceManager loads both running and completed applications. However, a large number of completed applications may cause problems such as slow startup and a long HA switchover/restart time of ResourceManagers. + +To speed up the startup, only the list of unfinished applications is obtained before the ResourceManagers are started, while completed applications continue to be recovered by an asynchronous background thread. The following figure shows how the ResourceManager recovery starts. + + +.. figure:: /_static/images/en-us_image_0000001349190361.jpg + :alt: **Figure 5** Starting the ResourceManager recovery + + **Figure 5** Starting the ResourceManager recovery diff --git a/umn/source/overview/components/yarn/yarn_ha_solution.rst b/umn/source/overview/components/yarn/yarn_ha_solution.rst new file mode 100644 index 0000000..0302bd5 --- /dev/null +++ b/umn/source/overview/components/yarn/yarn_ha_solution.rst @@ -0,0 +1,32 @@ +:original_name: mrs_08_00512.html + +.. _mrs_08_00512: + +Yarn HA Solution +================ + +HA Principles and Implementation Solution +----------------------------------------- + +ResourceManager in Yarn manages resources and schedules tasks in the cluster. In versions earlier than Hadoop 2.4, the ResourceManager was a SPOF in the Yarn cluster. The Yarn HA solution uses redundant ResourceManager nodes to tackle challenges of service reliability and fault tolerance. + +.. _mrs_08_00512__fe00424623c8c4584ad4fa8e1de6d0ab9: + +.. figure:: /_static/images/en-us_image_0000001296430826.png + :alt: **Figure 1** ResourceManager HA architecture + + **Figure 1** ResourceManager HA architecture + +ResourceManager HA is achieved using active-standby ResourceManager nodes, as shown in :ref:`Figure 1 `. Similar to the HDFS HA solution, ResourceManager HA allows only one ResourceManager node to be in the active state at any time. When the active ResourceManager fails, the active-standby switchover can be triggered automatically or manually. + +When the automatic failover function is not enabled, administrators need to run the **yarn rmadmin** command to manually switch one of the ResourceManager nodes to the active state after the Yarn cluster is started. Upon a planned maintenance event or a fault, they are expected to first demote the active ResourceManager to the standby state and then promote the standby ResourceManager to the active state. + +When the automatic switchover is enabled, a built-in, ZooKeeper-based ActiveStandbyElector decides which ResourceManager node should be the active one. When the active ResourceManager is faulty, another ResourceManager node is automatically selected as the active one to take over from the faulty node.
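The failover behavior described above is driven entirely by configuration. As a minimal, illustrative sketch (assuming the Hadoop client libraries and a **yarn-site.xml** on the classpath; the property names are the standard Hadoop ones, and the rm1/rm2 IDs are examples only), a Java client could inspect the HA settings as follows:

.. code-block:: java

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.yarn.conf.YarnConfiguration;

   public class RmHaConfigCheck {
       public static void main(String[] args) {
           // YarnConfiguration loads yarn-site.xml from the client classpath.
           Configuration conf = new YarnConfiguration();
           boolean haEnabled = conf.getBoolean("yarn.resourcemanager.ha.enabled", false);
           System.out.println("ResourceManager HA enabled: " + haEnabled);
           if (haEnabled) {
               // List every configured ResourceManager ID and the host it maps to.
               for (String rmId : conf.getStrings("yarn.resourcemanager.ha.rm-ids", new String[0])) {
                   String host = conf.get("yarn.resourcemanager.hostname." + rmId);
                   System.out.println(rmId + " -> " + host);
               }
           }
       }
   }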
+ +When ResourceManager nodes in the cluster are deployed in HA mode, the configuration **yarn-site.xml** used by clients needs to list all the ResourceManager nodes. The client (including ApplicationMaster and NodeManager) searches for the active ResourceManager in polling mode. That is, the client needs to provide the fault tolerance mechanism. If the active ResourceManager cannot be connected with, the client continuously searches for a new one in polling mode. + +After the standby ResourceManager promotes to be the active one, the upper-layer applications can recover to their status when the fault occurs. (For details, see `ResourceManger Restart `__.) When ResourceManager Restart is enabled, the restarted ResourceManager node loads the information of the previous active ResourceManager node, and takes over container status information on all NodeManager nodes to continue service running. In this way, status information can be saved by periodically executing checkpoint operations, avoiding data loss. Ensure that both active and standby ResourceManager nodes can access the status information. Currently, three methods are provided for sharing status information by file system (FileSystemRMStateStore), LevelDB database (LeveldbRMStateStore), and ZooKeeper (ZKRMStateStore). Among them, only ZKRMStateStore supports the Fencing mechanism. By default, Hadoop uses ZKRMStateStore. + +For more information about the Yarn HA solution, visit the following website: + +http://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html diff --git a/umn/source/overview/components/zookeeper/index.rst b/umn/source/overview/components/zookeeper/index.rst new file mode 100644 index 0000000..1323078 --- /dev/null +++ b/umn/source/overview/components/zookeeper/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_08_0070.html + +.. _mrs_08_0070: + +ZooKeeper +========= + +- :ref:`ZooKeeper Basic Principle ` +- :ref:`Relationship Between ZooKeeper and Other Components ` +- :ref:`ZooKeeper Enhanced Open Source Features ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + zookeeper_basic_principle + relationship_between_zookeeper_and_other_components + zookeeper_enhanced_open_source_features diff --git a/umn/source/overview/components/zookeeper/relationship_between_zookeeper_and_other_components.rst b/umn/source/overview/components/zookeeper/relationship_between_zookeeper_and_other_components.rst new file mode 100644 index 0000000..a1e6750 --- /dev/null +++ b/umn/source/overview/components/zookeeper/relationship_between_zookeeper_and_other_components.rst @@ -0,0 +1,69 @@ +:original_name: mrs_08_00702.html + +.. _mrs_08_00702: + +Relationship Between ZooKeeper and Other Components +=================================================== + +Relationship Between ZooKeeper and HDFS +--------------------------------------- + +:ref:`Figure 1 ` shows the relationship between ZooKeeper and HDFS. + +.. _mrs_08_00702__f4e063184638a4990ada3505cd92cf4e6: + +.. figure:: /_static/images/en-us_image_0000001349190409.png + :alt: **Figure 1** Relationship between ZooKeeper and HDFS + + **Figure 1** Relationship between ZooKeeper and HDFS + +As the client of a ZooKeeper cluster, ZKFailoverController (ZKFC) monitors the status of NameNode. ZKFC is deployed only in the node where NameNode resides, and in both the active and standby HDFS NameNodes. + +#. The ZKFC connects to ZooKeeper and saves information such as host names to ZooKeeper under the znode directory **/hadoop-ha**. 
The NameNode that creates the directory first is considered the active node, and the other is the standby node. The NameNodes periodically read the NameNode information through ZooKeeper. +#. When the process of the active node ends abnormally, the standby NameNode detects changes in the **/hadoop-ha** directory through ZooKeeper, and then takes over the service of the active NameNode. + +Relationship Between ZooKeeper and YARN +--------------------------------------- + +:ref:`Figure 2 ` shows the relationship between ZooKeeper and YARN. + +.. _mrs_08_00702__fde9fecb356f3437c94d75e7544ed0cc3: + +.. figure:: /_static/images/en-us_image_0000001296750310.png + :alt: **Figure 2** Relationship between ZooKeeper and YARN + + **Figure 2** Relationship between ZooKeeper and YARN + +#. When the system is started, the ResourceManagers attempt to write state information to ZooKeeper. The ResourceManager that first writes state information to ZooKeeper is selected as the active ResourceManager, and the others are standby ResourceManagers. The standby ResourceManagers periodically monitor active ResourceManager election information in ZooKeeper. +#. The active ResourceManager creates the **Statestore** directory in ZooKeeper to store application information. If the active ResourceManager is faulty, the standby ResourceManager obtains application information from the **Statestore** directory and restores the data. + +Relationship Between ZooKeeper and HBase +---------------------------------------- + +:ref:`Figure 3 ` shows the relationship between ZooKeeper and HBase. + +.. _mrs_08_00702__f2b849e3767fb4026b214c0d33566684d: + +.. figure:: /_static/images/en-us_image_0000001349390705.png + :alt: **Figure 3** Relationship between ZooKeeper and HBase + + **Figure 3** Relationship between ZooKeeper and HBase + +#. Each HRegionServer registers itself with ZooKeeper by creating an ephemeral node. ZooKeeper stores the HBase information, including the HBase metadata and HMaster addresses. +#. HMaster detects the health status of each HRegionServer through ZooKeeper and monitors them. +#. HBase supports multiple HMaster nodes (like HDFS NameNodes). When the active HMaster is faulty, the standby HMaster obtains the state information about the entire cluster through ZooKeeper. That is, ZooKeeper helps HBase avoid SPOFs. + +Relationship Between ZooKeeper and Kafka +---------------------------------------- + +:ref:`Figure 4 ` shows the relationship between ZooKeeper and Kafka. + +.. _mrs_08_00702__fig17441129145312: + +.. figure:: /_static/images/en-us_image_0000001296590694.png + :alt: **Figure 4** Relationship between ZooKeeper and Kafka + + **Figure 4** Relationship between ZooKeeper and Kafka + +#. Brokers use ZooKeeper to register broker information and elect a partition leader. +#. Consumers use ZooKeeper to register consumer information, including the partition list of each consumer. In addition, ZooKeeper is used to discover the broker list, establish a socket connection with the partition leader, and obtain messages. diff --git a/umn/source/overview/components/zookeeper/zookeeper_basic_principle.rst b/umn/source/overview/components/zookeeper/zookeeper_basic_principle.rst new file mode 100644 index 0000000..63bcc2d --- /dev/null +++ b/umn/source/overview/components/zookeeper/zookeeper_basic_principle.rst @@ -0,0 +1,71 @@ +:original_name: mrs_08_00701.html + +.. _mrs_08_00701: + +ZooKeeper Basic Principle +========================= + +ZooKeeper Overview +------------------ + +ZooKeeper is a distributed, highly available coordination service.
ZooKeeper provides two functions: + +- Prevents the system from single point of failures (SPOFs) and provides reliable services for applications. +- Provides distributed coordination services and manages configuration information. + +ZooKeeper Architecture +---------------------- + +Nodes in a ZooKeeper cluster have three roles: Leader, Follower, and Observer. :ref:`Figure 1 ` shows the ZooKeeper architecture. Generally, an odd number (2N+1) of ZooKeeper servers are configured. At least (N+1) vote majority is required to successfully perform write operation. + +.. _mrs_08_00701__f85514e7f75a34c4e9e1ad8c4b774f6d2: + +.. figure:: /_static/images/en-us_image_0000001349310065.png + :alt: **Figure 1** ZooKeeper architecture + + **Figure 1** ZooKeeper architecture + +:ref:`Table 1 ` describes the functions of each module shown in :ref:`Figure 1 `. + +.. _mrs_08_00701__t2851557cbd1e4fadbbd7f2e21eb7b070: + +.. table:: **Table 1** ZooKeeper modules + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Module | Description | + +===================================+===============================================================================================================================================================================================================================================================+ + | Leader | Only one node serves as the Leader in a ZooKeeper cluster. The Leader, elected by Followers using the ZooKeeper Atomic Broadcast (ZAB) protocol, receives and coordinates all write requests and synchronizes written information to Followers and Observers. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Follower | Follower has two functions: | + | | | + | | - Prevents SPOF. A new Leader is elected from Followers when the Leader is faulty. | + | | - Processes read requests and interacts with the Leader to process write requests. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Observer | The Observer does not take part in voting for election and write requests. It only processes read requests and forwards write requests to the Leader, increasing system processing efficiency. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Client | Reads and writes data from or to the ZooKeeper cluster. For example, HBase can serve as a ZooKeeper client and use the arbitration function of the ZooKeeper cluster to control the active/standby status of the HMaster. 
| + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +If security services are enabled in the cluster, authentication is required during the connection to ZooKeeper. The authentication modes are as follows: + +- keytab mode: Obtain a human-machine user from the administrator for login to the platform and authentication, and obtain the keytab file of the user. +- Ticket mode: Obtain a human-machine user from the administrator for subsequent secure login, enable the renewable and forwardable functions of the Kerberos service, set the ticket update interval, and restart Kerberos and related components. + + .. note:: + + - The default validity period of a user password is 90 days. Therefore, the validity period of the obtained keytab file is 90 days. To prolong the validity period of the keytab file, modify the user password policy and obtain the keytab file again. For details, see the *Administrator Guide*. + - The parameters for enabling the renewable and forwardable functions and setting the ticket update interval are on the **System** tab of the Kerberos service configuration page. The ticket update interval can be set to **kdc_renew_lifetime** or **kdc_max_renewable_life** based on the actual situation. + +ZooKeeper Principle +------------------- + +- **Write Request** + + #. After the Follower or Observer receives a write request, the Follower or Observer sends the request to the Leader. + #. The Leader coordinates Followers to determine whether to accept the write request by voting. + #. If more than half of voters return a write success message, the Leader submits the write request and returns a success message. Otherwise, a failure message is returned. + #. The Follower or Observer returns the processing results. + +- **Read Request** + + The client directly reads data from the Leader, Follower, or Observer. diff --git a/umn/source/overview/components/zookeeper/zookeeper_enhanced_open_source_features.rst b/umn/source/overview/components/zookeeper/zookeeper_enhanced_open_source_features.rst new file mode 100644 index 0000000..0de852b --- /dev/null +++ b/umn/source/overview/components/zookeeper/zookeeper_enhanced_open_source_features.rst @@ -0,0 +1,129 @@ +:original_name: mrs_08_00703.html + +.. _mrs_08_00703: + +ZooKeeper Enhanced Open Source Features +======================================= + +Enhanced Log +------------ + +In security mode, an ephemeral node is deleted as long as the session that created the node expires. Ephemeral node deletion is recorded in audit logs so that ephemeral node status can be obtained. + +Usernames must be added to audit logs for all operations performed on ZooKeeper clients. + +On the ZooKeeper client, create a znode, of which the Kerberos principal is **zkcli/hadoop.**\ **\ **@**\ **. + +For example, open the **/zookeeper_audit.log** file. The file content is as follows: + +.. 
code-block:: + + 2016-12-28 14:17:10,505 | INFO | CommitProcWorkThread-4 | session=0x12000007553b4903?user=10.177.223.78,zkcli/hadoop.hadoop.com@HADOOP.COM?ip=10.177.223.78?operation=create znode?target=ZooKeeperServer?znode=/test1?result=success + 2016-12-28 14:17:10,530 | INFO | CommitProcWorkThread-4 | session=0x12000007553b4903?user=10.177.223.78,zkcli/hadoop.hadoop.com@HADOOP.COM?ip=10.177.223.78?operation=create znode?target=ZooKeeperServer?znode=/test2?result=success + 2016-12-28 14:17:10,550 | INFO | CommitProcWorkThread-4 | session=0x12000007553b4903?user=10.177.223.78,zkcli/hadoop.hadoop.com@HADOOP.COM?ip=10.177.223.78?operation=create znode?target=ZooKeeperServer?znode=/test3?result=success + 2016-12-28 14:17:10,570 | INFO | CommitProcWorkThread-4 | session=0x12000007553b4903?user=10.177.223.78,zkcli/hadoop.hadoop.com@HADOOP.COM?ip=10.177.223.78?operation=create znode?target=ZooKeeperServer?znode=/test4?result=success + 2016-12-28 14:17:10,592 | INFO | CommitProcWorkThread-4 | session=0x12000007553b4903?user=10.177.223.78,zkcli/hadoop.hadoop.com@HADOOP.COM?ip=10.177.223.78?operation=create znode?target=ZooKeeperServer?znode=/test5?result=success + 2016-12-28 14:17:10,613 | INFO | CommitProcWorkThread-4 | session=0x12000007553b4903?user=10.177.223.78,zkcli/hadoop.hadoop.com@HADOOP.COM?ip=10.177.223.78?operation=create znode?target=ZooKeeperServer?znode=/test6?result=success + 2016-12-28 14:17:10,633 | INFO | CommitProcWorkThread-4 | session=0x12000007553b4903?user=10.177.223.78,zkcli/hadoop.hadoop.com@HADOOP.COM?ip=10.177.223.78?operation=create znode?target=ZooKeeperServer?znode=/test7?result=success + +The content shows that logs of the ZooKeeper client user **zkcli/hadoop.hadoop.com@HADOOP.COM** are added to the audit log. + +**User details in ZooKeeper** + +In ZooKeeper, different authentication schemes use different credentials as users. Depending on the authentication provider, any parameter can be used as the user identity. + +Example: + +- **SASLAuthenticationProvider** uses the client principal as a user. +- **X509AuthenticationProvider** uses the client certificate as a user. +- **IPAuthenticationProvider** uses the client IP address as a user. +- A username can be obtained from a custom authentication provider by implementing the **org.apache.zookeeper.server.auth.ExtAuthenticationProvider.getUserName(String)** method. If the method is not implemented, getting the username from the authentication provider instance will be skipped. + +Enhanced Open Source Feature: ZooKeeper SSL Communication (Netty Connection) +---------------------------------------------------------------------------- + +The ZooKeeper design contains the NIO package, which does not support SSL in versions earlier than 3.5. To solve this problem, Netty is added to ZooKeeper. Therefore, if you need to use SSL, enable Netty and set the following parameters on the server and client: + +The open source server supports only plaintext passwords, which may cause security problems. Therefore, such plaintext passwords are no longer used on the server. + +- Client + + #. Set **-Dzookeeper.client.secure** in the **zkCli.sh/zkEnv.sh** file to **true** to use secure communication on the client. Then, the client can connect to the secureClientPort on the server. #.
Set the following parameters in the **zkCli.sh/zkEnv.sh** file to configure the client environment: + + +-------------------------------------+---------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=====================================+===================================================================================================+ + | -Dzookeeper.clientCnxnSocket | Used for Netty communication between clients. | + | | | + | | Default value: **org.apache.zookeeper.ClientCnxnSocketNetty** | + +-------------------------------------+---------------------------------------------------------------------------------------------------+ + | -Dzookeeper.ssl.keyStore.location | Indicates the path for storing the keystore file. | + +-------------------------------------+---------------------------------------------------------------------------------------------------+ + | -Dzookeeper.ssl.keyStore.password | Encrypts a password. | + +-------------------------------------+---------------------------------------------------------------------------------------------------+ + | -Dzookeeper.ssl.trustStore.location | Indicates the path for storing the truststore file. | + +-------------------------------------+---------------------------------------------------------------------------------------------------+ + | -Dzookeeper.ssl.trustStore.password | Encrypts a password. | + +-------------------------------------+---------------------------------------------------------------------------------------------------+ + | -Dzookeeper.config.crypt.class | Decrypts an encrypted password. | + +-------------------------------------+---------------------------------------------------------------------------------------------------+ + | -Dzookeeper.ssl.password.encrypted | Default value: **false** | + | | | + | | If the keystore and truststore passwords are encrypted, set this parameter to **true**. | + +-------------------------------------+---------------------------------------------------------------------------------------------------+ + | -Dzookeeper.ssl.enabled.protocols | Defines the SSL protocols to be enabled for the SSL context. | + +-------------------------------------+---------------------------------------------------------------------------------------------------+ + | -Dzookeeper.ssl.exclude.cipher.ext | Defines the list of passwords separated by a comma which should be excluded from the SSL context. | + +-------------------------------------+---------------------------------------------------------------------------------------------------+ + + .. note:: + + The preceding parameters must be set in the **zkCli.sh/zk.Env.sh** file. + +- Server + + #. Set **secureClientPort** to **3381** in the **zoo.cfg** file. + #. Set **zookeeper.serverCnxnFactory** to **org.apache.zookeeper.server.NettyServerCnxnFactory** in the **zoo.cfg** file on the server. + #. 
Set the following parameters in the **zoo.cfg** file (in the **zookeeper/conf/zoo.cfg** path) to configure the server environment: + + +-----------------------------------+---------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===================================================================================================+ + | ssl.keyStore.location | Path for storing the **keystore.jks** file | + +-----------------------------------+---------------------------------------------------------------------------------------------------+ + | ssl.keyStore.password | Encrypts a password. | + +-----------------------------------+---------------------------------------------------------------------------------------------------+ + | ssl.trustStore.location | Indicates the path for storing the truststore file. | + +-----------------------------------+---------------------------------------------------------------------------------------------------+ + | ssl.trustStore.password | Encrypts a password. | + +-----------------------------------+---------------------------------------------------------------------------------------------------+ + | config.crypt.class | Decrypts an encrypted password. | + +-----------------------------------+---------------------------------------------------------------------------------------------------+ + | ssl.keyStore.password.encrypted | Default value: **false** | + | | | + | | If this parameter is set to **true**, the encrypted password can be used. | + +-----------------------------------+---------------------------------------------------------------------------------------------------+ + | ssl.trustStore.password.encrypted | Default value: **false** | + | | | + | | If this parameter is set to **true**, the encrypted password can be used. | + +-----------------------------------+---------------------------------------------------------------------------------------------------+ + | ssl.enabled.protocols | Defines the SSL protocols to be enabled for the SSL context. | + +-----------------------------------+---------------------------------------------------------------------------------------------------+ + | ssl.exclude.cipher.ext | Defines the list of passwords separated by a comma which should be excluded from the SSL context. | + +-----------------------------------+---------------------------------------------------------------------------------------------------+ + + #. Start ZKserver and connect the security client to the security port. + +- Credential + + The credential used between client and server in ZooKeeper is **X509AuthenticationProvider**. This credential is initialized using the server certificates specified and trusted by the following parameters: + + - zookeeper.ssl.keyStore.location + - zookeeper.ssl.keyStore.password + - zookeeper.ssl.trustStore.location + - zookeeper.ssl.trustStore.password + + .. note:: + + If you do not want to use default mechanism of ZooKeeper, then it can be configured with different trust mechanisms as needed. diff --git a/umn/source/overview/constraints.rst b/umn/source/overview/constraints.rst new file mode 100644 index 0000000..41d1737 --- /dev/null +++ b/umn/source/overview/constraints.rst @@ -0,0 +1,33 @@ +:original_name: mrs_08_0027.html + +.. _mrs_08_0027: + +Constraints +=========== + +Before using MRS, ensure that you have read and understand the following restrictions. 
+ +- MRS clusters must be created in VPC subnets. +- You are advised to use any of the following browsers to access MRS: + + - Google Chrome: 36.0 or later + + - Internet Explorer: 9.0 or later + + If you use Internet Explorer 9.0, you may fail to log in to the MRS management console because user **Administrator** is disabled by default in some Windows systems, such as Windows 7 Ultimate. Internet Explorer automatically selects a system user for installation. As a result, Internet Explorer cannot access the management console. Reinstall Internet Explorer 9.0 or later (recommended) or run Internet Explorer 9.0 as user **Administrator**. + +- When you create an MRS cluster, you can select **Auto create** from the drop-down list of **Security Group** to create a security group or select an existing security group. After the MRS cluster is created, do not delete or modify the used security group. Otherwise, a cluster exception may occur. +- To prevent illegal access, only assign access permission for security groups used by MRS where necessary. +- Do not perform the following operations because they will cause cluster exceptions: + + - Shutting down, restarting, or deleting MRS cluster nodes displayed in ECS, changing or reinstalling their OS, or modifying their specifications. + - Deleting the existing processes, applications or files on cluster nodes. + +- If a cluster exception occurs when no incorrect operations have been performed, contact technical support engineers. They will ask you for your key and then perform troubleshooting. +- Plan disks of cluster nodes based on service requirements. If you want to store a large volume of service data, add EVS disks or storage space to prevent insufficient storage space from affecting node running. +- The cluster nodes store only users' service data. Non-service data can be stored in the OBS or other ECS nodes. +- The cluster nodes only run MRS cluster programs. Other client applications or user service programs are deployed on separate ECS nodes. +- When you expand the storage capacity of nodes (including master, core, and task) in an MRS cluster, you are advised to create new disks and then attach them to the nodes. +- The capacity (including storage and computing capabilities) of an MRS cluster can be expanded by adding core or task nodes. +- If the cluster is still used to execute tasks or modify configurations after a master node in the cluster is stopped, and other master nodes in the cluster are stopped before the stopped master node is started after task execution or configuration modification, data may be lost due to an active/standby switchover. In this scenario, after the task is executed or the configuration is modified, start the master node that has been stopped and then stop all nodes. If all nodes in the cluster have been stopped, start them in the reverse order of node shutdown. +- The Capacity and Superior scheduler switchover is complete when the MRS cluster is used, while configuration synchronization is not complete. Configure synchronization again based on the new scheduler if necessary. diff --git a/umn/source/overview/functions/bootstrap_actions.rst b/umn/source/overview/functions/bootstrap_actions.rst new file mode 100644 index 0000000..5f1f9f1 --- /dev/null +++ b/umn/source/overview/functions/bootstrap_actions.rst @@ -0,0 +1,25 @@ +:original_name: mrs_08_0025.html + +.. 
_mrs_08_0025: + +Bootstrap Actions +================= + +Feature Introduction +-------------------- + +MRS provides standard elastic big data clusters on the cloud. Nine big data components, such as Hadoop and Spark, can be installed and deployed. Currently, standard cloud big data clusters cannot meet all user requirements, for example, in the following scenarios: + +- Common operating system configurations cannot meet data processing requirements, for example, increasing the maximum number of system connections. +- Software tools or running environments need to be installed, for example, Gradle and dependency R language package. +- Big data component packages need to be modified based on service requirements, for example, modifying the Hadoop or Spark installation package. +- Other big data components that are not supported by MRS need to be installed. + +To meet the preceding customization requirements, you can manually perform operations on the existing and newly added nodes. The overall process is complex and error-prone. In addition, manual operations cannot be traced, and data cannot be processed immediately after creating a cluster based on your demand. + +Therefore, MRS supports custom bootstrap actions that enable you to run scripts on a specified node before or after a cluster component is started. You can run bootstrap actions to install third-party software that is not supported by MRS, modify the cluster running environment, and perform other customizations. If you choose to run bootstrap actions when expanding a cluster, the bootstrap actions will be run on the newly added nodes in the same way. MRS runs the script you specify as user **root**. You can run the **su - xxx** command in the script to switch the user. + +Customer Benefits +----------------- + +You can use the custom bootstrap actions to flexibly and easily configure your dedicated clusters and customize software installation. diff --git a/umn/source/overview/functions/cluster_management/auto_scaling.rst b/umn/source/overview/functions/cluster_management/auto_scaling.rst new file mode 100644 index 0000000..c6be265 --- /dev/null +++ b/umn/source/overview/functions/cluster_management/auto_scaling.rst @@ -0,0 +1,41 @@ +:original_name: mrs_08_0022.html + +.. _mrs_08_0022: + +Auto Scaling +============ + +Feature Introduction +-------------------- + +More and more enterprises use technologies such as Spark and Hive to analyze data. Processing a large amount of data consumes huge resources and costs much. Typically, enterprises regularly analyze data in a fixed period of time every day rather than all day long. To meet enterprises' requirements, MRS provides the auto scaling function to apply for extra resources during peak hours and release resources during off-peak hours. This enables users to use resources on demand and focus on core business at lower costs. + +In big data applications, especially in periodic data analysis and processing scenarios, cluster computing resources need to be dynamically adjusted based on service data changes to meet service requirements. The auto scaling function of MRS enables clusters to be elastically scaled out or in based on cluster loads. In addition, if the data volume changes regularly and you want to scale out or in a cluster before the data volume changes, you can use the MRS resource plan feature. 
+ +MRS supports two types of auto scaling policies: auto scaling rules and resource plans + +- Auto scaling rules: You can increase or decrease Task nodes based on real-time cluster loads. Auto scaling will be triggered when the data volume changes but there may be some delay. +- Resource plans: If the data volume changes periodically, you can create resource plans to resize the cluster before the data volume changes, thereby avoiding a delay in increasing or decreasing resources. + +Both auto scaling rules and resource plans can trigger auto scaling. You can configure both of them or configure one of them. Configuring both resource plans and auto scaling rules improves the cluster node scalability to cope with occasionally unexpected data volume peaks. + +In some service scenarios, resources need to be reallocated or service logic needs to be modified after cluster scale-out or scale-in. If you manually scale out or scale in a cluster, you can log in to cluster nodes to reallocate resources or modify service logic. If you use auto scaling, MRS enables you to customize automation scripts for resource reallocation and service logic modification. Automation scripts can be executed before and after auto scaling and automatically adapt to service load changes, all of which eliminates manual operations. In addition, automation scripts can be fully customized and executed at various moments, which can meet your personalized requirements and improve auto scaling flexibility. + +Customer Benefits +----------------- + +MRS auto scaling provides the following benefits: + +- Reducing costs + + Enterprises do not analyze data all the time but perform a batch data analysis in a specified period of time, for example, 03:00 a.m. The batch analysis may take only two hours. + + The auto scaling function enables enterprises to add nodes for batch analysis and automatically releases the nodes after completion of the analysis, minimizing costs. + +- Meeting instant query requirements + + Enterprises usually encounter instant analysis tasks, for example, data reports for supporting enterprise decision-making. As a result, resource consumption increases sharply in a short period of time. With the auto scaling function, compute nodes can be added for emergent big data analysis, avoiding a service breakdown due to insufficient compute resources. In this way, you do not need to create extra resources. After the emergency ends, MRS automatically releases the nodes. + +- Focusing on core business + + It is difficult for developers to determine resource consumption on the big data secondary development platform because of complex query analysis conditions (such as global sorting, filtering, and merging) and data complexity, for example, uncertainty of incremental data. As a result, estimating the computing volume is difficult. MRS's auto scaling function enable developers to focus on service development without the need for resource estimation. diff --git a/umn/source/overview/functions/cluster_management/cluster_lifecycle_management.rst b/umn/source/overview/functions/cluster_management/cluster_lifecycle_management.rst new file mode 100644 index 0000000..60e46e6 --- /dev/null +++ b/umn/source/overview/functions/cluster_management/cluster_lifecycle_management.rst @@ -0,0 +1,45 @@ +:original_name: mrs_08_0053.html + +.. _mrs_08_0053: + +Cluster Lifecycle Management +============================ + +MRS supports cluster lifecycle management, including creating and terminating clusters. 
+ +- Creating a cluster: After you specify a cluster type, components, the number of nodes of each type, VM specifications, an AZ, a VPC, and authentication information, MRS automatically creates a cluster that meets the configuration requirements. You can run customized scripts in the cluster. In addition, you can create clusters of different types for multiple application scenarios, such as Hadoop analysis clusters, HBase clusters, and Kafka clusters. The big data platform supports heterogeneous cluster deployment. That is, VMs of different specifications can be combined in a cluster based on CPU types, disk capacities, disk types, and memory sizes. +- Terminating a cluster: You can terminate a cluster that is no longer needed (including the data and configurations in the cluster). MRS will delete all resources related to the cluster. + +Creating a Cluster +------------------ + +On the MRS management console, you can create an MRS cluster. You can select a region and cloud resource specifications to create an MRS cluster that is suitable for enterprise services in one click. MRS automatically installs and deploys the enterprise-level big data platform and optimizes parameters based on the selected cluster type, version, and node specifications. + +MRS provides you with fully managed big data clusters. When creating a cluster, you can set a VM login mode (password or key pair). You can use all resources of the created MRS cluster. In addition, MRS allows you to deploy a big data cluster on only two ECSs with 4 vCPUs and 8 GB memory, providing more flexible choices for testing and development. + +MRS clusters are classified into analysis, streaming, and hybrid clusters. + +- Analysis cluster: is used for offline data analysis and provides Hadoop components. +- Streaming cluster: is used for streaming tasks and provides stream processing components. +- Hybrid cluster: is used for both offline data analysis and streaming processing, and provides Hadoop components and stream processing components. +- Custom: You can flexibly combine required components (MRS 3.x and later versions) based on service requirements. + +MRS cluster nodes are classified into Master, Core, and Task nodes. + +- Master node: the management node in a cluster. The master processes of the distributed systems, Manager, and databases are deployed on Master nodes. Master nodes cannot be scaled out. The processing capability of Master nodes determines the upper limit of the management capability of the entire cluster. MRS supports scale-up of Master node specifications to support management of a larger cluster. +- Core node: used for both storage and computing and can be scaled in or out. Because Core nodes store data, there are many restrictions on scale-in to prevent data loss, and auto scaling cannot be performed on them. +- Task node: used for computing only and can be scaled in or out. Task nodes bear only computing tasks, so auto scaling can be performed on them. + +You can create a cluster in two modes: custom config and quick config. + +- **Custom config**: On the **Custom Config** page, you can flexibly configure cluster parameters based on application scenarios, such as ECS specifications, to better suit your service requirements. +- **Quick config**: On the **Quick Config** page, you can quickly create a cluster based on application scenarios, improving cluster configuration efficiency.
Currently, Hadoop analysis clusters, HBase clusters, and Kafka clusters are available for your quick creation. + + - Hadoop analysis cluster: uses components in the open-source Hadoop ecosystem to analyze and query vast amounts of data. For example, use Yarn to manage cluster resources, Hive and Spark to provide offline storage and computing of large-scale distributed data, Spark Streaming and Flink to offer streaming data computing, and Presto to enable interactive queries, and Tez to provide a distributed computing framework of directed acyclic graphs (DAGs). + - HBase cluster: uses Hadoop and HBase components to provide a column-oriented distributed cloud storage system featuring enhanced reliability, excellent performance, and elastic scalability. It applies to the storage and distributed computing of massive amounts of data. You can use HBase to build a storage system capable of storing TB- or even PB-level data. With HBase, you can filter and analyze data with ease and get responses in milliseconds, rapidly mining data value. + - Kafka cluster: uses Kafka and Storm to provide an open source message system with high throughput and scalability. It is widely used in scenarios such as log collection and monitoring data aggregation to implement efficient streaming data collection and real-time data processing and storage. + +Terminating a Cluster +--------------------- + +MRS allows you to terminate a cluster when it is no longer needed. After the cluster is terminated, all cloud resources used by the cluster will be released. Before terminating a cluster, you are advised to migrate or back up data. Terminate the cluster only when no service is running in the cluster or the cluster is abnormal and cannot provide services based on O&M analysis. If data is stored on EVS disks or pass-through disks in a big data cluster, the data will be deleted after the cluster is terminated. Therefore, exercise caution when terminating a cluster. diff --git a/umn/source/overview/functions/cluster_management/index.rst b/umn/source/overview/functions/cluster_management/index.rst new file mode 100644 index 0000000..f4adcf9 --- /dev/null +++ b/umn/source/overview/functions/cluster_management/index.rst @@ -0,0 +1,24 @@ +:original_name: mrs_08_0048.html + +.. _mrs_08_0048: + +Cluster Management +================== + +- :ref:`Cluster Lifecycle Management ` +- :ref:`Manually Scale Out/In a Cluster ` +- :ref:`Auto Scaling ` +- :ref:`Task Node Creation ` +- :ref:`Isolating a Host ` +- :ref:`Managing Tags ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + cluster_lifecycle_management + manually_scale_out_in_a_cluster + auto_scaling + task_node_creation + isolating_a_host + managing_tags diff --git a/umn/source/overview/functions/cluster_management/isolating_a_host.rst b/umn/source/overview/functions/cluster_management/isolating_a_host.rst new file mode 100644 index 0000000..6a53c69 --- /dev/null +++ b/umn/source/overview/functions/cluster_management/isolating_a_host.rst @@ -0,0 +1,10 @@ +:original_name: mrs_08_0056.html + +.. _mrs_08_0056: + +Isolating a Host +================ + +When detecting that a host is abnormal or faulty and cannot provide services or affects cluster performance, you can exclude the host from the available nodes in the cluster temporarily so that the client can access other available nodes. In scenarios where patches are to be installed in a cluster, you can also exclude a specified node from patch installation. Only non-management nodes can be isolated. 
+ +After a host is isolated, all role instances on the host will be stopped, and you cannot start, stop, or configure the host and all instances on the host. In addition, after a host is isolated, statistics about the monitoring status and metric data of hardware and instances on the host cannot be collected or displayed. diff --git a/umn/source/overview/functions/cluster_management/managing_tags.rst b/umn/source/overview/functions/cluster_management/managing_tags.rst new file mode 100644 index 0000000..0a28ef8 --- /dev/null +++ b/umn/source/overview/functions/cluster_management/managing_tags.rst @@ -0,0 +1,10 @@ +:original_name: mrs_08_0057.html + +.. _mrs_08_0057: + +Managing Tags +============= + +Tags are cluster identifiers. Adding tags to clusters can help you identify and manage your cluster resources. By associating with Tag Management Service (TMS), MRS allows users with a large number of cloud resources to tag cloud resources, quickly search for cloud resources with the same tag attribute, and perform unified management operations such as review, modification, and deletion, facilitating unified management of big data clusters and other cloud resources. + +You can add a maximum of 10 tags to a cluster when creating the cluster or add them on the details page of the created cluster. diff --git a/umn/source/overview/functions/cluster_management/manually_scale_out_in_a_cluster.rst b/umn/source/overview/functions/cluster_management/manually_scale_out_in_a_cluster.rst new file mode 100644 index 0000000..88e8615 --- /dev/null +++ b/umn/source/overview/functions/cluster_management/manually_scale_out_in_a_cluster.rst @@ -0,0 +1,22 @@ +:original_name: mrs_08_0054.html + +.. _mrs_08_0054: + +Manually Scale Out/In a Cluster +=============================== + +The processing capability of a big data cluster can be horizontally expanded by adding nodes. If the cluster scale does not meet service requirements, you can manually scale out or scale in the cluster. MRS intelligently selects the node with the least load or the minimum amount of data to be migrated for scale-in. The node to be scaled in will not receive new tasks, and continues to execute the existing tasks. At the same time, MRS copies its data to other nodes and the node is decommissioned. If the tasks on the node cannot be completed after a long time, MRS migrates the tasks to other nodes, minimizing the impact on cluster services. + +Scaling Out a Cluster +--------------------- + +Currently, you can add Core or Task nodes to scale out a cluster to handle peak service loads. The capacity expansion of an MRS cluster node does not affect the services of the existing cluster. For details about data skew caused by capacity expansion, see :ref:`How Do I Balance HDFS Data? ` to rectify the fault. + +Scaling In a Cluster +-------------------- + +You can reduce the number of Core or Task nodes to scale in a cluster so that MRS delivers better storage and computing capabilities at lower O&M costs based on service requirements. After you scale in an MRS cluster, MRS automatically selects nodes that can be scaled in based on the type of services installed on the nodes. + +During the scale-in of Core nodes, data on the original nodes is migrated. If the data location is cached, the client automatically updates the location information, which may affect the latency. Node scale-in may affect the response duration of the first access to some HBase on HDFS data. You can restart HBase or disable or enable related tables to avoid this problem. 
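+ +For example, a minimal sketch of refreshing region locations from the HBase shell after a scale-in, assuming a hypothetical table named **test_table**: disabling and then re-enabling the table forces clients to look up the new region locations. + +.. code-block:: + + # Run in the HBase shell; 'test_table' is a placeholder table name. + disable 'test_table' + enable 'test_table'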
+ +Task nodes do not store cluster data. They are compute nodes and do not involve migration of data on the nodes. diff --git a/umn/source/overview/functions/cluster_management/task_node_creation.rst b/umn/source/overview/functions/cluster_management/task_node_creation.rst new file mode 100644 index 0000000..af050c9 --- /dev/null +++ b/umn/source/overview/functions/cluster_management/task_node_creation.rst @@ -0,0 +1,24 @@ +:original_name: mrs_08_0023.html + +.. _mrs_08_0023: + +Task Node Creation +================== + +Feature Introduction +-------------------- + +Task nodes can be created and used for computing only. They do not store persistent data and are the basis for implementing auto scaling. + +Customer Benefits +----------------- + +When MRS is used only as a computing resource, Task nodes can be used to reduce costs and facilitate cluster node scaling, flexibly meeting users' requirements for increasing or decreasing cluster computing capabilities. + +Application Scenarios +--------------------- + +When the data volume change is small in a cluster but the cluster's service processing capabilities need to be remarkably and temporarily improved, add Task nodes to address the following situations: + +- The number of temporary services is increased, for example, report processing at the end of the year. +- Long-term tasks need to be completed in a short time, for example, some urgent analysis tasks. diff --git a/umn/source/overview/functions/cluster_o&m.rst b/umn/source/overview/functions/cluster_o&m.rst new file mode 100644 index 0000000..040bdc0 --- /dev/null +++ b/umn/source/overview/functions/cluster_o&m.rst @@ -0,0 +1,37 @@ +:original_name: mrs_08_0049.html + +.. _mrs_08_0049: + +Cluster O&M +=========== + +Alarm Management +---------------- + +MRS can monitor big data clusters in real time and identify system health status based on alarms and events. In addition, MRS allows you to customize monitoring and alarm thresholds to focus on the health status of each metric. When monitoring data reaches the alarm threshold, the system triggers an alarm. + +MRS can also interconnect with the message service system of the Simple Message Notification (SMN) service to push alarm information to users by SMS message or email. For details, see :ref:`Message Notification `. + +Patch Management +---------------- + +MRS supports cluster patching operations and will release patches for open source big data components in a timely manner. On the MRS cluster management page, you can view patch release information related to running clusters, including the detailed description of the resolved issues and impacts. You can determine whether to install a patch based on the service running status. One-click patch installation involves no manual intervention, and will not cause service interruption through rolling installation, ensuring long-term availability of the clusters. + +MRS can display the detailed patch installation process. Patch management also supports patch uninstallation and rollback. + +.. note:: + + MRS 3.x or later does not support patch management on the management console. + +O&M Support +----------- + +Cluster resources provided by MRS belong to users. Generally, when O&M personnel's support is required for troubleshooting of a cluster, O&M personnel cannot directly access the cluster. 
To better serve customers, MRS provides the following two methods to improve communication efficiency during fault locating: + +- Log sharing: You can initiate log sharing on the MRS management console to share a specified log scope with O&M personnel, so that O&M personnel can locate faults without accessing the cluster. +- O&M authorization: If a problem occurs when you use an MRS cluster, you can initiate O&M authorization on the MRS management console. O&M personnel can help you quickly locate the problem, and you can revoke the authorization at any time. + +Health Check +------------ + +MRS provides automatic inspection on system running environments for you to check and audit system running health status in one click, ensuring proper system running and lowering system operation and maintenance costs. After viewing inspection results, you can export reports for archiving and fault analysis. diff --git a/umn/source/overview/functions/easy_access_to_web_uis_of_components.rst b/umn/source/overview/functions/easy_access_to_web_uis_of_components.rst new file mode 100644 index 0000000..6e9f054 --- /dev/null +++ b/umn/source/overview/functions/easy_access_to_web_uis_of_components.rst @@ -0,0 +1,10 @@ +:original_name: mrs_08_0044.html + +.. _mrs_08_0044: + +Easy Access to Web UIs of Components +==================================== + +Big data components have their own web UIs to manage their own systems. However, you cannot easily access the web UIs due to network isolation. For example, to access the HDFS web UI, you need to create an ECS to remotely log in to the web UI. This makes the UI access complex and unfriendly. + +MRS provides an EIP-based secure channel for you to easily access the web UIs of components. This is more convenient than binding an EIP by yourself, and you can access the web UIs with a few clicks, avoiding the steps for logging in to a VPC, adding security group rules, and obtaining a public IP address. For the Hadoop, Spark, HBase, and Hue components in analysis clusters and the Storm component in streaming clusters, you can quickly access their web UIs from the entries on Manager. diff --git a/umn/source/overview/functions/index.rst b/umn/source/overview/functions/index.rst new file mode 100644 index 0000000..858c08b --- /dev/null +++ b/umn/source/overview/functions/index.rst @@ -0,0 +1,32 @@ +:original_name: mrs_08_0006.html + +.. _mrs_08_0006: + +Functions +========= + +- :ref:`Multi-tenant ` +- :ref:`Security Hardening ` +- :ref:`Easy Access to Web UIs of Components ` +- :ref:`Reliability Enhancement ` +- :ref:`Job Management ` +- :ref:`Bootstrap Actions ` +- :ref:`Metadata ` +- :ref:`Cluster Management ` +- :ref:`Cluster O&M ` +- :ref:`Message Notification ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + multi-tenant + security_hardening + easy_access_to_web_uis_of_components + reliability_enhancement + job_management + bootstrap_actions + metadata + cluster_management/index + cluster_o&m + message_notification diff --git a/umn/source/overview/functions/job_management.rst b/umn/source/overview/functions/job_management.rst new file mode 100644 index 0000000..6aa36a6 --- /dev/null +++ b/umn/source/overview/functions/job_management.rst @@ -0,0 +1,10 @@ +:original_name: mrs_08_0046.html + +.. _mrs_08_0046: + +Job Management +============== + +The job management function provides an entry for you to submit jobs in a cluster, including MapReduce, Spark, HiveQL, and SparkSQL jobs. 
MRS works with Data Lake Governance Center (DGC) to provide a one-stop big data collaboration development environment and fully-managed big data scheduling capabilities, helping you effortlessly build big data processing centers. + +DGC allows you to develop and debug MRS HiveQL/SparkSQL scripts online and develop MRS jobs by performing drag-and-drop operations to migrate and integrate data between MRS and over 20 heterogeneous data sources. Powerful job scheduling and flexible monitoring and alarming help you easily manage data and job O&M. diff --git a/umn/source/overview/functions/message_notification.rst b/umn/source/overview/functions/message_notification.rst new file mode 100644 index 0000000..3192d86 --- /dev/null +++ b/umn/source/overview/functions/message_notification.rst @@ -0,0 +1,35 @@ +:original_name: mrs_08_0024.html + +.. _mrs_08_0024: + +Message Notification +==================== + +Feature Introduction +-------------------- + +The following operations are often performed during the running of a big data cluster: + +- Big data clusters often change, for example, cluster scale-out and scale-in. +- When a service data volume changes abruptly, auto scaling will be triggered. +- After related services are stopped, a big data cluster needs to be stopped. + +To immediately notify you of successful operations, cluster unavailability, and node faults, MRS uses Simple Message Notification (SMN) to send notifications to you through SMS and emails, facilitating maintenance. + +Customer Benefits +----------------- + +After configuring SMN, you can receive MRS cluster health status, updates, and component alarms through SMS or emails in real time. MRS sends real-time monitoring and alarm notification to help you easily perform O&M and efficiently deploy big data services. + +Feature Description +------------------- + +MRS uses SMN to provide one-to-multiple message subscription and notification over a variety of protocols. + +You can create a topic and configure topic policies to control publisher and subscriber permissions on the topic. MRS sends cluster messages to the topic to which you have permission to publish messages. Then, all subscribers who subscribe to the topic can receive cluster updates and component alarms through SMS and emails. + + +.. figure:: /_static/images/en-us_image_0000001296750222.png + :alt: **Figure 1** Implementation process + + **Figure 1** Implementation process diff --git a/umn/source/overview/functions/metadata.rst b/umn/source/overview/functions/metadata.rst new file mode 100644 index 0000000..dcd5709 --- /dev/null +++ b/umn/source/overview/functions/metadata.rst @@ -0,0 +1,17 @@ +:original_name: mrs_08_0075.html + +.. _mrs_08_0075: + +Metadata +======== + +MRS provides multiple metadata storage methods. When deploying Hive and Ranger during MRS cluster creation, select one of the following storage modes as required: + +- **Local**: Metadata is stored in the local GaussDB of a cluster. When the cluster is deleted, the metadata is also deleted. To retain the metadata, manually back up the metadata in the database in advance. +- **Data Connection**: Metadata is stored in the associated PostgreSQL or MySQL database of the RDS service in the same VPC and subnet as the current cluster. When the cluster is terminated, the metadata is not deleted. Multiple MRS clusters can share the metadata. + +.. note:: + + Hive in MRS 1.9.\ *x* or later allows you to specify a metadata storage method. 
+ + Ranger in MRS 1.9.\ *x* allows metadata to be stored only in the associated MySQL database of the RDS service. diff --git a/umn/source/overview/functions/multi-tenant.rst b/umn/source/overview/functions/multi-tenant.rst new file mode 100644 index 0000000..783e070 --- /dev/null +++ b/umn/source/overview/functions/multi-tenant.rst @@ -0,0 +1,65 @@ +:original_name: mrs_08_0042.html + +.. _mrs_08_0042: + +Multi-tenant +============ + +Feature Introduction +-------------------- + +Modern enterprises' data clusters are developing towards centralization and cloudification. Enterprise-class big data clusters must meet the following requirements: + +- Carry data of different types and formats and run jobs and applications of different types (analysis, query, and stream processing). +- Isolate data of a user from that of another user who has demanding requirements on data security, such as a bank or government institute. + +The preceding requirements bring the following challenges to the big data cluster: + +- Proper allocation and scheduling of resources to ensure stable operating of applications and jobs +- Strict access control to ensure data and service security + +Multi-tenant isolates the resources of a big data cluster into resource sets. Users can lease desired resource sets to run applications and jobs and store data. In a big data cluster, multiple resource sets can be deployed to meet diverse requirements of multiple users. + +The MRS big data cluster provides a complete enterprise-class big data multi-tenant solution. Multi-tenant is a collection of multiple resources (each resource set is a tenant) in an MRS big data cluster. It can allocate and schedule resources, including computing and storage resources. + +Advantages +---------- + +- Proper resource configuration and isolation + + The resources of a tenant are isolated from those of another tenant. The resource use of a tenant does not affect other tenants. This mechanism ensures that each tenant can configure resources based on service requirements, improving resource utilization. + +- Resource consumption measurement and statistics + + Tenants are system resource applicants and consumers. System resources are planned and allocated based on tenants. Resource consumption by tenants can be measured and recorded. + +- Ensured data security and access security + + In multi-tenant scenarios, the data of each tenant is stored separately to ensure data security. The access to tenants' resources is controlled to ensure access security. + +Enhanced Schedulers +------------------- + +Schedulers are divided into the open source Capacity scheduler and proprietary Superior scheduler. + +To meet enterprise requirements and tackle challenges facing the Yarn community in scheduling, develops the Superior scheduler. In addition to inheriting the advantages of the Capacity scheduler and Fair scheduler, this scheduler is enhanced in the following aspects: + +- Enhanced resource sharing policy + + The Superior scheduler supports queue hierarchy. It integrates the functions of open source schedulers and shares resources based on configurable policies. In terms of instances, administrators can use the Superior scheduler to configure an absolute value or percentage policy for queue resources. The resource sharing policy of the Superior scheduler enhances the label scheduling policy of Yarn as a resource pool feature. 
The nodes in the Yarn cluster can be grouped based on the capacity or service type to ensure that queues can more efficiently utilize resources. + +- Tenant-based resource reservation policy + + Resources required by tenants must be ensured for running critical tasks. The Superior scheduler builds a mechanism to support the resource reservation policy. By doing so, reserved resources can be allocated to the tasks run by the tenant queues in a timely manner to ensure proper task execution. + +- Fair sharing among tenants and resource pool users + + The Superior scheduler allows shared resources to be configured for users in a queue. Each tenant may have users with different weights. Heavily weighted users may require more shared resources. + +- Ensured scheduling performance in a big cluster + + The Superior scheduler receives heartbeats from each NodeManager and saves resource information in memory, which enables the scheduler to control cluster resource usage globally. The Superior scheduler uses the push scheduling model, which makes the scheduling more precise and efficient and remarkably improves cluster resource utilization. Additionally, the Superior scheduler delivers excellent performance when the interval between NodeManager heartbeats is long and prevents heartbeat storms in big clusters. + +- Priority policy + + If the minimum resource requirement of a service cannot be met after the service obtains all available resources, a preemption occurs. The preemption function is disabled by default. diff --git a/umn/source/overview/functions/reliability_enhancement.rst b/umn/source/overview/functions/reliability_enhancement.rst new file mode 100644 index 0000000..6844830 --- /dev/null +++ b/umn/source/overview/functions/reliability_enhancement.rst @@ -0,0 +1,65 @@ +:original_name: mrs_08_0045.html + +.. _mrs_08_0045: + +Reliability Enhancement +======================= + +Based on Apache Hadoop open source software, MRS optimizes and improves the reliability and performance of main service components. + +System Reliability +------------------ + +- HA for all management nodes + + In the Hadoop open source version, data and compute nodes are managed in a distributed system, in which a single point of failure (SPOF) does not affect the operation of the entire system. However, a SPOF may occur on management nodes running in centralized mode, which becomes the weakness of the overall system reliability. + + MRS provides similar double-node mechanisms for all management nodes of the service components, such as Manager, HDFS NameNodes, HiveServers, HBase HMasters, Yarn ResourceManagers, KerberosServers, and LdapServers. All of them are deployed in active/standby mode or configured with load sharing, effectively preventing SPOFs from affecting system reliability. + +- Reliability guarantee in case of exceptions + + By reliability analysis, the following measures to handle software and hardware exceptions are provided to improve the system reliability: + + - After power supply is restored, services are running properly regardless of a power failure of a single node or the whole cluster, ensuring data reliability in case of unexpected power failures. Key data will not be lost unless the hard disk is damaged. + - Health status checks and fault handling of the hard disk do not affect services. + - The file system faults can be automatically handled, and affected services can be automatically restored. 
diff --git a/umn/source/overview/functions/reliability_enhancement.rst b/umn/source/overview/functions/reliability_enhancement.rst new file mode 100644 index 0000000..6844830 --- /dev/null +++ b/umn/source/overview/functions/reliability_enhancement.rst @@ -0,0 +1,65 @@ +:original_name: mrs_08_0045.html + +.. _mrs_08_0045: + +Reliability Enhancement +======================= + +Based on Apache Hadoop open source software, MRS optimizes and improves the reliability and performance of main service components. + +System Reliability +------------------ + +- HA for all management nodes + + In the Hadoop open source version, data and compute nodes are managed in a distributed system, in which a single point of failure (SPOF) does not affect the operation of the entire system. However, a SPOF may occur on management nodes running in centralized mode, which weakens the overall system reliability. + + MRS provides similar double-node mechanisms for all management nodes of the service components, such as Manager, HDFS NameNodes, HiveServers, HBase HMasters, Yarn ResourceManagers, KerberosServers, and LdapServers. All of them are deployed in active/standby mode or configured with load sharing, effectively preventing SPOFs from affecting system reliability. + +- Reliability guarantee in case of exceptions + + Based on reliability analysis, the following measures are provided to handle software and hardware exceptions and improve system reliability: + + - After the power supply is restored, services run properly regardless of whether a single node or the whole cluster was powered off, ensuring data reliability in case of unexpected power failures. Key data will not be lost unless the hard disk is damaged. + - Health status checks and fault handling of the hard disk do not affect services. + - File system faults can be automatically handled, and affected services can be automatically restored.
+ - Process and node faults can be automatically handled, and affected services can be automatically restored. + - Network faults can be automatically handled, and affected services can be automatically restored. + +- Data backup and restoration + + MRS provides full backup, incremental backup, and restoration functions based on service requirements, preventing data loss or damage from affecting services and ensuring fast system restoration in case of exceptions. + + - Automatic backup + + MRS provides automatic backup for data on Manager. Based on the customized backup policy, data on clusters, including LdapServer and DBService data, can be automatically backed up. + + - Manual backup + + You can also manually back up data of the cluster management system before capacity expansion and patch installation to recover the cluster management system functions upon faults. + + To further improve system reliability, data on Manager and HBase can also be manually backed up to a third-party server. + +Node Reliability +---------------- + +- OS health status monitoring + + MRS periodically collects OS hardware resource usage data, including usage of CPUs, memory, hard disks, and network resources. + +- Process health status monitoring + + MRS checks the status of service instances and health indicators of service instance processes, enabling you to know the health status of processes in a timely manner. + +- Automatic disk troubleshooting + + MRS is enhanced based on the open source version. It can monitor the status of hardware and file systems on all nodes. If an exception occurs, the corresponding partitions will be removed from the storage pool. If a faulty disk is replaced, the new disk will be added for running services. In this case, maintenance operations are simplified, and replacement of faulty disks can be completed online. In addition, users can set hot backup disks to reduce the faulty disk restoration time and improve system reliability. + +- LVM configuration for node disks + + MRS allows you to configure Logical Volume Management (LVM) to plan multiple disks as a logical volume group. Configuring LVM can avoid uneven usage of disks. This is especially important for components that can use multiple disks, such as HDFS and Kafka. In addition, LVM supports disk capacity expansion without re-attaching disks, preventing service interruption. + +Data Reliability +---------------- + +MRS can use the anti-affinity node groups and placement group capabilities provided by ECS and the rack awareness capability of Hadoop to redundantly distribute data to multiple physical host machines, preventing data loss caused by physical hardware failures. diff --git a/umn/source/overview/functions/security_hardening.rst b/umn/source/overview/functions/security_hardening.rst new file mode 100644 index 0000000..ec8f9df --- /dev/null +++ b/umn/source/overview/functions/security_hardening.rst @@ -0,0 +1,68 @@ +:original_name: mrs_08_0043.html + +.. _mrs_08_0043: + +Security Hardening +================== + +MRS is a platform for massive data management and analysis and provides high security. MRS protects user data and service operation in the following aspects: + +- Network isolation + + The entire system is deployed in a VPC on the public cloud to provide an isolated network environment and ensure service and management security of the cluster.
By combining the subnet division, route control, and security group functions of VPC, MRS provides a secure and reliable isolated network environment. + +- Resource isolation + + MRS supports resource deployment and isolation of physical resources in dedicated zones. You can flexibly combine computing and storage resources, such as dedicated computing resources + shared storage resources, shared computing resources + dedicated storage resources, and dedicated computing resources + dedicated storage resources. + +- Host security + + MRS can be integrated with public cloud security services, including Vulnerability Scan Service (VSS), Host Security Service (HSS), Web Application Firewall (WAF), Cloud Bastion Host (CBH), and Web Tamper Protection (WTP). The following measures are provided to improve security of the OS and ports: + + - Security hardening of OS kernels + - OS patch update + - OS permission control + - OS port management + - OS protocol and port attack defense + +- Application security + + The following measures are used to ensure normal running of big data services: + + - Identification and authentication + - Web application security + - Access control + - Audit security + - Password security + +- Data security + + The following measures are provided to ensure the confidentiality, integrity, and availability of massive amounts of user data: + + - Disaster recovery: MRS supports data backup to OBS and cross-region high reliability. + - Backup: MRS supports backup of DBService, NameNode, and LDAP metadata and backup of HDFS and HBase service data. + +- Data integrity + + Data is verified to ensure its integrity during storage and transmission. + + - CRC32C is used by default to verify the correctness of user data stored in HDFS. + - DataNodes of HDFS store the verified data. If the data transmitted from a client is abnormal (incomplete), DataNodes report the abnormality to the client, and the client rewrites the data. + - The client checks data integrity when reading data from a DataNode. If the data is incomplete, the client will read data from another DataNode. + +- Data confidentiality + + Based on Apache Hadoop, the distributed file system of MRS supports encrypted storage of files to prevent sensitive data from being stored in plaintext, improving data security. Applications need only to encrypt specified sensitive data. Services are not affected during the encryption process. Based on file system data encryption, Hive provides table-level encryption and HBase provides column family-level encryption. Sensitive data can be encrypted and stored after you specify an encryption algorithm during table creation. + + Encrypted storage and access control of data are used to ensure user data security. + + - HBase stores service data to the HDFS after compression. Users can configure the AES and SMS4 encryption algorithm to encrypt data. + - All the components allow access permissions to be set for local data directories. Unauthorized users are not allowed to access data. + - All cluster user information is stored in ciphertext. + +- Security authentication + + - Uses a unified user- and role-based authentication system as well as an account- and role-based access control (RBAC) model to centrally control user permissions and batch manage user authorization. + - Employs Lightweight Directory Access Protocol (LDAP) as an account management system and performs the Kerberos authentication on accounts. 
+ - Provides the single sign-on (SSO) function that centrally manages and authenticates MRS system and component users. + - Audits users who have logged in to Manager. diff --git a/umn/source/overview/index.rst b/umn/source/overview/index.rst new file mode 100644 index 0000000..1160965 --- /dev/null +++ b/umn/source/overview/index.rst @@ -0,0 +1,24 @@ +:original_name: en-us_topic_0000001295768772.html + +.. _en-us_topic_0000001295768772: + +Overview +======== + +- :ref:`What Is MRS? ` +- :ref:`Application Scenarios ` +- :ref:`Components ` +- :ref:`Functions ` +- :ref:`Constraints ` +- :ref:`Related Services ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + what_is_mrs + application_scenarios + components/index + functions/index + constraints + related_services diff --git a/umn/source/overview/related_services.rst b/umn/source/overview/related_services.rst new file mode 100644 index 0000000..7041b48 --- /dev/null +++ b/umn/source/overview/related_services.rst @@ -0,0 +1,49 @@ +:original_name: mrs_08_0026.html + +.. _mrs_08_0026: + +Related Services +================ + +Relationships with Other Services +--------------------------------- + +.. table:: **Table 1** Relationships with other services + + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+ + | Service | Relationships | + +======================================+===========================================================================================================================================+ + | Virtual Private Cloud (VPC) | MRS clusters are created in the subnets of a VPC. VPCs provide a secure, isolated, and logical network environment for your MRS clusters. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+ + | Object Storage Service (OBS) | OBS stores the following user data: | + | | | + | | - MRS job input data, such as user programs and data files | + | | - MRS job output data, such as result files and log files of jobs | + | | | + | | In MRS clusters, HDFS, Hive, MapReduce, YARN, Spark, Flume, and Loader can import or export data from OBS. | + | | | + | | MRS uses the parallel file system of OBS to provide services. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+ + | Elastic Cloud Server (ECS) | MRS uses elastic cloud servers (ECSs) as cluster nodes. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+ + | Relational Database Service (RDS) | RDS stores MRS system running data, including MRS cluster metadata. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+ + | Identity and Access Management (IAM) | IAM provides authentication for MRS. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+ + | Simple Message Notification (SMN) | MRS uses SMN to provide one-to-multiple message subscription and notification over a variety of protocols. 
| + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+ + | Cloud Trace Service (CTS) | CTS provides you with operation records of MRS resource operation requests and request results for querying, auditing, and backtracking. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+ + +.. table:: **Table 2** MRS operations recorded by CTS + + =================== ============= =============== + Operation Resource Type Trace Name + =================== ============= =============== + Creating a cluster cluster_mrs createCluster + Deleting a cluster cluster_mrs deleteCluster + Expanding a cluster cluster_mrs scaleOutCluster + Shrinking a cluster cluster_mrs scaleInCluster + =================== ============= =============== + +After you enable CTS, the system starts recording operations on cloud resources. You can view operation records of the last 7 days on the CTS management console. For details, see **Cloud Trace Service** > **Getting Started** > **Querying Real-Time Traces**. diff --git a/umn/source/overview/what_is_mrs.rst b/umn/source/overview/what_is_mrs.rst new file mode 100644 index 0000000..23b6e52 --- /dev/null +++ b/umn/source/overview/what_is_mrs.rst @@ -0,0 +1,90 @@ +:original_name: mrs_08_0001.html + +.. _mrs_08_0001: + +What Is MRS? +============ + +Big data is a huge challenge facing the Internet era as the data volume and types increase rapidly. Conventional data processing technologies, such as single-node storage and relational databases, are unable to solve the emerging big data problems. In this case, the Apache Software Foundation (ASF) has launched an open source Hadoop big data processing solution. Hadoop is an open source distributed computing platform that can fully utilize computing and storage capabilities of clusters to process massive amounts of data. If enterprises deploy Hadoop systems by themselves, the disadvantages include high costs, long deployment period, difficult maintenance, and inflexible use. + +To solve the preceding problems, the cloud provides MapReduce Service (MRS) for managing the Hadoop system. With MRS, you can deploy a Hadoop cluster in just one click. MRS provides enterprise-level big data clusters on the cloud. Tenants can fully control clusters and easily run big data components such as Storm, Hadoop, Spark, HBase, and Kafka. MRS is fully compatible with open source APIs, and incorporates advantages of the cloud computing and storage and big data industry experience to provide customers with a full-stack big data platform featuring high performance, low cost, flexibility, and ease-of-use. In addition, the platform can be customized based on service requirements to help enterprises quickly build a massive data processing system and discover new value points and business opportunities by analyzing and mining massive amounts of data in real time or in non-real time. + +Product Architecture +-------------------- + +:ref:`Figure 1 ` shows the MRS logical architecture. + +.. note:: + + MRS 3.x or later does not support patch management on the management console. + +.. _mrs_08_0001__fig1150416195911: + +.. 
figure:: /_static/images/en-us_image_0000001441155405.png + :alt: **Figure 1** MRS architecture + + **Figure 1** MRS architecture + +The MRS architecture includes the infrastructure and big data processing phases. + +- Infrastructure + + MRS big data clusters are built on Elastic Cloud Servers (ECSs) and make full use of the high reliability and security capabilities of the virtualization layer. + + - A Virtual Private Cloud (VPC) is a virtual internal network provided for each tenant. It is isolated from other networks by default. + - Elastic Volume Service (EVS) provides highly reliable and high-performance storage. + - ECS provides scalable VMs, and works with VPCs, security groups, and the EVS multi-replica mechanism to build an efficient, reliable, and secure computing environment. + +- Data collection + + The data collection layer provides the capability of importing data from various data sources, such as Flume (data ingestion), Loader (relational data import), and Kafka (highly reliable message queue), to MRS big data clusters. Alternatively, you can use Cloud Data Migration (CDM) to import external data to MRS clusters. + +- Data storage + + MRS clusters can store structured and unstructured data, and support multiple efficient formats to meet the requirements of different computing engines. + + - HDFS is a general-purpose distributed file system on a big data platform. + - OBS is an object storage service that features high availability and low cost. + - HBase supports data storage with indexes, and is applicable to high-performance index-based query scenarios. + +- Data convergence processing + + - MRS provides multiple mainstream compute engines, including MapReduce (batch processing), Tez (DAG model), Spark (in-memory computing), Spark Streaming (micro-batch stream computing), Storm (stream computing), and Flink (stream computing), to convert data structures and logic into data models that meet service requirements in a variety of big data application scenarios. + - Based on the preset data model and easy-to-use SQL data analysis, users can select Hive (data warehouse), SparkSQL, and Presto (interactive query engine). + +- Data display and scheduling + + This layer displays data analysis results and integrates with Data Lake Governance Center (DGC) to provide a one-stop big data collaborative development platform. It helps you easily complete tasks such as data modeling, data integration, script development, job scheduling, and O&M monitoring, making big data more accessible than ever before and helping you effortlessly build big data processing centers. + +- Cluster management + + All components of the Hadoop-based big data ecosystem are deployed in distributed mode, and their deployment, management, and O&M are complex. + + MRS provides a unified O&M management platform for cluster management, supporting one-click cluster deployment, multi-version selection, as well as manual scaling and auto scaling of clusters without service interruption. In addition, MRS provides job management, resource tag management, and O&M of the preceding data processing components at each layer. It also provides one-stop O&M capabilities, covering monitoring, alarm reporting, configuration, and patch upgrade. + +Product Advantages +------------------ + +MRS has a powerful Hadoop kernel team and is built on the enterprise-level FusionInsight big data platform. MRS has been deployed on tens of thousands of nodes and can ensure Service Level Agreements (SLAs) for multi-level users.
+ +MRS has the following advantages: + +- High performance + + MRS supports self-developed CarbonData storage technology. CarbonData is a high-performance big data storage solution. It allows one data set to apply to multiple scenarios and supports features, such as multi-level indexing, dictionary encoding, pre-aggregation, dynamic partitioning, and quasi-real-time data query. This improves I/O scanning and computing performance and returns analysis results of tens of billions of data records in seconds. In addition, MRS supports self-developed enhanced scheduler Superior, which breaks the scale bottleneck of a single cluster and is capable of scheduling over 10,000 nodes in a cluster. + +- Cost-effectiveness + + Based on diversified cloud infrastructure, MRS provides various computing and storage choices and separates computing from storage, delivering cost-effective massive data storage solutions. MRS supports auto scaling to address peak and off-peak service loads, releasing idle resources on the big data platform for customers. MRS clusters can be created and scaled out when you need them, and can be terminated or scaled in after you use them, minimizing cost. + +- High security + + MRS delivers enterprise-level big data multi-tenant permissions management and security management to support table-based and column-based access control and data encryption. + +- Easy O&M + + MRS provides a visualized big data cluster management platform, improving O&M efficiency. MRS supports rolling patch upgrade and provides visualized patch release information and one-click patch installation without manual intervention, ensuring long-term stability of user clusters. + +- High reliability + + The proven large-scale reliability and long-term stability of MRS meet enterprise-level high reliability requirements. In addition, MRS supports automatic data backup across AZs and regions, as well as automatic anti-affinity. It allows VMs to be distributed on different physical machines. diff --git a/umn/source/preparing_a_user/creating_a_custom_policy.rst b/umn/source/preparing_a_user/creating_a_custom_policy.rst new file mode 100644 index 0000000..6495990 --- /dev/null +++ b/umn/source/preparing_a_user/creating_a_custom_policy.rst @@ -0,0 +1,316 @@ +:original_name: mrs_01_0455.html + +.. _mrs_01_0455: + +Creating a Custom Policy +======================== + +Custom policies can be created to supplement the system-defined policies of MRS. For the actions that can be added to custom policies, see **Permissions Policies and Supported Actions** > **Introduction** in MapReduce Service API Reference. + +You can create custom policies in either of the following ways: + +- Visual editor: Select cloud services, actions, resources, and request conditions. This does not require knowledge of policy syntax. +- JSON: Edit JSON policies from scratch or based on an existing policy. + +Example Custom Policies +----------------------- + +- Example 1: Allowing users to create MRS clusters only + + .. code-block:: + + { + "Version": "1.1", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "mrs:cluster:create", + "ecs:*:*", + "bms:*:*", + "evs:*:*", + "vpc:*:*", + "smn:*:*" + ] + } + ] + } + +- Example 2: Allowing users to resize an MRS cluster + + .. code-block:: + + { + "Version": "1.1", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "mrs:cluster:resize" + ] + } + ] + } + +- Example 3: Allowing users to create a cluster, create and execute a job, and delete a single job, but denying cluster deletion + + .. 
code-block:: + + { + "Version": "1.1", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "mrs:cluster:create", + "mrs:job:submit", + "mrs:job:delete" + ] + }, + { + "Effect": "Deny", + "Action": [ + "mrs:cluster:delete" + ] + } + ] + } + +- Example 4: Allowing users to create an ECS cluster with the minimum permission + + .. note:: + + - If you need a key pair when creating a cluster, add the following permissions: **ecs:serverKeypairs:get** and **ecs:serverKeypairs:list**. + - Add the **kms:cmk:list** permission when encrypting data disks during cluster creation. + - Add the **mrs:alarm:subscribe** permission to enable the alarm function during cluster creation. + - Add the **rds:instance:list** permission to use external data sources during cluster creation. + + .. code-block:: + + { + "Version": "1.1", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "mrs:cluster:create" + ] + }, + { + "Effect": "Allow", + "Action": [ + "ecs:cloudServers:updateMetadata", + "ecs:cloudServerFlavors:get", + "ecs:cloudServerQuotas:get", + "ecs:servers:list", + "ecs:servers:get", + "ecs:cloudServers:delete", + "ecs:cloudServers:list", + "ecs:serverInterfaces:get", + "ecs:serverGroups:manage", + "ecs:servers:setMetadata", + "ecs:cloudServers:get", + "ecs:cloudServers:create" + ] + }, + { + "Effect": "Allow", + "Action": [ + "vpc:securityGroups:create", + "vpc:securityGroupRules:delete", + "vpc:vpcs:create", + "vpc:ports:create", + "vpc:securityGroups:get", + "vpc:subnets:create", + "vpc:privateIps:delete", + "vpc:quotas:list", + "vpc:networks:get", + "vpc:publicIps:list", + "vpc:securityGroups:delete", + "vpc:securityGroupRules:create", + "vpc:privateIps:create", + "vpc:ports:get", + "vpc:ports:delete", + "vpc:publicIps:update", + "vpc:subnets:get", + "vpc:publicIps:get", + "vpc:ports:update", + "vpc:vpcs:list" + ] + }, + { + "Effect": "Allow", + "Action": [ + "evs:quotas:get", + "evs:types:get" + ] + }, + { + "Effect": "Allow", + "Action": [ + "bms:serverFlavors:get" + ] + } + ] + } + +- Example 5: Allowing users to create a BMS cluster with the minimum permission + + .. note:: + + - If you need a key pair when creating a cluster, add the following permissions: **ecs:serverKeypairs:get** and **ecs:serverKeypairs:list**. + - Add the **kms:cmk:list** permission when encrypting data disks during cluster creation. + - Add the **mrs:alarm:subscribe** permission to enable the alarm function during cluster creation. + - Add the **rds:instance:list** permission to use external data sources during cluster creation. + + .. 
code-block:: + + { + "Version": "1.1", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "mrs:cluster:create" + ] + }, + { + "Effect": "Allow", + "Action": [ + "ecs:servers:list", + "ecs:servers:get", + "ecs:cloudServers:delete", + "ecs:serverInterfaces:get", + "ecs:serverGroups:manage", + "ecs:servers:setMetadata", + "ecs:cloudServers:create", + "ecs:cloudServerFlavors:get", + "ecs:cloudServerQuotas:get" + ] + }, + { + "Effect": "Allow", + "Action": [ + "vpc:securityGroups:create", + "vpc:securityGroupRules:delete", + "vpc:vpcs:create", + "vpc:ports:create", + "vpc:securityGroups:get", + "vpc:subnets:create", + "vpc:privateIps:delete", + "vpc:quotas:list", + "vpc:networks:get", + "vpc:publicIps:list", + "vpc:securityGroups:delete", + "vpc:securityGroupRules:create", + "vpc:privateIps:create", + "vpc:ports:get", + "vpc:ports:delete", + "vpc:publicIps:update", + "vpc:subnets:get", + "vpc:publicIps:get", + "vpc:ports:update", + "vpc:vpcs:list" + ] + }, + { + "Effect": "Allow", + "Action": [ + "evs:quotas:get", + "evs:types:get" + ] + }, + { + "Effect": "Allow", + "Action": [ + "bms:servers:get", + "bms:servers:list", + "bms:serverQuotas:get", + "bms:servers:updateMetadata", + "bms:serverFlavors:get" + ] + } + ] + } + +- Example 6: Allowing users to create a hybrid ECS and BMS cluster with the minimum permission + + .. note:: + + - If you need a key pair when creating a cluster, add the following permissions: **ecs:serverKeypairs:get** and **ecs:serverKeypairs:list**. + - Add the **kms:cmk:list** permission when encrypting data disks during cluster creation. + - Add the **mrs:alarm:subscribe** permission to enable the alarm function during cluster creation. + - Add the **rds:instance:list** permission to use external data sources during cluster creation. + + .. code-block:: + + { + "Version": "1.1", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "mrs:cluster:create" + ] + }, + { + "Effect": "Allow", + "Action": [ + "ecs:cloudServers:updateMetadata", + "ecs:cloudServerFlavors:get", + "ecs:cloudServerQuotas:get", + "ecs:servers:list", + "ecs:servers:get", + "ecs:cloudServers:delete", + "ecs:cloudServers:list", + "ecs:serverInterfaces:get", + "ecs:serverGroups:manage", + "ecs:servers:setMetadata", + "ecs:cloudServers:get", + "ecs:cloudServers:create" + ] + }, + { + "Effect": "Allow", + "Action": [ + "vpc:securityGroups:create", + "vpc:securityGroupRules:delete", + "vpc:vpcs:create", + "vpc:ports:create", + "vpc:securityGroups:get", + "vpc:subnets:create", + "vpc:privateIps:delete", + "vpc:quotas:list", + "vpc:networks:get", + "vpc:publicIps:list", + "vpc:securityGroups:delete", + "vpc:securityGroupRules:create", + "vpc:privateIps:create", + "vpc:ports:get", + "vpc:ports:delete", + "vpc:publicIps:update", + "vpc:subnets:get", + "vpc:publicIps:get", + "vpc:ports:update", + "vpc:vpcs:list" + ] + }, + { + "Effect": "Allow", + "Action": [ + "evs:quotas:get", + "evs:types:get" + ] + }, + { + "Effect": "Allow", + "Action": [ + "bms:servers:get", + "bms:servers:list", + "bms:serverQuotas:get", + "bms:servers:updateMetadata", + "bms:serverFlavors:get" + ] + } + ] + } diff --git a/umn/source/preparing_a_user/creating_an_mrs_user.rst b/umn/source/preparing_a_user/creating_an_mrs_user.rst new file mode 100644 index 0000000..24daada --- /dev/null +++ b/umn/source/preparing_a_user/creating_an_mrs_user.rst @@ -0,0 +1,159 @@ +:original_name: mrs_01_0453.html + +.. 
_mrs_01_0453: + +Creating an MRS User +==================== + +Use `IAM `__ to implement fine-grained permission control over your MRS. With IAM, you can: + +- Create IAM users under your cloud account for employees based on your enterprise's organizational structure so that each employee is allowed to access MRS resources using their unique security credential (IAM user). +- Grant only the permissions required for users to perform a specific task. +- Entrust a cloud account or cloud service to perform efficient O&M on your MRS resources. + +If your cloud account does not require IAM users, skip this section. + +This section describes the procedure for granting permissions (see :ref:`Figure 1 `). + +Prerequisites +------------- + +Learn about the permissions. For the permissions of other services, see `Permission Description `__. + +Process Flow +------------ + +.. _mrs_01_0453__fig8523123435310: + +.. figure:: /_static/images/en-us_image_0000001296217532.gif + :alt: **Figure 1** Process for granting MRS permissions + + **Figure 1** Process for granting MRS permissions + +#. .. _mrs_01_0453__li895020818018: + + `Create a user group and assign permissions to it `__. + + Create a user group on the IAM console, and assign MRS permissions to the group. + +#. `Create a user and add it to a user group `__. + + Create a user on the IAM console and add the user to the group created in :ref:`1. Create a user group and assign permissions to it `. + +#. Log in and verify permissions. + + Log in to the console by using the user created, and verify that the user has the granted permissions. + + - Choose **Service List** > **MapReduce Service**. Then click **Create** **Cluster** on the MRS console. If a message appears indicating that you have insufficient permissions to perform the operation, the **MRS ReadOnlyAccess** policy has already taken effect. + - Choose any other service in **Service List**. If a message appears indicating that you have insufficient permissions to access the service, the **MRS ReadOnlyAccess** policy has already taken effect. + +MRS Permission Description +-------------------------- + +By default, new IAM users do not have any permissions. To assign permissions to a user, add the user to one or more groups and assign permissions policies or roles to these groups. The user then inherits permissions from the groups it is a member of and can perform specified operations on cloud services based on the permissions. + +MRS is a project-level service deployed and accessed in specific physical regions. To assign permissions to a user group, specify **Scope** as **Region-specific projects** and select projects in the corresponding region for the permissions to take effect. If **All projects** is selected, the permissions will take effect for the user group in all region-specific projects. When accessing MRS, the users need to switch to a region where they have been authorized to use the MRS service. + +You can grant users permissions by using roles and policies. + +- Roles: A type of coarse-grained authorization mechanism that defines permissions related to user responsibilities. This mechanism provides only a limited number of service-level roles for authorization. When using roles to grant permissions, you need to also assign other roles on which the permissions depend to take effect. However, roles are not an ideal choice for fine-grained authorization and secure access control. 
+- Policies: A type of fine-grained authorization mechanism that defines permissions required to perform operations on specific cloud resources under certain conditions. This mechanism allows for more flexible policy-based authorization, meeting requirements for secure access control. For example, you can grant MRS users only the permissions for performing specified operations on MRS clusters, such as creating a cluster and querying a cluster list rather than deleting a cluster. Most policies define permissions based on APIs. + +:ref:`Table 1 ` lists all the system policies supported by MRS. + +.. _mrs_01_0453__en-us_topic_0264277873_table13757124105911: + +.. table:: **Table 1** MRS system policies + + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Policy | Description | Type | + +=======================+==========================================================================================================================================+=======================+ + | MRS FullAccess | Administrator permissions for MRS. Users granted these permissions can operate and use all MRS resources. | Fine-grained policy | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | MRS CommonOperations | Common user permissions for MRS. Users granted these permissions can use MRS but cannot add or delete resources. | Fine-grained policy | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | MRS ReadOnlyAccess | Read-only permission for MRS. Users granted these permissions can only view MRS resources. | Fine-grained policy | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | MRS Administrator | Permissions: | RBAC policy | + | | | | + | | - All operations on MRS | | + | | - Users with permissions of this policy must also be granted permissions of the **Tenant Guest** and **Server Administrator** policies. | | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + +:ref:`Table 2 ` lists the common operations supported by each system-defined policy or role of MRS. Select the policies or roles as required. + +.. _mrs_01_0453__en-us_topic_0264277873_table64841036185016: + +.. 
table:: **Table 2** Common operations supported by each system-defined policy + + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Operation | MRS FullAccess | MRS CommonOperations | MRS ReadOnlyAccess | MRS Administrator | + +===================================+================+======================+====================+===================+ + | Creating a cluster | Y | x | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Resizing a cluster | Y | x | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Upgrading node specifications | Y | x | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Deleting a cluster | Y | x | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying cluster details | Y | Y | Y | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying a cluster list | Y | Y | Y | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Configuring an auto scaling rule | Y | x | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying a host list | Y | Y | Y | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying operation logs | Y | Y | Y | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Creating and executing a job | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Stopping a job | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Deleting a single job | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Deleting jobs in batches | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying job details | Y | Y | Y | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying a job list | Y | Y | Y | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Creating a folder | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Deleting a file | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying a file list | Y | Y | Y | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Operating cluster tags in batches | Y | Y | x | Y | + 
+-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Creating a single cluster tag | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Deleting a single cluster tag | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying a resource list by tag | Y | Y | Y | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying cluster tags | Y | Y | Y | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Accessing Manager | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying a patch list | Y | Y | Y | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Installing a patch | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Uninstalling a patch | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Authorizing O&M channels | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Sharing O&M channel logs | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying an alarm list | Y | Y | Y | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Subscribing to alarm notification | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Submitting an SQL statement | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Querying SQL results | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ + | Canceling an SQL execution task | Y | Y | x | Y | + +-----------------------------------+----------------+----------------------+--------------------+-------------------+ diff --git a/umn/source/preparing_a_user/index.rst b/umn/source/preparing_a_user/index.rst new file mode 100644 index 0000000..5db9d33 --- /dev/null +++ b/umn/source/preparing_a_user/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0452.html + +.. _mrs_01_0452: + +Preparing a User +================ + +- :ref:`Creating an MRS User ` +- :ref:`Creating a Custom Policy ` +- :ref:`Synchronizing IAM Users to MRS ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + creating_an_mrs_user + creating_a_custom_policy + synchronizing_iam_users_to_mrs diff --git a/umn/source/preparing_a_user/synchronizing_iam_users_to_mrs.rst b/umn/source/preparing_a_user/synchronizing_iam_users_to_mrs.rst new file mode 100644 index 0000000..facc4d7 --- /dev/null +++ b/umn/source/preparing_a_user/synchronizing_iam_users_to_mrs.rst @@ -0,0 +1,115 @@ +:original_name: mrs_01_0495.html + +.. 
_mrs_01_0495: + +Synchronizing IAM Users to MRS +============================== + +IAM user synchronization is to synchronize IAM users bound with MRS policies to the MRS system and create accounts with the same usernames but different passwords as the IAM users. Then, you can use an IAM username (the password needs to be reset by user **admin** of Manager) to log in to Manager for cluster management, and submit jobs on the GUI in a cluster with Kerberos authentication enabled. + +:ref:`Table 1 ` compares IAM users' permission policies and the synchronized users' permissions on MRS. For details about the default permissions on Manager, see :ref:`Users and Permissions of MRS Clusters `. + +.. _mrs_01_0495__table3878619101919: + +.. table:: **Table 1** Policy and permission mapping after synchronization + + +--------------+-----------------------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+ + | Policy Type | IAM Policy | User's Default Permissions on MRS After Synchronization | Have Permission to Perform the Synchronization | Have Permission to Submit Jobs | + +==============+===========================================================+=========================================================+===============================================================================================================================================+================================+ + | Fine-grained | MRS ReadOnlyAccess | Manager_viewer | No | No | + +--------------+-----------------------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+ + | | MRS CommonOperations | - Manager_viewer | No | Yes | + | | | - default | | | + | | | - launcher-job | | | + +--------------+-----------------------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+ + | | MRS FullAccess | - Manager_administrator | Yes | Yes | + | | | - Manager_auditor | | | + | | | - Manager_operator | | | + | | | - Manager_tenant | | | + | | | - Manager_viewer | | | + | | | - System_administrator | | | + | | | - default | | | + | | | - launcher-job | | | + +--------------+-----------------------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+ + | RBAC | MRS Administrator | - Manager_administrator | No | Yes | + | | | - Manager_auditor | | | + | | | - Manager_operator | | | + | | | - Manager_tenant | | | + | | | - Manager_viewer | | | + | | | - System_administrator | | | + | | | - default | | | + | | | - launcher-job | | | + 
+--------------+-----------------------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+ + | | Server Administrator, Tenant Guest, and MRS Administrator | - Manager_administrator | Yes | Yes | + | | | - Manager_auditor | | | + | | | - Manager_operator | | | + | | | - Manager_tenant | | | + | | | - Manager_viewer | | | + | | | - System_administrator | | | + | | | - default | | | + | | | - launcher-job | | | + +--------------+-----------------------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+ + | | Tenant Administrator | - Manager_administrator | Yes | Yes | + | | | - Manager_auditor | | | + | | | - Manager_operator | | | + | | | - Manager_tenant | | | + | | | - Manager_viewer | | | + | | | - System_administrator | | | + | | | - default | | | + | | | - launcher-job | | | + +--------------+-----------------------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+ + | Custom | Custom policy | - Manager_viewer | - If custom policies use RBAC policies as a template, refer to the RBAC policies. | Yes | + | | | - default | - If custom policies use fine-grained policies as a template, refer to the fine-grained policies. The fine-grained policies are recommended. | | + | | | - launcher-job | | | + +--------------+-----------------------------------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+ + +.. note:: + + To facilitate user permission management, use fine-grained policies rather than RBAC policies. In fine-grained policies, the Deny action takes precedence over other actions. + + - A user has permission to synchronize IAM users only when the user has the Tenant Administrator role or has the Server Administrator, Tenant Guest, and MRS Administrator roles at the same time. + - A user with the **action:mrs:cluster:syncUser** policy has permission to synchronize IAM users. + +Procedure +--------- + +#. Create a user and authorize the user to use MRS. For details, see :ref:`Creating an MRS User `. + +#. Log in to the MRS management console and create a cluster. For details, see :ref:`Creating a Custom Cluster `. + +#. In the left navigation pane, choose **Clusters** > **Active Clusters**. Click the cluster name to go to the cluster details page. + +#. .. _mrs_01_0495__li6999515311: + + In the **Basic Information** area on the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users. + +#. After a synchronization request is sent, choose **Operation Logs** in the left navigation pane on the MRS console to check whether the synchronization is successful. For details about the logs, see :ref:`Viewing MRS Operation Logs `. + +#. 
After the synchronization is successful, use the user synchronized with IAM to perform subsequent operations. + + .. note:: + + - When the policy of the user group to which the IAM user belongs changes from **MRS ReadOnlyAccess** to **MRS CommonOperations**, **MRS FullAccess**, or **MRS Administrator**, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** (System Security Services Daemon) cache of cluster nodes needs time to be updated. Then, submit a job. Otherwise, the job may fail to be submitted. + - When the policy of the user group to which the IAM user belongs changes from **MRS CommonOperations**, **MRS FullAccess**, or **MRS Administrator** to **MRS ReadOnlyAccess**, wait for 5 minutes until the new policy takes effect after the synchronization is complete because the **SSSD** cache of cluster nodes needs time to be updated. + - After you click **Synchronize** on the right side of **IAM User Sync**, the cluster details page is blank for a short time, because user data is being synchronized. The page will be properly displayed after the data synchronization is complete. + + - Submitting jobs in a security cluster: Users can submit jobs using the job management function on the GUI in the security cluster. For details, see :ref:`Running a MapReduce Job `. + - All tabs are displayed on the cluster details page, including **Components**, **Tenants**, and **Backups & Restorations**. + - Logging in to Manager + + a. Log in to Manager as user **admin**. For details, see :ref:`Accessing Manager `. + + b. .. _mrs_01_0495__li169901714175: + + Initialize the password of the user synchronized with IAM. For details, see :ref:`Initializing the Password of a System User `. + + c. Modify the role bound to the user group to which the user belongs to control user permissions on Manager. For details, see :ref:`Related Tasks `. For details about how to create and modify a role, see :ref:`Creating a Role `. After the component role bound to the user group to which the user belongs is modified, it takes some time for the role permissions to take effect. + + d. Log in to Manager using the user synchronized with IAM and the password after the initialization in :ref:`6.b `. + + .. note:: + + If the IAM user's permission changes, go to :ref:`4 ` to perform second synchronization. After the second synchronization, a system user's permissions are the union of the permissions defined in the IAM system policy and the permissions of roles added by the system user on Manager. After the second synchronization, a custom user's permissions are subject to the permissions configured on Manager. + + - System user: If all user groups to which an IAM user belongs are bound to system policies (RABC policies and fine-grained policies belong to system policies), the IAM user is a system user. + - Custom user: If the user group to which an IAM user belongs is bound to any custom policy, the IAM user is a custom user. diff --git a/umn/source/security_description/index.rst b/umn/source/security_description/index.rst new file mode 100644 index 0000000..6a8efa8 --- /dev/null +++ b/umn/source/security_description/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_0528.html + +.. _mrs_01_0528: + +Security Description +==================== + +- :ref:`Security Configuration Suggestions for Clusters with Kerberos Authentication Disabled ` +- :ref:`Security Authentication Principles and Mechanisms ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + security_configuration_suggestions_for_clusters_with_kerberos_authentication_disabled + security_authentication_principles_and_mechanisms diff --git a/umn/source/security_description/security_authentication_principles_and_mechanisms.rst b/umn/source/security_description/security_authentication_principles_and_mechanisms.rst new file mode 100644 index 0000000..bbd30a4 --- /dev/null +++ b/umn/source/security_description/security_authentication_principles_and_mechanisms.rst @@ -0,0 +1,181 @@ +:original_name: mrs_07_020001.html + +.. _mrs_07_020001: + +Security Authentication Principles and Mechanisms +================================================= + +Function +-------- + +For clusters in security mode with Kerberos authentication enabled, security authentication is required during application development. + +Kerberos, named after the ferocious three-headed guard dog of Hades in Greek mythology, now refers to a security authentication concept. The Kerberos protocol adopts a client-server model and cryptographic algorithms such as AES (Advanced Encryption Standard). It provides mutual authentication, that is, both the client and the server can verify each other's identity. Kerberos is used to prevent interception and replay attacks and to protect data integrity. It is a key management system based on a symmetric key mechanism. + +Architecture +------------ + +The Kerberos architecture is shown in :ref:`Figure 1 ` and the modules are described in :ref:`Table 1 `. + +.. _mrs_07_020001__en-us_topic_0269458072_f203af1d5f55e4832ba0f6891e6c38839: + +.. figure:: /_static/images/en-us_image_0000001349137697.png + :alt: **Figure 1** Kerberos architecture + + **Figure 1** Kerberos architecture + +.. _mrs_07_020001__en-us_topic_0269458072_t16ccb1745d994539a7b6d7b4d728ff02: + +.. table:: **Table 1** Module description + + +--------------------+----------------------------------------------------------------------------------------------+ + | Module             | Description                                                                                    | + +====================+==============================================================================================+ + | Application Client | An application client, which is usually an application that submits tasks or jobs             | + +--------------------+----------------------------------------------------------------------------------------------+ + | Application Server | An application server, which is usually an application that an application client accesses    | + +--------------------+----------------------------------------------------------------------------------------------+ + | Kerberos           | A service that provides security authentication                                               | + +--------------------+----------------------------------------------------------------------------------------------+ + | KerberosAdmin      | A process that provides authentication user management                                        | + +--------------------+----------------------------------------------------------------------------------------------+ + | KerberosServer     | A process that provides authentication ticket distribution                                    | + +--------------------+----------------------------------------------------------------------------------------------+ + +The process and principle are described as follows: + +An application client can be a service in a cluster or a secondary development application of the customer. An application client can submit tasks or jobs to an application service. + +#.
Before submitting a task or job, the application client needs to apply for a ticket granting ticket (TGT) from the Kerberos service to establish a secure session with the Kerberos server. +#. After receiving the TGT request, the Kerberos service resolves parameters in the request to generate a TGT, and uses the key of the username specified by the client to encrypt the response. +#. After receiving the TGT response, the application client (based on the underlying RPC) resolves the response and obtains the TGT, and then applies for a server ticket (ST) of the application server from the Kerberos service. +#. After receiving the ST request, the Kerberos service verifies the TGT validity in the request and generates an ST of the application service, and then uses the application service key to encrypt the response. +#. After receiving the ST response, the application client packages the ST into a request and sends the request to the application server. +#. After receiving the request, the application server uses its local application service key to resolve the ST. After successful verification, the request becomes valid. + +Basic Concepts +-------------- + +The following concepts can help users learn the Kerberos architecture quickly and understand the Kerberos service better. The following uses security authentication for HDFS as an example. + +**TGT** + +A TGT is generated by the Kerberos service and used to establish a secure session between an application and the Kerberos server. The validity period of a TGT is 24 hours. After 24 hours, the TGT expires automatically. + +The following describes how to apply for a TGT (HDFS is used as an example): + +#. Obtain a TGT through an API provided by HDFS. + + .. code-block:: + + /** + * login Kerberos to get TGT, if the cluster is in security mode + * @throws IOException if login is failed + */ + private void login() throws IOException { + // not security mode, just return + if (! "kerberos".equalsIgnoreCase(conf.get("hadoop.security.authentication"))) { + return; + } + + //security mode + System.setProperty("java.security.krb5.conf", PATH_TO_KRB5_CONF); + + UserGroupInformation.setConfiguration(conf); + UserGroupInformation.loginUserFromKeytab(PRNCIPAL_NAME, PATH_TO_KEYTAB); + } + +#. Run shell commands on the client in kinit mode. + +**ST** + +An ST is generated by the Kerberos service and used to establish a secure session between an application and application service. An ST is valid only once. + +In FusionInsight products, the generation of an ST is based on the Hadoop-RPC communication. The underlying RPC submits a request to the Kerberos server and the Kerberos server generates an ST. + +Sample Authentication Code +-------------------------- + +.. 
code-block:: + + import java.io.IOException; + + import org.apache.hadoop.conf.Configuration; + import org.apache.hadoop.fs.FileStatus; + import org.apache.hadoop.fs.FileSystem; + import org.apache.hadoop.fs.Path; + import org.apache.hadoop.security.UserGroupInformation; + + public class KerberosTest { + private static String PATH_TO_HDFS_SITE_XML = KerberosTest.class.getClassLoader().getResource("hdfs-site.xml") + .getPath(); + private static String PATH_TO_CORE_SITE_XML = KerberosTest.class.getClassLoader().getResource("core-site.xml") + .getPath(); + private static String PATH_TO_KEYTAB = KerberosTest.class.getClassLoader().getResource("user.keytab").getPath(); + private static String PATH_TO_KRB5_CONF = KerberosTest.class.getClassLoader().getResource("krb5.conf").getPath(); + private static String PRNCIPAL_NAME = "develop"; + private FileSystem fs; + private Configuration conf; + + /** + * initialize Configuration + */ + private void initConf() { + conf = new Configuration(); + + // add configuration files + conf.addResource(new Path(PATH_TO_HDFS_SITE_XML)); + conf.addResource(new Path(PATH_TO_CORE_SITE_XML)); + } + + /** + * login Kerberos to get TGT, if the cluster is in security mode + * @throws IOException if login is failed + */ + private void login() throws IOException { + // not security mode, just return + if (! "kerberos".equalsIgnoreCase(conf.get("hadoop.security.authentication"))) { + return; + } + + //security mode + System.setProperty("java.security.krb5.conf", PATH_TO_KRB5_CONF); + + UserGroupInformation.setConfiguration(conf); + UserGroupInformation.loginUserFromKeytab(PRNCIPAL_NAME, PATH_TO_KEYTAB); + } + + /** + * initialize FileSystem, and get ST from Kerberos + * @throws IOException + */ + private void initFileSystem() throws IOException { + fs = FileSystem.get(conf); + } + + /** + * An example to access the HDFS + * @throws IOException + */ + private void doSth() throws IOException { + Path path = new Path("/tmp"); + FileStatus fStatus = fs.getFileStatus(path); + System.out.println("Status of " + path + " is " + fStatus); + //other thing + } + + + public static void main(String[] args) throws Exception { + KerberosTest test = new KerberosTest(); + test.initConf(); + test.login(); + test.initFileSystem(); + test.doSth(); + } + } + +.. note:: + + #. During Kerberos authentication, you need to configure the file parameters required for configuring the Kerberos authentication, including the keytab path, Kerberos authentication username, and the **krb5.conf** configuration file of the client for Kerberos authentication. + #. Method **login()** indicates calling the Hadoop API to perform Kerberos authentication and generating a TGT. + #. Method **doSth** indicates calling the Hadoop API to access the file system. In this situation, the underlying RPC automatically carries the TGT to Kerberos for verification and then an ST is generated. diff --git a/umn/source/security_description/security_configuration_suggestions_for_clusters_with_kerberos_authentication_disabled.rst b/umn/source/security_description/security_configuration_suggestions_for_clusters_with_kerberos_authentication_disabled.rst new file mode 100644 index 0000000..1090f63 --- /dev/null +++ b/umn/source/security_description/security_configuration_suggestions_for_clusters_with_kerberos_authentication_disabled.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0419.html + +.. 
_mrs_01_0419: + +Security Configuration Suggestions for Clusters with Kerberos Authentication Disabled +===================================================================================== + +The Hadoop community version provides two authentication modes: Kerberos authentication (security mode) and Simple authentication (normal mode). When creating a cluster, you can choose to enable or disable Kerberos authentication. + +Clusters in security mode use the Kerberos protocol for security authentication. + +In normal mode, MRS cluster components use a native open source authentication mechanism, which is typically Simple authentication. If Simple authentication is used, authentication is automatically performed by a client user (for example, user **root**) by default when a client connects to a server. The authentication is imperceptible to the administrator or service user. In addition, when being executed, the client may even pretend to be any user (including **superuser**) by injecting **UserGroupInformation**. Cluster resource management and data control APIs are not authenticated on the server and are easily exploited and attacked by hackers. + +Therefore, in normal mode, network access permissions must be strictly controlled to ensure cluster security. You are advised to perform the following operations to ensure cluster security. + +- Deploy service applications on ECSs in the same VPC and subnet and avoid accessing MRS clusters through an external network. +- Configure security group rules to strictly control the access scope. Do not configure access rules that allow **Any** or **0.0.0.0** for the inbound direction of MRS cluster ports. +- If you want to access the native pages of the components in the cluster from the external, follow instructions in :ref:`Creating an SSH Channel for Connecting to an MRS Cluster and Configuring the Browser ` for configuration. diff --git a/umn/source/using_an_mrs_client/index.rst b/umn/source/using_an_mrs_client/index.rst new file mode 100644 index 0000000..1461158 --- /dev/null +++ b/umn/source/using_an_mrs_client/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0088.html + +.. _mrs_01_0088: + +Using an MRS Client +=================== + +- :ref:`Installing a Client ` +- :ref:`Updating a Client ` +- :ref:`Using the Client of Each Component ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + installing_a_client/index + updating_a_client/index + using_the_client_of_each_component/index diff --git a/umn/source/using_an_mrs_client/installing_a_client/index.rst b/umn/source/using_an_mrs_client/installing_a_client/index.rst new file mode 100644 index 0000000..2471218 --- /dev/null +++ b/umn/source/using_an_mrs_client/installing_a_client/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_24212.html + +.. _mrs_01_24212: + +Installing a Client +=================== + +- :ref:`Installing a Client (Version 3.x or Later) ` +- :ref:`Installing a Client (Versions Earlier Than 3.x) ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + installing_a_client_version_3.x_or_later + installing_a_client_versions_earlier_than_3.x diff --git a/umn/source/using_an_mrs_client/installing_a_client/installing_a_client_version_3.x_or_later.rst b/umn/source/using_an_mrs_client/installing_a_client/installing_a_client_version_3.x_or_later.rst new file mode 100644 index 0000000..1f4bda1 --- /dev/null +++ b/umn/source/using_an_mrs_client/installing_a_client/installing_a_client_version_3.x_or_later.rst @@ -0,0 +1,231 @@ +:original_name: mrs_01_0090.html + +.. 
_mrs_01_0090: + +Installing a Client (Version 3.x or Later) +========================================== + +Scenario +-------- + +This section describes how to install clients of all services (excluding Flume) in an MRS cluster. For details about how to install the Flume client, see `Installing the Flume Client `__. + +A client can be installed on a node inside or outside the cluster. This section uses the installation directory **//opt/client** as an example. Replace it with the actual one. + +.. _mrs_01_0090__en-us_topic_0270713152_en-us_topic_0264269418_section3219221104310: + +Prerequisites +------------- + +- A Linux ECS has been prepared. For details about the supported OS of the ECS, see :ref:`Table 1 `. + + .. _mrs_01_0090__en-us_topic_0270713152_en-us_topic_0264269418_table40818788104630: + + .. table:: **Table 1** Reference list + + +-------------------------+--------+-------------------------------------------------+ + | CPU Architecture | OS | Supported Version | + +=========================+========+=================================================+ + | x86 computing | Euler | EulerOS 2.5 | + +-------------------------+--------+-------------------------------------------------+ + | | SUSE | SUSE Linux Enterprise Server 12 SP4 (SUSE 12.4) | + +-------------------------+--------+-------------------------------------------------+ + | | RedHat | Red Hat-7.5-x86_64 (Red Hat 7.5) | + +-------------------------+--------+-------------------------------------------------+ + | | CentOS | CentOS 7.6 | + +-------------------------+--------+-------------------------------------------------+ + | Kunpeng computing (Arm) | Euler | EulerOS 2.8 | + +-------------------------+--------+-------------------------------------------------+ + | | CentOS | CentOS 7.6 | + +-------------------------+--------+-------------------------------------------------+ + + In addition, sufficient disk space is allocated for the ECS, for example, 40 GB. + +- The ECS and the MRS cluster are in the same VPC. + +- The security group of the ECS must be the same as that of the master node in the MRS cluster. + +- The NTP service has been installed on the ECS OS and is running properly. + + If the NTP service is not installed, run the **yum install ntp -y** command to install it when the **yum** source is configured. + +- A user can log in to the Linux ECS using the password (in SSH mode). + +.. _mrs_01_0090__section181806577218: + +Installing a Client on a Node Inside a Cluster +---------------------------------------------- + +#. Obtain the software package. + + Log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Click the name of the cluster to be operated in the **Cluster** drop-down list. + + Choose **More > Download Client**. The **Download Cluster Client** dialog box is displayed. + + .. note:: + + In the scenario where only one client is to be installed, choose **Cluster >** **Service >** *Service name* **> More > Download Client**. The **Download Client** dialog box is displayed. + +#. Set the client type to **Complete Client**. + + **Configuration Files Only** is to download client configuration files in the following scenario: After a complete client is downloaded and installed and administrators modify server configurations on Manager, developers need to update the configuration files during application development. + + The platform type can be set to **x86_64** or **aarch64**. 
+ + - **x86_64**: indicates the client software package that can be deployed on the x86 servers. + - **aarch64**: indicates the client software package that can be deployed on the TaiShan servers. + + .. note:: + + The cluster supports two types of clients: **x86_64** and **aarch64**. The client type must match the architecture of the node for installing the client. Otherwise, client installation will fail. + +#. Select **Save to Path** and click **OK** to generate the client file. + + The generated file is stored in the **/tmp/FusionInsight-Client** directory on the active management node by default. You can also store the client file in a directory on which user **omm** has the read, write, and execute permissions. Copy the software package to the file directory on the server where the client is to be installed as user **omm** or **root**. + + The name of the client software package is in the follow format: **FusionInsight_Cluster\_\ <**\ *Cluster ID*\ **>\ \_Services_Client.tar**. In this section, the cluster ID **1** is used as an example. Replace it with the actual cluster ID. + + The following steps and sections use **FusionInsight_Cluster_1_Services_Client.tar** as an example. + + .. note:: + + If you cannot obtain the permissions of user **root**, use user **omm**. + + To install the client on another node in the cluster, run the following command to copy the client to the node where the client is to be installed: + + **scp -p /**\ *tmp/FusionInsight-Client*\ **/FusionInsight_Cluster_1_Services_Client.tar** *IP address of the node where the client is to be installed:/opt/Bigdata/client* + +#. Log in to the server where the client software package is located as user **user_client**. + +#. Decompress the software package. + + Go to the directory where the installation package is stored, such as **/tmp/FusionInsight-Client**. Run the following command to decompress the installation package to a local directory: + + **tar -xvf** **FusionInsight_Cluster_1_Services_Client.tar** + +#. Verify the software package. + + Run the following command to verify the decompressed file and check whether the command output is consistent with the information in the **sha256** file. + + **sha256sum -c** **FusionInsight_Cluster_1_Services_ClientConfig.tar.sha256** + + .. code-block:: + + FusionInsight_Cluster_1_Services_ClientConfig.tar: OK + +#. Decompress the obtained installation file. + + **tar -xvf** **FusionInsight_Cluster_1_Services_ClientConfig.tar** + +#. Go to the directory where the installation package is stored, and run the following command to install the client to a specified directory (an absolute path), for example, **/opt/client**: + + **cd /tmp/FusionInsight-Client/FusionInsight\_Cluster_1_Services_ClientConfig** + + Run the **./install.sh /opt/client** command to install the client. The client is successfully installed if information similar to the following is displayed: + + .. code-block:: + + The component client is installed successfully + + .. note:: + + - If the clients of all or some services use the **/opt/client** directory, other directories must be used when you install other service clients. + - You must delete the client installation directory when uninstalling a client. + - To ensure that an installed client can only be used by the installation user (for example, **user_client**), add parameter **-o** during the installation. That is, run the **./install.sh /opt/client -o** command to install the client. 
+ - If an HBase client is installed, it is recommended that the client installation directory contain only uppercase and lowercase letters, digits, and characters ``(_-?.@+=)`` due to the limitation of the Ruby syntax used by HBase. + +Using a Client +-------------- + +#. On the node where the client is installed, run the **sudo su - omm** command to switch the user. Run the following command to go to the client directory: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + Example: **kinit admin** + + .. note:: + + User **admin** is created by default for MRS clusters with Kerberos authentication enabled and is used for administrators to maintain the clusters. + +#. Run the client command of a component directly. + + For example, run the **hdfs dfs -ls /** command to view files in the HDFS root directory. + +Installing a Client on a Node Outside a Cluster +----------------------------------------------- + +#. Create an ECS that meets the requirements in :ref:`Prerequisites `. +#. Perform NTP time synchronization to synchronize the time of nodes outside the cluster with that of the MRS cluster. + + a. Run the **vi /etc/ntp.conf** command to edit the NTP client configuration file, add the IP addresses of the master node in the MRS cluster, and comment out the IP address of other servers. + + .. code-block:: + + server master1_ip prefer + server master2_ip + + + .. figure:: /_static/images/en-us_image_0000001441097913.png + :alt: **Figure 1** Adding the master node IP addresses + + **Figure 1** Adding the master node IP addresses + + b. Run the **service ntpd stop** command to stop the NTP service. + + c. Run the following command to manually synchronize the time: + + **/usr/sbin/ntpdate** *192.168.10.8* + + .. note:: + + **192.168.10.8** indicates the IP address of the active Master node. + + d. Run the **service ntpd start** or **systemctl restart ntpd** command to start the NTP service. + + e. Run the **ntpstat** command to check the time synchronization result. + +#. Perform the following steps to download the cluster client software package from FusionInsight Manager, copy the package to the ECS node, and install the client: + + a. Log in to FusionInsight Manager and download the cluster client to the specified directory on the active management node by referring to :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) ` and :ref:`Installing a Client on a Node Inside a Cluster `. + + b. Log in to the active management node as user **root** and run the following command to copy the client installation package to the target node: + + **scp -p /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_Client.tar** *IP address of the node where the client is to be installed*\ **:/tmp** + + c. Log in to the node on which the client is to be installed as the client user. + + Run the following commands to install the client. If the user does not have operation permissions on the client software package and client installation directory, grant the permissions using the **root** user. 
+ + **cd /tmp** + + **tar -xvf** **FusionInsight_Cluster_1_Services_Client.tar** + + **tar -xvf** **FusionInsight_Cluster_1_Services_ClientConfig.tar** + + **cd FusionInsight\_Cluster_1_Services_ClientConfig** + + **./install.sh /opt/client** + + d. Run the following commands to switch to the client directory and configure environment variables: + + **cd /opt/client** + + **source bigdata_env** + + e. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + Example: **kinit admin** + + f. Run the client command of a component directly. + + For example, run the **hdfs dfs -ls /** command to view files in the HDFS root directory. diff --git a/umn/source/using_an_mrs_client/installing_a_client/installing_a_client_versions_earlier_than_3.x.rst b/umn/source/using_an_mrs_client/installing_a_client/installing_a_client_versions_earlier_than_3.x.rst new file mode 100644 index 0000000..c9ddb82 --- /dev/null +++ b/umn/source/using_an_mrs_client/installing_a_client/installing_a_client_versions_earlier_than_3.x.rst @@ -0,0 +1,260 @@ +:original_name: mrs_01_0091.html + +.. _mrs_01_0091: + +Installing a Client (Versions Earlier Than 3.x) +=============================================== + +Scenario +-------- + +An MRS client is required. The MRS cluster client can be installed on the Master or Core node in the cluster or on a node outside the cluster. + +After a cluster of versions earlier than MRS 3.x is created, a client is installed on the active Master node by default. You can directly use the client. The installation directory is **/opt/client**. + +For details about how to install a client of MRS 3.x or later, see :ref:`Installing a Client (Version 3.x or Later) `. + +.. note:: + + If a client has been installed on the node outside the MRS cluster and the client only needs to be updated, update the client using the user who installed the client, for example, user **root**. + +Prerequisites +------------- + +- An ECS has been prepared. For details about the OS and its version of the ECS, see :ref:`Table 1 `. + + .. _mrs_01_0091__table40818788104630: + + .. table:: **Table 1** Reference list + + +-----------------------------------+-----------------------------------+ + | OS | Supported Version | + +===================================+===================================+ + | EulerOS | - Available: EulerOS 2.2 | + | | - Available: EulerOS 2.3 | + | | - Available: EulerOS 2.5 | + +-----------------------------------+-----------------------------------+ + + For example, a user can select the enterprise image **Enterprise_SLES11_SP4_latest(4GB)** or standard image **Standard_CentOS_7.2_latest(4GB)** to prepare the OS for an ECS. + + In addition, sufficient disk space is allocated for the ECS, for example, 40 GB. + +- The ECS and the MRS cluster are in the same VPC. + +- The security group of the ECS is the same as that of the Master node of the MRS cluster. + + If this requirement is not met, modify the ECS security group or configure the inbound and outbound rules of the ECS security group to allow the ECS security group to be accessed by all security groups of MRS cluster nodes. + +- To enable users to log in to a Linux ECS using a password (SSH), see **Instances** *>* **Logging In to a Linux ECS** *>* **Login Using an SSH Password** *in the Elastic Cloud Server User Guide*. 
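+
+Before installing the client, a quick connectivity check from the prepared ECS can confirm the VPC and security group prerequisites above. The following is a minimal sketch; the IP address **192.168.10.8** is only an example of the active Master node address, consistent with the one used later in this section.
+
+.. code-block::
+
+   # Check basic network reachability to the active Master node (example IP address).
+   ping -c 3 192.168.10.8
+
+   # Confirm that SSH (port 22) on the Master node is reachable through the security group rules.
+   timeout 5 bash -c 'cat < /dev/null > /dev/tcp/192.168.10.8/22' && echo "Port 22 is reachable"
+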
+ +Installing a Client on the Core Node +------------------------------------ + +#. Log in to MRS Manager and choose **Services** > **Download Client** to download the client installation package to the active management node. + + .. note:: + + If only the client configuration file needs to be updated, see method 2 in :ref:`Updating a Client (Versions Earlier Than 3.x) `. + +#. Use the IP address to search for the active management node, and log in to the active management node using VNC. + +#. Log in to the active management node, and run the following command to switch the user: + + **sudo su - omm** + +#. On the MRS management console, view the IP address on the **Nodes** tab page of the specified cluster. + + Record the IP address of the Core node where the client is to be used. + +#. On the active management node, run the following command to copy the client installation package to the Core node: + + **scp -p /tmp/MRS-client/MRS\_Services_Client.tar** *IP address of the Core node*\ **:/opt/client** + +#. Log in to the Core node as user **root**. + + For details, see `Login Using an SSH Key `__. + +#. Run the following commands to install the client: + + **cd /opt/client** + + **tar -xvf** **MRS\_Services_Client.tar** + + **tar -xvf MRS\ \_\ Services_ClientConfig.tar** + + **cd /opt/client/MRS\_Services_ClientConfig** + + **./install.sh** *Client installation directory* + + For example, run the following command: + + **./install.sh /opt/client** + +#. For details about how to use the client, see :ref:`Using an MRS Client `. + +.. _mrs_01_0091__section8796733802: + +Using an MRS Client +------------------- + +#. On the node where the client is installed, run the **sudo su - omm** command to switch the user. Run the following command to go to the client directory: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + Example: **kinit admin** + + .. note:: + + User **admin** is created by default for MRS clusters with Kerberos authentication enabled and is used for administrators to maintain the clusters. + +#. Run the client command of a component directly. + + For example, run the **hdfs dfs -ls /** command to view files in the HDFS root directory. + +Installing a Client on a Node Outside the Cluster +------------------------------------------------- + +#. Create an ECS that meets the requirements in the prerequisites. + +#. .. _mrs_01_0091__li1148114223118: + + Log in to MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. Then, choose **Services**. + +#. Click **Download Client**. + +#. In **Client Type**, select **All client files**. + +#. In **Download To**, select **Remote host**. + +#. .. _mrs_01_0091__li24260068101924: + + Set **Host IP Address** to the IP address of the ECS, **Host Port** to **22**, and **Save Path** to **/home/linux**. + + - If the default port **22** for logging in to an ECS using SSH has been changed, set **Host Port** to the new port. + - **Save Path** contains a maximum of 256 characters. + +#. Set **Login User** to **root**. + + If other users are used, ensure that the users have read, write, and execute permission on the save path. + +#. In **SSH Private Key**, select and upload the key file used for creating cluster B. + +#. 
Click **OK** to generate a client file. + + If the following information is displayed, the client package is saved. Click **Close**. Obtain the client file from the save path on the remote host that is set when the client is downloaded. + + .. code-block:: text + + Client files downloaded to the remote host successfully. + + If the following information is displayed, check the username, password, and security group configurations of the remote host. Ensure that the username and password are correct and an inbound rule of the SSH (22) port has been added to the security group of the remote host. And then, go to :ref:`2 ` to download the client again. + + .. code-block:: text + + Failed to connect to the server. Please check the network connection or parameter settings. + + .. note:: + + Generating a client will occupy a large number of disk I/Os. You are advised not to download a client when the cluster is being installed, started, and patched, or in other unstable states. + +#. Log in to the ECS using VNC. For details, see **Instance** > **Logging In to a Linux** > **Logging In to a Linux** in the *Elastic Cloud Server* *User Guide* + + Log in to the ECS. For details, see `Login Using an SSH Key `__. Set the ECS password and log in to the ECS in VNC mode. + +#. Perform NTP time synchronization to synchronize the time of nodes outside the cluster with the time of the MRS cluster. + + a. Check whether the NTP service is installed. If it is not installed, run the **yum install ntp -y** command to install it. + + b. Run the **vim /etc/ntp.conf** command to edit the NTP client configuration file, add the IP address of the Master node in the MRS cluster, and comment out the IP addresses of other servers. + + .. code-block:: + + server master1_ip prefer + server master2_ip + + + .. figure:: /_static/images/en-us_image_0000001390618644.png + :alt: **Figure 1** Adding the Master node IP addresses + + **Figure 1** Adding the Master node IP addresses + + c. Run the **service ntpd stop** command to stop the NTP service. + + d. Run the following command to manually synchronize the time: + + **/usr/sbin/ntpdate** *192.168.10.8* + + .. note:: + + **192.168.10.8** indicates the IP address of the active Master node. + + e. Run the **service ntpd start** or **systemctl restart ntpd** command to start the NTP service. + + f. Run the **ntpstat** command to check the time synchronization result: + +#. On the ECS, switch to user **root** and copy the installation package in **Save Path** in :ref:`6 ` to the **/opt** directory. For example, if **Save Path** is set to **/home/linux**, run the following commands: + + **sudo su - root** + + **cp /home/linux/MRS_Services_Client.tar /opt** + +#. Run the following command in the **/opt** directory to decompress the package and obtain the verification file and the configuration package of the client: + + **tar -xvf MRS\_Services_Client.tar** + +#. Run the following command to verify the configuration file package of the client: + + **sha256sum -c MRS\_Services_ClientConfig.tar.sha256** + + The command output is as follows: + + .. code-block:: + + MRS_Services_ClientConfig.tar: OK + +#. Run the following command to decompress **MRS_Services_ClientConfig.tar**: + + **tar -xvf MRS\_Services_ClientConfig.tar** + +#. Run the following command to install the client to a new directory, for example, **/opt/Bigdata/client**. A directory is automatically generated during the client installation. 
+ + **sh /opt/MRS\_Services_ClientConfig/install.sh /opt/Bigdata/client** + + If the following information is displayed, the client has been successfully installed: + + .. code-block:: + + Components client installation is complete. + +#. Check whether the IP address of the ECS node is connected to the IP address of the cluster Master node. + + For example, run the following command: **ping** *Master node IP address*. + + - If yes, go to :ref:`18 `. + - If no, check whether the VPC and security group are correct and whether the ECS and the MRS cluster are in the same VPC and security group, and go to :ref:`18 `. + +#. .. _mrs_01_0091__li6406429718107: + + Run the following command to configure environment variables: + + **source /opt/Bigdata/client/bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + Example: **kinit admin** + +#. Run the client command of a component. + + For example, run the following command to query the HDFS directory: + + **hdfs dfs -ls /** diff --git a/umn/source/using_an_mrs_client/updating_a_client/index.rst b/umn/source/using_an_mrs_client/updating_a_client/index.rst new file mode 100644 index 0000000..b72d043 --- /dev/null +++ b/umn/source/using_an_mrs_client/updating_a_client/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_24213.html + +.. _mrs_01_24213: + +Updating a Client +================= + +- :ref:`Updating a Client (Version 3.x or Later) ` +- :ref:`Updating a Client (Versions Earlier Than 3.x) ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + updating_a_client_version_3.x_or_later + updating_a_client_versions_earlier_than_3.x diff --git a/umn/source/using_an_mrs_client/updating_a_client/updating_a_client_version_3.x_or_later.rst b/umn/source/using_an_mrs_client/updating_a_client/updating_a_client_version_3.x_or_later.rst new file mode 100644 index 0000000..126e8b7 --- /dev/null +++ b/umn/source/using_an_mrs_client/updating_a_client/updating_a_client_version_3.x_or_later.rst @@ -0,0 +1,84 @@ +:original_name: mrs_01_24209.html + +.. _mrs_01_24209: + +Updating a Client (Version 3.x or Later) +======================================== + +A cluster provides a client for you to connect to a server, view task results, or manage data. If you modify service configuration parameters on Manager and restart the service, you need to download and install the client again or use the configuration file to update the client. + +Updating the Client Configuration +--------------------------------- + +**Method 1**: + +#. Log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Click the name of the cluster to be operated in the **Cluster** drop-down list. + +#. Choose **More** > **Download Client** > **Configuration Files Only**. + + The generated compressed file contains the configuration files of all services. + +#. Determine whether to generate a configuration file on the cluster node. + + - If yes, select **Save to Path**, and click **OK** to generate the client file. By default, the client file is generated in **/tmp/FusionInsight-Client** on the active management node. You can also store the client file in other directories, and user **omm** has the read, write, and execute permissions on the directories. Then go to :ref:`4 `. + - If no, click **OK**, specify a local save path, and download the complete client. 
Wait until the download is complete and go to :ref:`4 `. + +#. .. _mrs_01_24209__admin_guide_000173_en-us_topic_0193213946_l6af983f03121493ca3526296f5b650c3: + + Use WinSCP to save the compressed file to the client installation directory, for example, **/opt/hadoopclient**, as the client installation user. + +#. Decompress the software package. + + Run the following commands to go to the directory where the client is installed, and decompress the file to a local directory. For example, the downloaded client file is **FusionInsight_Cluster_1_Services_Client.tar**. + + **cd /opt/hadoopclient** + + **tar -xvf** **FusionInsight_Cluster_1\_Services_Client.tar** + +#. Verify the software package. + + Run the following command to verify the decompressed file and check whether the command output is consistent with the information in the **sha256** file. + + **sha256sum -c** **FusionInsight\_\ Cluster_1\_\ Services_ClientConfig_ConfigFiles.tar.sha256** + + .. code-block:: + + FusionInsight_Cluster_1_Services_ClientConfig_ConfigFiles.tar: OK + +#. Decompress the package to obtain the configuration file. + + **tar -xvf FusionInsight\_\ Cluster_1\_\ Services_ClientConfig_ConfigFiles.tar** + +#. Run the following command in the client installation directory to update the client using the configuration file: + + **sh refreshConfig.sh** *Client installation directory* *Directory where the configuration file is located* + + For example, run the following command: + + **sh refreshConfig.sh** **/opt/hadoopclient /opt/hadoop\ client/FusionInsight\_Cluster_1_Services_ClientConfig\_ConfigFiles** + + If the following information is displayed, the configurations have been updated successfully. + + .. code-block:: + + Succeed to refresh components client config. + +**Method 2**: + +#. Log in to the client installation node as user **root**. + +#. Go to the client installation directory, for example, **/opt/hadoopclient** and run the following commands to update the configuration file: + + **cd /opt/hadoopclient** + + **sh autoRefreshConfig.sh** + +#. Enter the username and password of the FusionInsight Manager administrator and the floating IP address of FusionInsight Manager. + +#. Enter the names of the components whose configuration needs to be updated. Use commas (,) to separate the component names. Press **Enter** to update the configurations of all components if necessary. + + If the following information is displayed, the configurations have been updated successfully. + + .. code-block:: + + Succeed to refresh components client config. diff --git a/umn/source/using_an_mrs_client/updating_a_client/updating_a_client_versions_earlier_than_3.x.rst b/umn/source/using_an_mrs_client/updating_a_client/updating_a_client_versions_earlier_than_3.x.rst new file mode 100644 index 0000000..857d6f1 --- /dev/null +++ b/umn/source/using_an_mrs_client/updating_a_client/updating_a_client_versions_earlier_than_3.x.rst @@ -0,0 +1,197 @@ +:original_name: mrs_01_0089.html + +.. _mrs_01_0089: + +Updating a Client (Versions Earlier Than 3.x) +============================================= + +.. note:: + + This section applies to clusters of versions earlier than MRS 3.x. For MRS 3.x or later, see :ref:`Updating a Client (Version 3.x or Later) `. + +Updating a Client Configuration File +------------------------------------ + +**Scenario** + +An MRS cluster provides a client for you to connect to a server, view task results, or manage data. 
Before using an MRS client, you need to download and update the client configuration file if service configuration parameters are modified and a service is restarted or the service is merely restarted on MRS Manager. + +During cluster creation, the original client is stored in the **/opt/client** directory on all nodes in the cluster by default. After the cluster is created, only the client of a Master node can be directly used. To use the client of a Core node, you need to update the client configuration file first. + +**Procedure** + +**Method 1:** + +#. Log in to MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. Then, choose **Services**. + +#. Click **Download Client**. + + Set **Client Type** to **Only configuration files**, **Download To** to **Server**, and click **OK** to generate the client configuration file. The generated file is saved in the **/tmp/MRS-client** directory on the active management node by default. You can customize the file path. + +#. Query and log in to the active Master node. + +#. If you use the client in the cluster, run the following command to switch to user **omm**. If you use the client outside the cluster, switch to user **root**. + + **sudo su - omm** + +#. Run the following command to switch to the client directory, for example, **/opt/Bigdata/client**: + + **cd /opt/Bigdata/client** + +#. Run the following command to update client configurations: + + **sh refreshConfig.sh** *Client installation directory* *Full path of the client configuration file package* + + For example, run the following command: + + **sh refreshConfig.sh /opt/Bigdata/client /tmp/MRS-client/MRS_Services_Client.tar** + + If the following information is displayed, the configurations have been updated successfully. + + .. code-block:: + + ReFresh components client config is complete. + Succeed to refresh components client config. + +**Method 2: applicable to MRS 1.9.2 or later** + +#. After the cluster is installed, run the following command to switch to user **omm**. If you use the client outside the cluster, switch to user **root**. + + **sudo su - omm** + +#. Run the following command to switch to the client directory, for example, **/opt/Bigdata/client**: + + **cd /opt/Bigdata/client** + +#. Run the following command and enter the name of an MRS Manager user with the download permission and its password (for example, the username is **admin** and the password is the one set during cluster creation) as prompted to update client configurations. + + **sh autoRefreshConfig.sh** + +#. After the command is executed, the following information is displayed, where *XXX* indicates the name of the component installed in the cluster. To update client configurations of all components, press **Enter**. To update client configurations of some components, enter the component names and separate them with commas (,). + + .. code-block:: + + Components "xxx" have been installed in the cluster. Please input the comma-separated names of the components for which you want to update client configurations. If you press Enter without inputting any component name, the client configurations of all components will be updated: + + If the following information is displayed, the configurations have been updated successfully. + + .. code-block:: + + Succeed to refresh components client config. + + If the following information is displayed, the username or password is incorrect. + + .. code-block:: + + login manager failed,Incorrect username or password. + + .. 
note:: + + - This script automatically connects to the cluster and invokes the **refreshConfig.sh** script to download and update the client configuration file. + - By default, the client uses the floating IP address specified by **wsom=xxx** in the **Version** file in the installation directory to update the client configurations. To update the configuration file of another cluster, modify the value of **wsom=xxx** in the **Version** file to the floating IP address of the corresponding cluster before performing this step. + +Fully Updating the Original Client of the Active Master Node +------------------------------------------------------------ + +**Scenario** + +During cluster creation, the original client is stored in the **/opt/client** directory on all nodes in the cluster by default. The following uses **/opt/Bigdata/client** as an example. + +- For a normal MRS cluster, you will use the pre-installed client on a Master node to submit a job on the management console page. +- You can also use the pre-installed client on the Master node to connect to a server, view task results, and manage data. + +After installing the patch on the cluster, you need to update the client on the Master node to ensure that the functions of the built-in client are available. + +**Procedure** + +#. .. _mrs_01_0089__li6500547131416: + + Log in to MRS Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. Then, choose **Services**. + +#. Click **Download Client**. + + Set **Client Type** to **All client files**, **Download To** to **Server**, and click **OK** to generate the client configuration file. The generated file is saved in the **/tmp/MRS-client** directory on the active management node by default. You can customize the file path. + +#. .. _mrs_01_0089__li14850170195112: + + Query and log in to the active Master node. + +#. .. _mrs_01_0089__li3635762195625: + + On the ECS, switch to user **root** and copy the installation package to the **/opt** directory. + + **sudo su - root** + + **cp /tmp/MRS-client/MRS_Services_Client.tar /opt** + +#. Run the following command in the **/opt** directory to decompress the package and obtain the verification file and the configuration package of the client: + + **tar -xvf MRS\_Services_Client.tar** + +#. Run the following command to verify the configuration file package of the client: + + **sha256sum -c MRS\_Services_ClientConfig.tar.sha256** + + The command output is as follows: + + .. code-block:: + + MRS_Services_ClientConfig.tar: OK + +#. Run the following command to decompress **MRS_Services_ClientConfig.tar**: + + **tar -xvf MRS\_Services_ClientConfig.tar** + +#. Run the following command to move the original client to the **/opt/Bigdata/client_bak** directory: + + **mv /opt/Bigdata/client** **/opt/Bigdata/client_bak** + +#. Run the following command to install the client in a new directory. The client path must be **/opt/Bigdata/client**. + + **sh /opt/MRS\_Services_ClientConfig/install.sh /opt/Bigdata/client** + + If the following information is displayed, the client has been successfully installed: + + .. code-block:: + + Components client installation is complete. + +#. Run the following command to modify the user and user group of the **/opt/Bigdata/client** directory: + + **chown omm:wheel /opt/Bigdata/client -R** + +#. Run the following command to configure environment variables: + + **source /opt/Bigdata/client/bigdata_env** + +#. 
If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + Example: **kinit admin** + +#. .. _mrs_01_0089__li6221236418107: + + Run the client command of a component. + + For example, run the following command to query the HDFS directory: + + **hdfs dfs -ls /** + +Fully Updating the Original Client of the Standby Master Node +------------------------------------------------------------- + +#. Repeat :ref:`1 ` to :ref:`3 ` to log in to the standby Master node, and run the following command to switch to user **omm**: + + **sudo su - omm** + +#. Run the following command on the standby master node to copy the downloaded client package from the active master node: + + **scp omm@**\ *master1 nodeIP address*\ **:/tmp/MRS-client/MRS_Services_Client.tar /tmp/MRS-client/** + + .. note:: + + - In this command, **master1** node is the active master node. + - **/tmp/MRS-client/** is an example target directory of the standby master node. + +#. Repeat :ref:`4 ` to :ref:`13 ` to update the client of the standby Master node. diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/index.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/index.rst new file mode 100644 index 0000000..36be6a7 --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/index.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_24183.html + +.. _mrs_01_24183: + +Using the Client of Each Component +================================== + +- :ref:`Using a ClickHouse Client ` +- :ref:`Using a Flink Client ` +- :ref:`Using a Flume Client ` +- :ref:`Using an HBase Client ` +- :ref:`Using an HDFS Client ` +- :ref:`Using a Hive Client ` +- :ref:`Using a Kafka Client ` +- :ref:`Using the Oozie Client ` +- :ref:`Using a Storm Client ` +- :ref:`Using a Yarn Client ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + using_a_clickhouse_client + using_a_flink_client + using_a_flume_client + using_an_hbase_client + using_an_hdfs_client + using_a_hive_client + using_a_kafka_client + using_the_oozie_client + using_a_storm_client + using_a_yarn_client diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_clickhouse_client.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_clickhouse_client.rst new file mode 100644 index 0000000..f28971c --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_clickhouse_client.rst @@ -0,0 +1,130 @@ +:original_name: mrs_01_24184.html + +.. _mrs_01_24184: + +Using a ClickHouse Client +========================= + +ClickHouse is a column-based database oriented to online analysis and processing. It supports SQL query and provides good query performance. The aggregation analysis and query performance based on large and wide tables is excellent, which is one order of magnitude faster than other analytical databases. + +Prerequisites +------------- + +You have installed the client, for example, in the **/opt/hadoopclient** directory. The client directory in the following operations is only an example. Change it to the actual installation directory. Before using the client, download and update the client configuration file, and ensure that the active management node of Manager is available. + +Procedure +--------- + +#. 
Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permission to create ClickHouse tables. If Kerberos authentication is disabled for the current cluster, skip this step. + + a. Run the following command if it is an MRS 3.1.0 cluster: + + **export CLICKHOUSE_SECURITY_ENABLED=true** + + b. **kinit** *Component service user* + + Example: **kinit clickhouseuser** + +#. Run the client command of the ClickHouse component. + + Run the **clickhouse -h** command to view the command help of ClickHouse. + + The command output is as follows: + + .. code-block:: + + Use one of the following commands: + clickhouse local [args] + clickhouse client [args] + clickhouse benchmark [args] + clickhouse server [args] + clickhouse performance-test [args] + clickhouse extract-from-config [args] + clickhouse compressor [args] + clickhouse format [args] + clickhouse copier [args] + clickhouse obfuscator [args] + ... + + For details about how to use the command, see https://clickhouse.tech/docs/en/operations/. + + Run the **clickhouse client** command to connect to the ClickHouse serverif MRS 3.1.0 or later. + + - Command for using SSL to log in to a ClickHouse cluster with Kerberos authentication disabled + + **clickhouse client --host** *IP address of the ClickHouse instance*\ **--user** *Username* **--password** **--port** 9440 **--secure** + + *Enter the user password.* + + - Using SSL for login when Kerberos authentication is enabled for the current cluster: + + You must create a user on Manager because there is no default user. + + After the user authentication is successful, you do not need to carry the **--user** and **--password** parameters when logging in to the client as the authenticated user. + + **clickhouse client --host** *IP address of the ClickHouse instance* **--port** 9440 **--secure** + + The following table describes the parameters of the **clickhouse client** command. + + .. table:: **Table 1** Parameters of the **clickhouse client** command + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+================================================================================================================================================================================================================================================================================================================================================================+ + | --host | Host name of the server. The default value is **localhost**. You can use the host name or IP address of the node where the ClickHouse instance is located. | + | | | + | | .. 
note:: | + | | | + | | You can log in to FusionInsight Manager and choose **Cluster** > **Services** > **ClickHouse** > **Instance** to obtain the service IP address of the ClickHouseServer instance. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --port | Port for connection. | + | | | + | | - If the SSL security connection is used, the default port number is **9440**, the parameter **--secure** must be carried. For details about the port number, search for the **tcp_port_secure** parameter in the ClickHouseServer instance configuration. | + | | - If non-SSL security connection is used, the default port number is **9000**, the parameter **--secure** does not need to be carried. For details about the port number, search for the **tcp_port** parameter in the ClickHouseServer instance configuration. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --user | Username. | + | | | + | | You can create the user on Manager and bind a role to the user. | + | | | + | | - If Kerberos authentication is enabled for the current cluster and the user authentication is successful, you do not need to carry the **--user** and **--password** parameters when logging in to the client as the authenticated user. You must create a user with this name on Manager because there is no default user in the Kerberos cluster scenario. | + | | | + | | - If Kerberos authentication is not enabled for the current cluster, you can specify a user and its password created on Manager when logging in to the client. If the user and password parameters are not carried, user **default** is used for login by default. | + | | | + | | The user in normal mode (Kerberos authentication disabled) is the default user, or you can create an administrator using the open source capability provided by the ClickHouse community. You cannot use the users created on FusionInsight Manager. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --password | Password. The default password is an empty string. This parameter is used together with the **--user** parameter. You can set a password when creating a user on Manager. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --query | Query to process when using non-interactive mode. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --database | Current default database. The default value is **default**, which is the default configuration on the server. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --multiline | If this parameter is specified, multiline queries are allowed. (**Enter** only indicates line feed and does not indicate that the query statement is complete.) | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --multiquery | If this parameter is specified, multiple queries separated with semicolons (;) can be processed. This parameter is valid only in non-interactive mode. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --format | Specified default format used to output the result. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --vertical | If this parameter is specified, the result is output in vertical format by default. In this format, each value is printed on a separate line, which helps to display a wide table. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --time | If this parameter is specified, the query execution time is printed to **stderr** in non-interactive mode. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --stacktrace | If this parameter is specified, stack trace information will be printed when an exception occurs. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --config-file | Name of the configuration file. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --secure | If this parameter is specified, the server will be connected in SSL mode. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --history_file | Path of files that record command history. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --param_ | Query with parameters. Pass values from the client to the server. For details, see https://clickhouse.tech/docs/en/interfaces/cli/#cli-queries-with-parameters. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_flink_client.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_flink_client.rst new file mode 100644 index 0000000..2ade4c5 --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_flink_client.rst @@ -0,0 +1,350 @@ +:original_name: mrs_01_24185.html + +.. _mrs_01_24185: + +Using a Flink Client +==================== + +This section describes how to use Flink to run wordcount jobs. + +Prerequisites +------------- + +- Flink has been installed in an MRS cluster. +- The cluster runs properly and the client has been correctly installed, for example, in the **/opt/hadoopclient** directory. The client directory in the following operations is only an example. Change it to the actual installation directory. + +Using the Flink Client (Versions Earlier Than MRS 3.x) +------------------------------------------------------ + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/hadoopclient** + +#. Run the following command to initialize environment variables: + + **source /opt/hadoopclient/bigdata_env** + +#. If Kerberos authentication is enabled for the cluster, perform the following steps. If not, skip this whole step. + + a. Prepare a user for submitting Flink jobs.. + + b. Log in to Manager and download the authentication credential. + + Log in to Manager of the cluster. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. Choose **System Settings** > **User Management**. In the **Operation** column of the row that contains the added user, choose **More** > **Download Authentication Credential**. + + c. Decompress the downloaded authentication credential package and copy the **user.keytab** file to the client node, for example, to the **/opt/hadoopclient/Flink/flink/conf** directory on the client node. If the client is installed on a node outside the cluster, copy the **krb5.conf** file to the **/etc/** directory on this node. + + d. Configure security authentication by adding the **keytab** path and username in the **/opt/hadoopclient/Flink/flink/conf/flink-conf.yaml** configuration file. + + **security.kerberos.login.keytab:** ** + + **security.kerberos.login.principal:** ** + + Example: + + security.kerberos.login.keytab: /opt/hadoopclient/Flink/flink/conf/user.keytab + + security.kerberos.login.principal: test + + e. Generate the **generate_keystore.sh** script and place it in the **bin** directory of the Flink client. In the **bin** directory of the Flink client, run the following command to perform security hardening. For details, see `Authentication and Encryption `__. Set **password** in the following command to a password for submitting jobs: + + **sh generate_keystore.sh ** + + The script automatically replaces the SSL value in the **/opt/hadoopclient/Flink/flink/conf/flink-conf.yaml** file. For an MRS 2.\ *x* or earlier security cluster, external SSL is disabled by default. 
To enable external SSL, configure the parameter and run the script again. For details, see `Security Hardening `__. + + .. note:: + + - You do not need to manually generate the **generate_keystore.sh** script. + - After authentication and encryption, the generated **flink.keystore**, **flink.truststore**, and **security.cookie** items are automatically filled in the corresponding configuration items in **flink-conf.yaml**. + + f. Configure paths for the client to access the **flink.keystore** and **flink.truststore** files. + + - Absolute path: After the script is executed, the file path of **flink.keystore** and **flink.truststore** is automatically set to the absolute path **/opt/hadoopclient/Flink/flink/conf/** in the **flink-conf.yaml** file. In this case, you need to move the **flink.keystore** and **flink.truststore** files from the **conf** directory to this absolute path on the Flink client and Yarn nodes. + - Relative path: Perform the following steps to set the file path of **flink.keystore** and **flink.truststore** to the relative path and ensure that the directory where the Flink client command is executed can directly access the relative paths. + + #. Create a directory, for example, **ssl**, in **/opt/hadoopclient/Flink/flink/conf/**. + + **cd /opt/hadoopclient/Flink/flink/conf/** + + **mkdir ssl** + + #. Move the **flink.keystore** and **flink.truststore** files to the **/opt/hadoopclient/Flink/flink/conf/ssl/** directory. + + **mv flink.keystore ssl/** + + **mv flink.truststore ssl/** + + #. Change the values of the following parameters to relative paths in the **flink-conf.yaml** file: + + .. code-block:: + + security.ssl.internal.keystore: ssl/flink.keystore + security.ssl.internal.truststore: ssl/flink.truststore + +#. Run a wordcount job. + + .. important:: + + To submit or run jobs on Flink, the user must have the following permissions: + + - If Ranger authentication is enabled, the current user must belong to the **hadoop** group or the user has been granted the **/flink** read and write permissions in Ranger. + - If Ranger authentication is disabled, the current user must belong to the **hadoop** group. + + - Normal cluster (Kerberos authentication disabled) + + - Run the following commands to start a session and submit a job in the session: + + **yarn-session.sh -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Security cluster (Kerberos authentication enabled) + + - If the **flink.keystore** and **flink.truststore** file are stored in the absolute path: + + - Run the following commands to start a session and submit a job in the session: + + **yarn-session.sh -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - If the **flink.keystore** and **flink.truststore** files are stored in the relative path: + + - In the same directory of SSL, run the following commands to start a session and submit jobs in the session. The SSL directory is a relative path. 
For example, if the SSL directory is **opt/hadoopclient/Flink/flink/conf/**, then run the following commands in this directory: + + **yarn-session.sh -t ssl/ -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster -yt ssl/ /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + +#. After the job has been successfully submitted, the following information is displayed on the client: + + + .. figure:: /_static/images/en-us_image_0000001349057797.png + :alt: **Figure 1** Job submitted successfully on Yarn + + **Figure 1** Job submitted successfully on Yarn + + + .. figure:: /_static/images/en-us_image_0000001296217612.png + :alt: **Figure 2** Session started successfully + + **Figure 2** Session started successfully + + + .. figure:: /_static/images/en-us_image_0000001296217608.png + :alt: **Figure 3** Job submitted successfully in the session + + **Figure 3** Job submitted successfully in the session + +#. Go to the native YARN service page, find the application of the job, and click the application name to go to the job details page. For details, see `Viewing Flink Job Information `__. + + - If the job is not completed, click **Tracking URL** to go to the native Flink page and view the job running information. + + - If the job submitted in a session has been completed, you can click **Tracking URL** to log in to the native Flink service page to view job information. + + + .. figure:: /_static/images/en-us_image_0000001349257273.png + :alt: **Figure 4** Application + + **Figure 4** Application + +Using the Flink Client (MRS 3.x or Later) +----------------------------------------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/hadoopclient** + +#. Run the following command to initialize environment variables: + + **source /opt/hadoopclient/bigdata_env** + +#. If Kerberos authentication is enabled for the cluster, perform the following steps. If not, skip this whole step. + + a. Prepare a user for submitting Flink jobs. + + b. Log in to Manager and download the authentication credential. + + Log in to Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. Choose **System** > **Permission** > **Manage User**. On the displayed page, locate the row that contains the added user, click **More** in the **Operation** column, and select **Download authentication credential**. + + c. Decompress the downloaded authentication credential package and copy the **user.keytab** file to the client node, for example, to the **/opt/hadoopclient/Flink/flink/conf** directory on the client node. If the client is installed on a node outside the cluster, copy the **krb5.conf** file to the **/etc/** directory on this node. + + d. Append the service IP address of the node where the client is installed, floating IP address of Manager, and IP address of the master node to the **jobmanager.web.access-control-allow-origin** and **jobmanager.web.allow-access-address** configuration item in the **/opt/hadoopclient/Flink/flink/conf/flink-conf.yaml** file. Use commas (,) to separate IP addresses. + + .. code-block:: + + jobmanager.web.access-control-allow-origin: xx.xx.xxx.xxx,xx.xx.xxx.xxx,xx.xx.xxx.xxx + jobmanager.web.allow-access-address: xx.xx.xxx.xxx,xx.xx.xxx.xxx,xx.xx.xxx.xxx + + .. 
note:: + + - To obtain the service IP address of the node where the client is installed, perform the following operations: + + - Node inside the cluster: + + In the navigation tree of the MRS management console, choose **Clusters > Active Clusters**, select a cluster, and click its name to switch to the cluster details page. + + On the **Nodes** tab page, view the IP address of the node where the client is installed. + + - Node outside the cluster: IP address of the ECS where the client is installed. + + - To obtain the floating IP address of Manager, perform the following operations: + + - In the navigation tree of the MRS management console, choose **Clusters > Active Clusters**, select a cluster, and click its name to switch to the cluster details page. + + On the **Nodes** tab page, view the **Name**. The node that contains **master1** in its name is the Master1 node. The node that contains **master2** in its name is the Master2 node. + + - Log in to the Master2 node remotely, and run the **ifconfig** command. In the command output, **eth0:wsom** indicates the floating IP address of MRS Manager. Record the value of **inet**. If the floating IP address of MRS Manager cannot be queried on the Master2 node, switch to the Master1 node to query and record the floating IP address. If there is only one Master node, query and record the cluster manager IP address of the Master node. + + e. Configure security authentication by adding the **keytab** path and username in the **/opt/hadoopclient/Flink/flink/conf/flink-conf.yaml** configuration file. + + **security.kerberos.login.keytab:** ** + + **security.kerberos.login.principal:** ** + + Example: + + security.kerberos.login.keytab: /opt/hadoopclient/Flink/flink/conf/user.keytab + + security.kerberos.login.principal: test + + f. Generate the **generate_keystore.sh** script and place it in the **bin** directory of the Flink client. In the **bin** directory of the Flink client, run the following command to perform security hardening. For details, see `Authentication and Encryption `__. Set **password** in the following command to a password for submitting jobs: + + **sh generate_keystore.sh ** + + The script automatically replaces the SSL value in the **/opt/hadoopclient/Flink/flink/conf/flink-conf.yaml** file. + + **sh generate_keystore.sh ** + + .. note:: + + After authentication and encryption, the **flink.keystore** and **flink.truststore** files are generated in the **conf** directory on the Flink client and the following configuration items are set to the default values in the **flink-conf.yaml** file: + + - Set **security.ssl.keystore** to the absolute path of the **flink.keystore** file. + - Set **security.ssl.truststore** to the absolute path of the **flink.truststore** file. + + - Set **security.cookie** to a random password automatically generated by the **generate_keystore.sh** script. + - By default, **security.ssl.encrypt.enabled** is set to **false** in the **flink-conf.yaml** file by default. The **generate_keystore.sh** script sets **security.ssl.key-password**, **security.ssl.keystore-password**, and **security.ssl.truststore-password** to the password entered when the **generate_keystore.sh** script is called. + + - For MRS 3.\ *x* or later, if ciphertext is required and **security.ssl.encrypt.enabled** is set to **true** in the **flink-conf.yaml** file, the **generate_keystore.sh** script does not set **security.ssl.key-password**, **security.ssl.keystore-password**, and **security.ssl.truststore-password**. 
To obtain the values, use the Manager plaintext encryption API by running **curl -k -i -u** *Username*\ **:**\ *Password* **-X POST -HContent-type:application/json -d '{"plainText":"**\ *Password*\ **"}' 'https://**\ *x.x.x.x*\ **:28443/web/api/v2/tools/encrypt'**. + + In the preceding command, *Username*\ **:**\ *Password* indicates the user name and password for logging in to the system. The password of **"plainText"** indicates the one used to call the **generate_keystore.sh** script. *x.x.x.x* indicates the floating IP address of Manager. + + g. Configure paths for the client to access the **flink.keystore** and **flink.truststore** files. + + - Absolute path: After the script is executed, the file path of **flink.keystore** and **flink.truststore** is automatically set to the absolute path **/opt/hadoopclient/Flink/flink/conf/** in the **flink-conf.yaml** file. In this case, you need to move the **flink.keystore** and **flink.truststore** files from the **conf** directory to this absolute path on the Flink client and Yarn nodes. + - Relative path: Perform the following steps to set the file path of **flink.keystore** and **flink.truststore** to the relative path and ensure that the directory where the Flink client command is executed can directly access the relative paths. + + #. Create a directory, for example, **ssl**, in **/opt/hadoopclient/Flink/flink/conf/**. + + **cd /opt/hadoopclient/Flink/flink/conf/** + + **mkdir ssl** + + #. Move the **flink.keystore** and **flink.truststore** files to the **/opt/hadoopclient/Flink/flink/conf/ssl/** directory. + + **mv flink.keystore ssl/** + + **mv flink.truststore ssl/** + + #. Change the values of the following parameters to relative paths in the **flink-conf.yaml** file: + + .. code-block:: + + security.ssl.keystore: ssl/flink.keystore + security.ssl.truststore: ssl/flink.truststore + +#. Run a wordcount job. + + .. important:: + + To submit or run jobs on Flink, the user must have the following permissions: + + - If Ranger authentication is enabled, the current user must belong to the **hadoop** group or the user has been granted the **/flink** read and write permissions in Ranger. + - If Ranger authentication is disabled, the current user must belong to the **hadoop** group. + + - Normal cluster (Kerberos authentication disabled) + + - Run the following commands to start a session and submit a job in the session: + + **yarn-session.sh -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Security cluster (Kerberos authentication enabled) + + - If the **flink.keystore** and **flink.truststore** files are stored in the absolute path: + + - Run the following commands to start a session and submit a job in the session: + + **yarn-session.sh -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - If the **flink.keystore** and **flink.truststore** file are stored in the relative path: + + - In the same directory of SSL, run the following commands to start a session and submit jobs in the session. The SSL directory is a relative path. 
For example, if the SSL directory is **opt/hadoopclient/Flink/flink/conf/**, then run the following commands in this directory: + + **yarn-session.sh -t ssl/ -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster -yt ssl/ /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + +#. After the job has been successfully submitted, the following information is displayed on the client: + + + .. figure:: /_static/images/en-us_image_0000001349257277.png + :alt: **Figure 5** Job submitted successfully on Yarn + + **Figure 5** Job submitted successfully on Yarn + + + .. figure:: /_static/images/en-us_image_0000001296057980.png + :alt: **Figure 6** Session started successfully + + **Figure 6** Session started successfully + + + .. figure:: /_static/images/en-us_image_0000001349137689.png + :alt: **Figure 7** Job submitted successfully in the session + + **Figure 7** Job submitted successfully in the session + +#. Go to the native YARN service page, find the application of the job, and click the application name to go to the job details page. For details, see `Viewing Flink Job Information `__. + + - If the job is not completed, click **Tracking URL** to go to the native Flink page and view the job running information. + + - If the job submitted in a session has been completed, you can click **Tracking URL** to log in to the native Flink service page to view job information. + + + .. figure:: /_static/images/en-us_image_0000001296057976.png + :alt: **Figure 8** Application + + **Figure 8** Application diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_flume_client.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_flume_client.rst new file mode 100644 index 0000000..5cc686f --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_flume_client.rst @@ -0,0 +1,310 @@ +:original_name: mrs_01_24186.html + +.. _mrs_01_24186: + +Using a Flume Client +==================== + +Scenario +-------- + +You can use Flume to import collected log information to Kafka. + +Prerequisites +------------- + +- A streaming cluster that contains components such as Flume and Kafka and has Kerberos authentication enabled has been created. +- The streaming cluster can properly communicate with the node where logs are generated. + +Using the Flume Client (Versions Earlier Than MRS 3.x) +------------------------------------------------------ + +.. note:: + + You do not need to perform :ref:`2 ` to :ref:`6 ` for a normal cluster. + +#. Install the Flume client. + + Install the Flume client in a directory, for example, **/opt/Flumeclient**, on the node where logs are generated by referring to :ref:`Installing the Flume Client on Clusters of Versions Earlier Than MRS 3.x `. The Flume client installation directories in the following steps are only examples. Change them to the actual installation directories. + +#. .. _mrs_01_24186__en-us_topic_0264266686_l78730912572649fd8edfda3920dc20cf: + + Copy the configuration file of the authentication server from the Master1 node to the *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf** directory on the node where the Flume client is installed. 
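+
+   For example, assuming the Flume client is installed in **/opt/FlumeClient** and using placeholders for the Master1 node IP address and the source path (the actual **kdc.conf** locations for each version are listed below), the file can be copied with a standard **scp** command similar to the following:
+
+   .. code-block::
+
+      scp root@<Master1 node IP address>:<full path of kdc.conf> /opt/FlumeClient/fusioninsight-flume-<Flume component version number>/conf/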
+ + For versions earlier than MRS 1.9.2, **${BIGDATA_HOME}/FusionInsight/etc/1\_**\ *X*\ **\_KerberosClient/kdc.conf** is used as the full file path. + + For versions earlier than MRS 3.\ *x*, **${BIGDATA_HOME}/MRS_Current/1\_**\ *X*\ **\_KerberosClient/etc/kdc.conf** is used as the full file path. + + In the preceding paths, **X** indicates a random number. Change it based on the site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. Check the service IP address of any node where the Flume role is deployed. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager. Choose **Cluster** > **Services** > **Flume** > **Instance**. Query **Service IP Address** of any node on which the Flume role is deployed. + - For MRS 1.9.2 to versions earlier than 3.x, click the cluster name on the MRS console and choose *Name of the desired cluster* > **Components** > **Flume** > **Instances** to view **Business IP Address** of any node where the Flume role is deployed. + +#. .. _mrs_01_24186__en-us_topic_0264266686_l762ab29694a642ac8ae1a0609cb97c9b: + + Copy the user authentication file from this node to the *Flume client installation directory*\ **/fusioninsight-flume-Flume component version number/conf** directory on the Flume client node. + + For versions earlier than MRS 1.9.2, **${BIGDATA_HOME}/FusionInsight/FusionInsight-Flume-**\ *Flume component version number*\ **/flume/conf/flume.keytab** is used as the full file path. + + For versions earlier than 3.\ *x*, **${BIGDATA_HOME}/MRS\_**\ *XXX*\ **/install/FusionInsight-Flume-**\ *Flume component version number*\ **/flume/conf/flume.keytab** is used as the full file path. + + In the preceding paths, **XXX** indicates the product version number. Change it based on the site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. Copy the **jaas.conf** file from this node to the **conf** directory on the Flume client node. + + For versions earlier than MRS 1.9.2, **${BIGDATA_HOME}/FusionInsight/etc/1\_**\ *X*\ **\_Flume/jaas.conf** is used as the full file path. + + For versions earlier than MRS 3.\ *x*, **${BIGDATA_HOME}/MRS_Current/1\_**\ *X*\ **\_Flume/etc/jaas.conf** is used as the full file path. + + In the preceding path, **X** indicates a random number. Change it based on the site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. .. _mrs_01_24186__en-us_topic_0264266686_lfde322e0f3de4ccb88b4e195e65f9993: + + Log in to the Flume client node and go to the client installation directory. Run the following command to modify the file: + + **vi conf/jaas.conf** + + Change the full path of the user authentication file defined by **keyTab** to the **Flume client installation directory/fusioninsight-flume-*Flume component version number*/conf** saved in :ref:`4 `, and save the modification and exit. + +#. Run the following command to modify the **flume-env.sh** configuration file of the Flume client: + + **vi** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf/flume-env.sh** + + Add the following information after **-XX:+UseCMSCompactAtFullCollection**: + + .. 
code-block:: + + -Djava.security.krb5.conf=Flume client installation directory/fusioninsight-flume-1.9.0/conf/kdc.conf -Djava.security.auth.login.config=Flume client installation directory/fusioninsight-flume-1.9.0/conf/jaas.conf -Dzookeeper.request.timeout=120000 + + Example: **"-XX:+UseCMSCompactAtFullCollection -Djava.security.krb5.conf=/opt/FlumeClient**/**fusioninsight-flume-**\ *Flume component version number*\ **/conf/kdc.conf -Djava.security.auth.login.config=/opt/FlumeClient**/**fusioninsight-flume-**\ *Flume component version number*\ **/conf/jaas.conf -Dzookeeper.request.timeout=120000"** + + Change *Flume client installation directory* to the actual installation directory. Then save and exit. + +#. Run the following command to restart the Flume client: + + **cd** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/bin** + + **./flume-manage.sh restart** + + Example: + + **cd /opt/FlumeClient/fusioninsight-flume-**\ *Flume component version number*\ **/bin** + + **./flume-manage.sh restart** + +#. Run the following command to configure and save jobs in the Flume client configuration file **properties.properties** based on service requirements. + + **vi** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf/properties.properties** + + The following uses SpoolDir Source+File Channel+Kafka Sink as an example: + + .. code-block:: + + ######################################################################################### + client.sources = static_log_source + client.channels = static_log_channel + client.sinks = kafka_sink + ######################################################################################### + #LOG_TO_HDFS_ONLINE_1 + + client.sources.static_log_source.type = spooldir + client.sources.static_log_source.spoolDir = Monitoring directory + client.sources.static_log_source.fileSuffix = .COMPLETED + client.sources.static_log_source.ignorePattern = ^$ + client.sources.static_log_source.trackerDir = Metadata storage path during transmission + client.sources.static_log_source.maxBlobLength = 16384 + client.sources.static_log_source.batchSize = 51200 + client.sources.static_log_source.inputCharset = UTF-8 + client.sources.static_log_source.deserializer = LINE + client.sources.static_log_source.selector.type = replicating + client.sources.static_log_source.fileHeaderKey = file + client.sources.static_log_source.fileHeader = false + client.sources.static_log_source.basenameHeader = true + client.sources.static_log_source.basenameHeaderKey = basename + client.sources.static_log_source.deletePolicy = never + + client.channels.static_log_channel.type = file + client.channels.static_log_channel.dataDirs = Data cache path. Multiple paths, separated by commas (,), can be configured to improve performance. 
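+      # Example values for the file channel paths (illustrative only; use directories that exist and are writable on the Flume client node):
+      # client.channels.static_log_channel.dataDirs = /srv/BigData/flume/data1,/srv/BigData/flume/data2
+      # client.channels.static_log_channel.checkpointDir = /srv/BigData/flume/checkpoint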
+ client.channels.static_log_channel.checkpointDir = Checkpoint storage path + client.channels.static_log_channel.maxFileSize = 2146435071 + client.channels.static_log_channel.capacity = 1000000 + client.channels.static_log_channel.transactionCapacity = 612000 + client.channels.static_log_channel.minimumRequiredSpace = 524288000 + + client.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink + client.sinks.kafka_sink.kafka.topic = Topic to which data is written, for example, flume_test + client.sinks.kafka_sink.kafka.bootstrap.servers = XXX.XXX.XXX.XXX:Kafka port number,XXX.XXX.XXX.XXX:Kafka port number,XXX.XXX.XXX.XXX:Kafka port number + client.sinks.kafka_sink.flumeBatchSize = 1000 + client.sinks.kafka_sink.kafka.producer.type = sync + client.sinks.kafka_sink.kafka.security.protocol = SASL_PLAINTEXT + client.sinks.kafka_sink.kafka.kerberos.domain.name = Kafka domain name. This parameter is mandatory for a security cluster, for example, hadoop.xxx.com. + client.sinks.kafka_sink.requiredAcks = 0 + + client.sources.static_log_source.channels = static_log_channel + client.sinks.kafka_sink.channel = static_log_channel + + .. note:: + + - **client.sinks.kafka_sink.kafka.topic**: Topic to which data is written. If the topic does not exist in Kafka, it is automatically created by default. + + - **client.sinks.kafka_sink.kafka.bootstrap.servers**: List of Kafka Brokers, which are separated by commas (,). By default, the port is **21007** for a security cluster and **9092** for a normal cluster. + + - **client.sinks.kafka_sink.kafka.security.protocol**: The value is **SASL_PLAINTEXT** for a security cluster and **PLAINTEXT** for a normal cluster. + + - **client.sinks.kafka_sink.kafka.kerberos.domain.name**: + + You do not need to set this parameter for a normal cluster. For a security cluster, the value of this parameter is the value of **kerberos.domain.name** in the Kafka cluster. + + For versions earlier than MRS 1.9.2, obtain the value by checking **${BIGDATA_HOME}/FusionInsight/etc/1\_**\ *X*\ **\_Broker/server.properties** on the node where the broker instance resides. + + Obtain the value for versions earlier than MRS 3.\ *x* by checking **${BIGDATA_HOME}/MRS_Current/1\_**\ *X*\ **\_Broker/etc/server.properties** on the node where the broker instance resides. + + In the preceding paths, **X** indicates a random number. Change it based on site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. After the parameters are set and saved, the Flume client automatically loads the content configured in **properties.properties**. When new log files are generated by spoolDir, the files are sent to Kafka producers and can be consumed by Kafka consumers. + +Using the Flume Client (MRS 3.x or Later) +----------------------------------------- + +.. note:: + + You do not need to perform :ref:`2 ` to :ref:`6 ` for a normal cluster. + +#. Install the Flume client. + + Install the Flume client in a directory, for example, **/opt/Flumeclient**, on the node where logs are generated by referring to :ref:`Installing the Flume Client on MRS 3.x or Later Clusters `. The Flume client installation directories in the following steps are only examples. Change them to the actual installation directories. + +#. .. 
_mrs_01_24186__en-us_topic_0264266686_li81278495417: + + Copy the configuration file of the authentication server from the Master1 node to the *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf** directory on the node where the Flume client is installed. + + The full file path is **${BIGDATA_HOME}/FusionInsight\_**\ **BASE\_**\ *XXX*\ **/1\_**\ *X*\ **\_KerberosClient/etc/kdc.conf**. In the preceding path, **XXX** indicates the product version number. **X** indicates a random number. Replace them based on site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. Check the service IP address of any node where the Flume role is deployed. + + Log in to FusionInsight Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. Choose **Cluster > Services > Flume > Instance**. Check the service IP address of any node where the Flume role is deployed. + +#. .. _mrs_01_24186__en-us_topic_0264266686_li4130849748: + + Copy the user authentication file from this node to the *Flume client installation directory*\ **/fusioninsight-flume-Flume component version number/conf** directory on the Flume client node. + + The full file path is **${BIGDATA_HOME}/FusionInsight_Porter\_**\ *XXX*\ **/install/FusionInsight-Flume-**\ *Flume component version number*\ **/flume/conf/flume.keytab**. + + In the preceding paths, **XXX** indicates the product version number. Change it based on the site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. Copy the **jaas.conf** file from this node to the **conf** directory on the Flume client node. + + The full file path is **${BIGDATA_HOME}/FusionInsight_Current/1\_**\ *X*\ **\_Flume/etc/jaas.conf**. + + In the preceding path, **X** indicates a random number. Change it based on the site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. .. _mrs_01_24186__en-us_topic_0264266686_li31329494415: + + Log in to the Flume client node and go to the client installation directory. Run the following command to modify the file: + + **vi conf/jaas.conf** + + Change the full path of the user authentication file defined by **keyTab** to the **Flume client installation directory/fusioninsight-flume-*Flume component version number*/conf** saved in :ref:`4 `, and save the modification and exit. + +#. Run the following command to modify the **flume-env.sh** configuration file of the Flume client: + + **vi** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf/flume-env.sh** + + Add the following information after **-XX:+UseCMSCompactAtFullCollection**: + + .. code-block:: + + -Djava.security.krb5.conf=Flume client installation directory/fusioninsight-flume-1.9.0/conf/kdc.conf -Djava.security.auth.login.config=Flume client installation directory/fusioninsight-flume-1.9.0/conf/jaas.conf -Dzookeeper.request.timeout=120000 + + Example: **"-XX:+UseCMSCompactAtFullCollection -Djava.security.krb5.conf=/opt/FlumeClient**/**fusioninsight-flume-**\ *Flume component version number*\ **/conf/kdc.conf -Djava.security.auth.login.config=/opt/FlumeClient**/**fusioninsight-flume-**\ *Flume component version number*\ **/conf/jaas.conf -Dzookeeper.request.timeout=120000"** + + Change *Flume client installation directory* to the actual installation directory. Then save and exit. + +#. 
Run the following command to restart the Flume client: + + **cd** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/bin** + + **./flume-manage.sh restart** + + Example: + + **cd /opt/FlumeClient/fusioninsight-flume-**\ *Flume component version number*\ **/bin** + + **./flume-manage.sh restart** + +#. Configure jobs based on actual service scenarios. + + - Some parameters, for MRS 3.\ *x* or later, can be configured on Manager. + + - Set the parameters in the **properties.properties** file. The following uses SpoolDir Source+File Channel+Kafka Sink as an example. + + Run the following command on the node where the Flume client is installed. Configure and save jobs in the Flume client configuration file **properties.properties** based on actual service requirements. + + **vi** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf/properties.properties** + + .. code-block:: + + ######################################################################################### + client.sources = static_log_source + client.channels = static_log_channel + client.sinks = kafka_sink + ######################################################################################### + #LOG_TO_HDFS_ONLINE_1 + + client.sources.static_log_source.type = spooldir + client.sources.static_log_source.spoolDir = Monitoring directory + client.sources.static_log_source.fileSuffix = .COMPLETED + client.sources.static_log_source.ignorePattern = ^$ + client.sources.static_log_source.trackerDir = Metadata storage path during transmission + client.sources.static_log_source.maxBlobLength = 16384 + client.sources.static_log_source.batchSize = 51200 + client.sources.static_log_source.inputCharset = UTF-8 + client.sources.static_log_source.deserializer = LINE + client.sources.static_log_source.selector.type = replicating + client.sources.static_log_source.fileHeaderKey = file + client.sources.static_log_source.fileHeader = false + client.sources.static_log_source.basenameHeader = true + client.sources.static_log_source.basenameHeaderKey = basename + client.sources.static_log_source.deletePolicy = never + + client.channels.static_log_channel.type = file + client.channels.static_log_channel.dataDirs = Data cache path. Multiple paths, separated by commas (,), can be configured to improve performance. + client.channels.static_log_channel.checkpointDir = Checkpoint storage path + client.channels.static_log_channel.maxFileSize = 2146435071 + client.channels.static_log_channel.capacity = 1000000 + client.channels.static_log_channel.transactionCapacity = 612000 + client.channels.static_log_channel.minimumRequiredSpace = 524288000 + + client.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink + client.sinks.kafka_sink.kafka.topic = Topic to which data is written, for example, flume_test + client.sinks.kafka_sink.kafka.bootstrap.servers = XXX.XXX.XXX.XXX:Kafka port number,XXX.XXX.XXX.XXX:Kafka port number,XXX.XXX.XXX.XXX:Kafka port number + client.sinks.kafka_sink.flumeBatchSize = 1000 + client.sinks.kafka_sink.kafka.producer.type = sync + client.sinks.kafka_sink.kafka.security.protocol = SASL_PLAINTEXT + client.sinks.kafka_sink.kafka.kerberos.domain.name = Kafka domain name. This parameter is mandatory for a security cluster, for example, hadoop.xxx.com. + client.sinks.kafka_sink.requiredAcks = 0 + + client.sources.static_log_source.channels = static_log_channel + client.sinks.kafka_sink.channel = static_log_channel + + .. 
note:: + + - **client.sinks.kafka_sink.kafka.topic**: Topic to which data is written. If the topic does not exist in Kafka, it is automatically created by default. + + - **client.sinks.kafka_sink.kafka.bootstrap.servers**: List of Kafka Brokers, which are separated by commas (,). By default, the port is **21007** for a security cluster and **9092** for a normal cluster. + + - **client.sinks.kafka_sink.kafka.security.protocol**: The value is **SASL_PLAINTEXT** for a security cluster and **PLAINTEXT** for a normal cluster. + + - **client.sinks.kafka_sink.kafka.kerberos.domain.name**: + + You do not need to set this parameter for a normal cluster. For a security cluster, the value of this parameter is the value of **kerberos.domain.name** in the Kafka cluster. + + For versions earlier than MRS 1.9.2, obtain the value by checking **${BIGDATA_HOME}/FusionInsight/etc/1\_**\ *X*\ **\_Broker/server.properties** on the node where the broker instance resides. + + Obtain the value for versions earlier than MRS 3.\ *x* by checking **${BIGDATA_HOME}/MRS_Current/1\_**\ *X*\ **\_Broker/etc/server.properties** on the node where the broker instance resides. + + In the preceding paths, **X** indicates a random number. Change it based on site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. After the parameters are set and saved, the Flume client automatically loads the content configured in **properties.properties**. When new log files are generated by spoolDir, the files are sent to Kafka producers and can be consumed by Kafka consumers. diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_hive_client.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_hive_client.rst new file mode 100644 index 0000000..e2fbc83 --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_hive_client.rst @@ -0,0 +1,168 @@ +:original_name: mrs_01_24189.html + +.. _mrs_01_24189: + +Using a Hive Client +=================== + +Scenario +-------- + +This section guides users to use a Hive client in an O&M or service scenario. + +Prerequisites +------------- + +- The client has been installed. For example, the client is installed in the **/opt/hadoopclient** directory. The client directory in the following operations is only an example. Change it to the actual installation directory. +- Service component users are created by the administrator as required. In security mode, machine-machine users need to download the keytab file. A human-machine user must change the password upon the first login. + +Using the Hive Client (Versions Earlier Than MRS 3.x) +----------------------------------------------------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd** **/opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Log in to the Hive client based on the cluster authentication mode. + + - In security mode, run the following command to complete user authentication and log in to the Hive client: + + **kinit** *Component service user* + + **beeline** + + - In common mode, run the following command to log in to the Hive client. If no component service user is specified, the current OS user is used to log in to the Hive client. + + **beeline -n** *component service user* + + .. 
note:: + + After a beeline connection is established, you can compile and submit HQL statements to execute related tasks. To run the Catalog client command, you need to run the **!q** command first to exit the beeline environment. + +#. Run the following command to execute the HCatalog client command: + + **hcat -e** *"cmd"* + + *cmd* must be a Hive DDL statement, for example, **hcat -e "show tables"**. + + .. note:: + + - To use the HCatalog client, choose **More** > **Download Client** on the service page to download the clients of all services. This restriction does not apply to the beeline client. + + - Due to permission model incompatibility, tables created using the HCatalog client cannot be accessed on the HiveServer client. However, the tables can be accessed on the WebHCat client. + + - If you use the HCatalog client in Normal mode, the system performs DDL commands using the current user who has logged in to the operating system. + + - Exit the beeline client by running the **!q** command instead of by pressing **Ctrl + c**. Otherwise, the temporary files generated by the connection cannot be deleted and a large number of junk files will be generated as a result. + + - If multiple statements need to be entered during the use of beeline clients, separate the statements from each other using semicolons (**;**) and set the value of **entireLineAsCommand** to **false**. + + Setting method: If beeline has not been started, run the **beeline --entireLineAsCommand=false** command. If the beeline has been started, run the **!set entireLineAsCommand false** command. + + After the setting, if a statement contains semicolons (**;**) that do not indicate the end of the statement, escape characters must be added, for example, **select concat_ws('\\;', collect_set(col1)) from tbl**. + +Using the Hive Client (MRS 3.x or Later) +---------------------------------------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd** **/opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. MRS 3.\ *X* supports multiple Hive instances. If you use the client to connect to a specific Hive instance in a scenario when multiple Hive instances are installed, run the following command to load the environment variables of the instance. Otherwise, skip this step. For example, load the environment variables of the Hive2 instance. + + **source Hive2/component_env** + +#. Log in to the Hive client based on the cluster authentication mode. + + - In security mode, run the following command to complete user authentication and log in to the Hive client: + + **kinit** *Component service user* + + **beeline** + + - In common mode, run the following command to log in to the Hive client. If no component service user is specified, the current OS user is used to log in to the Hive client. + + **beeline -n** *component service user* + +#. Run the following command to execute the HCatalog client command: + + **hcat -e** *"cmd"* + + *cmd* must be a Hive DDL statement, for example, **hcat -e "show tables"**. + + .. note:: + + - To use the HCatalog client, choose **More** > **Download Client** on the service page to download the clients of all services. This restriction does not apply to the beeline client. + + - Due to permission model incompatibility, tables created using the HCatalog client cannot be accessed on the HiveServer client. 
However, the tables can be accessed on the WebHCat client. + + - If you use the HCatalog client in Normal mode, the system performs DDL commands using the current user who has logged in to the operating system. + + - Exit the beeline client by running the **!q** command instead of by pressing **Ctrl + C**. Otherwise, the temporary files generated by the connection cannot be deleted and a large number of junk files will be generated as a result. + + - If multiple statements need to be entered during the use of beeline clients, separate the statements from each other using semicolons (**;**) and set the value of **entireLineAsCommand** to **false**. + + Setting method: If beeline has not been started, run the **beeline --entireLineAsCommand=false** command. If the beeline has been started, run the **!set entireLineAsCommand false** command. + + After the setting, if a statement contains semicolons (**;**) that do not indicate the end of the statement, escape characters must be added, for example, **select concat_ws('\\;', collect_set(col1)) from tbl**. + +Common Hive Client Commands +--------------------------- + +The following table lists common Hive Beeline commands. + +For more commands, see https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-BeelineCommands. + +.. table:: **Table 1** Common Hive Beeline commands + + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Command | Description | + +===========================================================================================================+==============================================================================================================================================================================================================================+ + | set = | Sets the value of a specific configuration variable (key). | + | | | + | | .. note:: | + | | | + | | If the variable name is incorrectly spelled, the Beeline does not display an error. | + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | set | Prints the list of configuration variables overwritten by users or Hive. | + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | set -v | Prints all configuration variables of Hadoop and Hive. 
| + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | add FILE[S] *add JAR[S] *add ARCHIVE[S] \* | Adds one or more files, JAR files, or ARCHIVE files to the resource list of the distributed cache. | + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | add FILE[S] \* | Adds one or more files, JAR files, or ARCHIVE files to the resource list of the distributed cache using the lvy URL in the **ivy://goup:module:version?query_string** format. | + | | | + | add JAR[S] \* | | + | | | + | add ARCHIVE[S] \* | | + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | list FILE[S]list JAR[S]list ARCHIVE[S] | Lists the resources that have been added to the distributed cache. | + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | list FILE[S] *list JAR[S] *list ARCHIVE[S] \* | Checks whether given resources have been added to the distributed cache. | + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | delete FILE[S] *delete JAR[S] *delete ARCHIVE[S] \* | Deletes resources from the distributed cache. | + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | delete FILE[S] \* | Delete the resource added using **** from the distributed cache. | + | | | + | delete JAR[S] \* | | + | | | + | delete ARCHIVE[S] \* | | + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | reload | Enable HiveServer2 to discover the change of the JAR file **hive.reloadable.aux.jars.path** in the specified path. (You do not need to restart HiveServer2.) Change actions include adding, deleting, or updating JAR files. 
| + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | dfs | Runs the **dfs** command. | + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | Executes the Hive query and prints the result to the standard output. | + +-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_kafka_client.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_kafka_client.rst new file mode 100644 index 0000000..8f1bf5b --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_kafka_client.rst @@ -0,0 +1,115 @@ +:original_name: mrs_01_24191.html + +.. _mrs_01_24191: + +Using a Kafka Client +==================== + +Scenario +-------- + +You can create, query, and delete topics on a cluster client. + +Prerequisites +------------- + +The client has been installed. For example, the client is installed in the **/opt/hadoopclient** directory. The client directory in the following operations is only an example. Change it to the actual installation directory. + +Using the Kafka Client (Versions Earlier Than MRS 3.x) +------------------------------------------------------ + +#. Access the ZooKeeper instance page. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager and choose **Services** > **ZooKeeper** > **Instance**. + - For MRS 1.9.2 or later to versions earlier than 3.x, click the cluster name on the MRS console and choose **Components** > **ZooKeeper** > **Instances**. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +#. View the IP addresses of the ZooKeeper role instance. + + Record any IP address of the ZooKeeper instance. + +#. Log in to the node where the client is installed. + +#. Run the following command to switch to the client directory, for example, **/opt/hadoopclient/Kafka/kafka/bin**. + + **cd /opt/hadoopclient/Kafka/kafka/bin** + +#. Run the following command to configure environment variables: + + **source /opt/hadoopclient/bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *Kafka user* + +#. .. _mrs_01_24191__en-us_topic_0264266588_li1147589014556: + + Create a topic. 
+ + **sh kafka-topics.sh --create --topic** *Topic name* **--partitions** *Number of partitions occupied by the topic* **--replication-factor** *Number of replicas of the topic* **--zookeeper** *IP address of the node where the ZooKeeper instance resides\ *\ **:**\ *\ clientPort*\ **/kafka** + + Example: **sh kafka-topics.sh --create --topic** **TopicTest** **--partitions** **3** **--replication-factor** **3** **--zookeeper** **10.10.10.100:2181/kafka** + +#. Run the following command to view the topic information in the cluster: + + **sh kafka-topics.sh --list --zookeeper** *IP address of the node where the ZooKeeper instance resides\ *\ **:**\ *\ clientPort*\ **/kafka** + + Example: **sh kafka-topics.sh --list --zookeeper** **10.10.10.100:2181/kafka** + +#. Delete the topic created in :ref:`7 `. + + **sh kafka-topics.sh --delete --topic** *Topic name* **--zookeeper** *IP address of the node where the ZooKeeper instance resides*:*clientPort*\ **/kafka** + + Example: **sh kafka-topics.sh --delete --topic** **TopicTest** **--zookeeper** **10.10.10.100:2181/kafka** + + Type **y** and press **Enter**. + +Using the Kafka Client (MRS 3.x or Later) +----------------------------------------- + +#. Access the ZooKeeper instance page. + + Log in to FusionInsight Manager. For details, see :ref:`Accessing MRS Manager MRS 2.1.0 or Earlier) `. Choose **Cluster** > *Name of the desired cluster* > **Services** > **ZooKeeper** > **Instance**. + +#. View the IP addresses of the ZooKeeper role instance. + + Record any IP address of the ZooKeeper instance. + +#. Log in to the node where the client is installed. + +#. Run the following command to switch to the client directory, for example, **/opt/hadoopclient/Kafka/kafka/bin**. + + **cd /opt/hadoopclient/Kafka/kafka/bin** + +#. Run the following command to configure environment variables: + + **source /opt/hadoopclient/bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *Kafka user* + +#. Log in to FusionInsight Manager, choose **Cluster** > **Name of the desired cluster** > **Services** > **ZooKeeper**, and click the **Configurations** tab and then **All Configurations**. On the displayed page, search for the **clientPort** parameter and record its value. + +#. .. _mrs_01_24191__en-us_topic_0264266588_li4808125415465: + + Create a topic. + + **sh kafka-topics.sh --create --topic** *Topic name* **--partitions** *Number of partitions occupied by the topic* **--replication-factor** *Number of replicas of the topic* **--zookeeper** *IP address of the node where the ZooKeeper instance resides\ *\ **:**\ *\ clientPort*\ **/kafka** + + Example: **sh kafka-topics.sh --create --topic** **TopicTest** **--partitions** **3** **--replication-factor** **3** **--zookeeper** **10.10.10.100:2181/kafka** + +#. Run the following command to view the topic information in the cluster: + + **sh kafka-topics.sh --list --zookeeper** *IP address of the node where the ZooKeeper instance resides\ *\ **:**\ *\ clientPort*\ **/kafka** + + Example: **sh kafka-topics.sh --list --zookeeper** **10.10.10.100:2181/kafka** + +#. Delete the topic created in :ref:`8 `. 
+ + **sh kafka-topics.sh --delete --topic** *Topic name* **--zookeeper** *IP address of the node where the ZooKeeper instance resides*:*clientPort*\ **/kafka** + + Example: **sh kafka-topics.sh --delete --topic** **TopicTest** **--zookeeper** **10.10.10.100:2181/kafka** diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_storm_client.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_storm_client.rst new file mode 100644 index 0000000..0202892 --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_storm_client.rst @@ -0,0 +1,51 @@ +:original_name: mrs_01_24194.html + +.. _mrs_01_24194: + +Using a Storm Client +==================== + +Scenario +-------- + +This section describes how to use the Storm client in an O&M scenario or service scenario. + +Prerequisites +------------- + +- You have installed the client. For example, the installation directory is **/opt/hadoopclient**. +- Service component users are created by the administrator as required. In security mode, machine-machine users have downloaded the keytab file. A human-machine user must change the password upon the first login. (Not involved in normal mode) + +Procedure +--------- + +#. Prepare the client based on service requirements. Log in to the node where the client is installed. + + Log in to the node where the client is installed. For details, see :ref:`Installing a Client `. + +#. Run the following command to go to the client installation directory: + + **cd /opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If multiple Storm instances are installed, run the following command to load the environment variables of a specific instance when running the Storm command to submit the topology. Otherwise, skip this step. The following command uses the instance Storm-2 as an example. + + **source Storm-2/component_env** + +#. Run the following command to perform user authentication (skip this step in normal mode): + + **kinit** *Component service user* + +#. Run the following command to perform operations on the client: + + For example, run the following command: + + - **cql** + - **storm** + + .. note:: + + A Storm client cannot be connected to secure and non-secure ZooKeepers at the same time. diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_yarn_client.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_yarn_client.rst new file mode 100644 index 0000000..a3d56c6 --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_yarn_client.rst @@ -0,0 +1,74 @@ +:original_name: mrs_01_24196.html + +.. _mrs_01_24196: + +Using a Yarn Client +=================== + +Scenario +-------- + +This section guides users to use a Yarn client in an O&M or service scenario. + +Prerequisites +------------- + +- The client has been installed. + + For example, the installation directory is **/opt/hadoopclient**. The client directory in the following operations is only an example. Change it to the actual installation directory. + +- Service component users are created by the administrator as required. In security mode, machine-machine users need to download the keytab file. A human-machine user must change the password upon the first login. In common mode, you do not need to download the keytab file or change the password. + +Using the Yarn Client +--------------------- + +#. 
diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_yarn_client.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_yarn_client.rst new file mode 100644 index 0000000..a3d56c6 --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_a_yarn_client.rst @@ -0,0 +1,74 @@ +:original_name: mrs_01_24196.html + +.. _mrs_01_24196: + +Using a Yarn Client +=================== + +Scenario +-------- + +This section describes how to use a Yarn client in an O&M or service scenario. + +Prerequisites +------------- + +- The client has been installed. + + For example, the installation directory is **/opt/hadoopclient**. The client directory in the following operations is only an example. Change it to the actual installation directory. + +- Service component users are created by the administrator as required. In security mode, machine-machine users need to download the keytab file. A human-machine user must change the password upon the first login. In common mode, you do not need to download the keytab file or change the password. + +Using the Yarn Client +--------------------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If the cluster is in security mode, run the following command to authenticate the user. In normal mode, user authentication is not required. + + **kinit** *Component service user* + +#. Run the Yarn command. The following provides an example: + + **yarn application -list** + +Client-related FAQs +------------------- + +#. What Do I Do When the Yarn Client Exits Abnormally and Error Message "java.lang.OutOfMemoryError" Is Displayed After the Yarn Client Command Is Run? + + This problem occurs because the memory required for running the Yarn client exceeds the upper limit (128 MB by default) set on the Yarn client. For clusters of MRS 3.x or later: You can modify **CLIENT_GC_OPTS** in *<Client installation path>*\ **/HDFS/component_env** to change the memory upper limit of the Yarn client. For example, if you want to set the maximum memory to 1 GB, run the following command: + + .. code-block:: + + export CLIENT_GC_OPTS="-Xmx1G" + + For clusters earlier than MRS 3.x: You can modify **GC_OPTS_YARN** in *<Client installation path>*\ **/HDFS/component_env** to change the memory upper limit of the Yarn client. For example, if you want to set the maximum memory to 1 GB, run the following command: + + .. code-block:: + + export GC_OPTS_YARN="-Xmx1G" + + After the modification, run the following command to make the modification take effect: + + **source** *<Client installation path>*\ **/bigdata_env** + +#. How Can I Set the Log Level When the Yarn Client Is Running? + + By default, the logs generated during the running of the Yarn client are printed to the console. The default log level is INFO. To enable the DEBUG log level for fault locating, run the following command to export an environment variable: + + **export YARN_ROOT_LOGGER=DEBUG,console** + + Then run the Yarn Shell command to print DEBUG logs. + + If you want to print INFO logs again, run the following command: + + **export YARN_ROOT_LOGGER=INFO,console** diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_an_hbase_client.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_an_hbase_client.rst new file mode 100644 index 0000000..fcdacfb --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_an_hbase_client.rst @@ -0,0 +1,103 @@ +:original_name: mrs_01_24187.html + +.. _mrs_01_24187: + +Using an HBase Client +===================== + +Scenario +-------- + +This section describes how to use the HBase client in an O&M scenario or a service scenario. + +Prerequisites +------------- + +- The client has been installed. For example, the installation directory is **/opt/hadoopclient**. The client directory in the following operations is only an example. Change it to the actual installation directory. + +- Service component users are created by the administrator as required. + + A machine-machine user needs to download the **keytab** file and a human-machine user needs to change the password upon the first login. + +- If a non-**root** user uses the HBase client, ensure that the owner of the HBase client directory is this user. Otherwise, run the following command to change the owner.
+ + **chown user:group -R** *Client installation directory*\ **/HBase** + +Using the HBase Client (Versions Earlier Than MRS 3.x) +------------------------------------------------------ + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client directory: + + **cd** **/opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permission to create HBase tables. For details about how to configure a role with corresponding permissions, see `Creating a Role `__ To bind a role to a user, see `Creating a User `__. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *Component service user* + + For example, **kinit hbaseuser**. + +#. Run the following HBase client command: + + **hbase shell** + +Using the HBase Client (MRS 3.x or Later) +----------------------------------------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client directory: + + **cd** **/opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If you use the client to connect to a specific HBase instance in a scenario where multiple HBase instances are installed, run the following command to load the environment variables of the instance. Otherwise, skip this step. For example, to load the environment variables of the HBase2 instance, run the following command: + + **source HBase2/component_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permission to create HBase tables. For details about how to configure a role with corresponding permissions, see `Creating a Role `__ To bind a role to a user, see `Creating a User `__. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *Component service user* + + For example, **kinit hbaseuser**. + +#. Run the following HBase client command: + + **hbase shell** + +Common HBase client commands +---------------------------- + +The following table lists common HBase client commands. For more commands, see http://hbase.apache.org/2.2/book.html. + +.. table:: **Table 1** HBase client commands + + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Command | Description | + +==========+=================================================================================================================================================================================================================================+ + | create | Used to create a table, for example, **create 'test', 'f1', 'f2', 'f3'**. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | disable | Used to disable a specified table, for example, **disable 'test'**. 
| + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | enable | Used to enable a specified table, for example, **enable 'test'**. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | alter | Used to alter the table structure. You can run the **alter** command to add, modify, or delete column family information and table-related parameter values, for example, **alter 'test', {NAME => 'f3', METHOD => 'delete'}**. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | describe | Used to obtain the table description, for example, **describe 'test'**. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | drop | Used to delete a specified table, for example, **drop 'test'**. Before deleting a table, you must stop it. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | put | Used to write the value of a specified cell, for example, **put 'test','r1','f1:c1','myvalue1'**. The cell location is unique and determined by the table, row, and column. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | get | Used to get the value of a row or the value of a specified cell in a row, for example, **get 'test','r1'**. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | scan | Used to query table data, for example, **scan 'test'**. The table name and scanner must be specified in the command. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_an_hdfs_client.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_an_hdfs_client.rst new file mode 100644 index 0000000..5e3e678 --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_an_hdfs_client.rst @@ -0,0 +1,101 @@ +:original_name: mrs_01_24188.html + +.. _mrs_01_24188: + +Using an HDFS Client +==================== + +Scenario +-------- + +This section describes how to use the HDFS client in an O&M scenario or service scenario. + +Prerequisites +------------- + +- The client has been installed. 
+ + For example, the installation directory is **/opt/hadoopclient**. The client directory in the following operations is only an example. Change it to the actual installation directory. + +- Service component users are created by the administrator as required. In security mode, machine-machine users need to download the keytab file. A human-machine user needs to change the password upon the first login. (This operation is not required in normal mode.) + +Using the HDFS Client +--------------------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If the cluster is in security mode, run the following command to authenticate the user. In normal mode, user authentication is not required. + + **kinit** *Component service user* + +#. Run the HDFS Shell command. Example: + + **hdfs dfs -ls /** + +Common HDFS Client Commands +--------------------------- + +The following table lists common HDFS client commands. + +For more commands, see https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/CommandsManual.html#User_Commands. + +.. table:: **Table 1** Common HDFS client commands + + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | Command | Description | Example | + +================================================================================+=============================================================+=========================================================================================+ + | **hdfs dfs -mkdir** *Folder name* | Used to create a folder. | **hdfs dfs -mkdir /tmp/mydir** | + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | **hdfs dfs -ls** *Folder name* | Used to view a folder. | **hdfs dfs -ls /tmp** | + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | **hdfs dfs -put** *Local file on the client node* *Specified HDFS path* | Used to upload a local file to a specified HDFS path. | **hdfs dfs -put /opt/test.txt /tmp** | + | | | | + | | | Upload the **/opt/test.txt** file on the client node to the **/tmp** directory of HDFS. | + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | **hdfs dfs -get** *Specified file on HDFS* *Specified path on the client node* | Used to download the HDFS file to the specified local path. | **hdfs dfs -get /tmp/test.txt /opt/** | + | | | | + | | | Download the **/tmp/test.txt** file on HDFS to the **/opt** path on the client node. 
| + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | **hdfs dfs -rm -r -f** *Specified folder on HDFS* | Used to delete a folder. | **hdfs dfs -rm -r -f /tmp/mydir** | + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | **hdfs dfs -chmod** *Permission parameter File directory* | Used to configure the HDFS directory permission for a user. | **hdfs dfs -chmod 700 /tmp/test** | + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + +Client-related FAQs +------------------- + +#. What do I do when the HDFS client exits abnormally and error message "java.lang.OutOfMemoryError" is displayed after the HDFS client command is running? + + This problem occurs because the memory required for running the HDFS client exceeds the preset upper limit (128 MB by default). You can change the memory upper limit of the client by modifying **CLIENT_GC_OPTS** in **\ **/HDFS/component_env**. For example, if you want to set the upper limit to 1 GB, run the following command: + + .. code-block:: + + CLIENT_GC_OPTS="-Xmx1G" + + After the modification, run the following command to make the modification take effect: + + **source** <*Client installation path*>/**/bigdata_env** + +#. How do I set the log level when the HDFS client is running? + + By default, the logs generated during the running of the HDFS client are printed to the console. The default log level is INFO. To enable the DEBUG log level for fault locating, run the following command to export an environment variable: + + **export HADOOP_ROOT_LOGGER=DEBUG,console** + + Then run the HDFS Shell command to generate the DEBUG logs. + + If you want to print INFO logs again, run the following command: + + **export HADOOP_ROOT_LOGGER=INFO,console** + +#. How do I delete HDFS files permanently? + + HDFS provides a recycle bin mechanism. Typically, after an HDFS file is deleted, the file is moved to the recycle bin of HDFS. If the file is no longer needed and the storage space needs to be released, clear the corresponding recycle bin directory, for example, **hdfs://hacluster/user/xxx/.Trash/Current/**\ *xxx*. diff --git a/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_the_oozie_client.rst b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_the_oozie_client.rst new file mode 100644 index 0000000..412612a --- /dev/null +++ b/umn/source/using_an_mrs_client/using_the_client_of_each_component/using_the_oozie_client.rst @@ -0,0 +1,76 @@ +:original_name: mrs_01_24193.html + +.. _mrs_01_24193: + +Using the Oozie Client +====================== + +Scenario +-------- + +This section describes how to use the Oozie client in an O&M scenario or service scenario. + +Prerequisites +------------- + +- The client has been installed. For example, the installation directory is **/opt/client**. The client directory in the following operations is only an example. + +- Service component users are created by the administrator as required. 
In security mode, machine-machine users need to download the keytab file. A human-machine user must change the password upon the first login. + + +Using the Oozie Client +---------------------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to switch to the client installation directory (change it to the actual installation directory): + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Check the cluster authentication mode. + + - If the cluster is in security mode, run the following command to authenticate the user (*exampleUser* indicates the name of the user who submits tasks): + + **kinit** *exampleUser* + + - If the cluster is in normal mode, go to :ref:`5 `. + +#. .. _mrs_01_24193__en-us_topic_0264266061_li1585491617519: + + Perform the following operations to configure Hue: + + a. Configure the Spark2x environment (skip this step if the Spark2x task is not involved): + + **hdfs dfs -put /opt/client/Spark2x/spark/jars/*.jar /user/oozie/share/lib/spark2x/** + + When the JAR package in the HDFS directory **/user/oozie/share** changes, you need to restart the Oozie service. + + b. Upload the Oozie configuration file and JAR package to HDFS. + + **hdfs dfs -mkdir /user/**\ *exampleUser* + + **hdfs dfs -put -f /opt/client/Oozie/oozie-client-*/examples /user/**\ *exampleUser*/ + + .. note:: + + - *exampleUser* indicates the name of the user who submits tasks. + + - If the user who submits the task does not change and files other than **job.properties** remain unchanged, the **Oozie/oozie-client-*/examples** directory in the client installation directory can be reused after being uploaded to HDFS once. + + - Resolve the Jetty JAR file conflict between Spark and Yarn: + + **hdfs dfs -rm -f /user/oozie/share/lib/spark/jetty-all-9.2.22.v20170606.jar** + + - In normal mode, if **Permission denied** is displayed during the upload, run the following commands: + + **su - omm** + + **source /opt/client/bigdata_env** + + **hdfs dfs -chmod -R 777 /user/oozie** + + **exit**
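After the preceding configuration is complete, the Oozie command line can be used to submit one of the uploaded example workflows and to query its status. The Oozie server URL, port, and **job.properties** path below are placeholders that depend on your cluster (11000 is the default port of open-source Oozie); adjust them before running the commands:

.. code-block::

   # Submit the MapReduce example workflow (server URL and properties path are placeholders)
   oozie job -oozie https://oozie_server_ip:port/oozie -config /opt/client/Oozie/oozie-client-*/examples/apps/map-reduce/job.properties -run

   # Query the status of a submitted workflow job by its job ID
   oozie job -oozie https://oozie_server_ip:port/oozie -info job_ID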