Help Center Spider
About
This is a spider tool that visits all links on https://docs.otc.t-systems.com to find incorrect URLs.
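To illustrate the general idea of visiting links, here is a minimal sketch of how links could be collected from a fetched page using only the Python standard library. It is not taken from this repository: the names `LinkExtractor`, `extract_links` and `is_internal` are hypothetical, and the real tool may work differently.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags in an HTML page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop the #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.add(absolute)


def extract_links(base_url, html):
    """Return the set of absolute URLs linked from the given HTML text."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def is_internal(url, base_url):
    """True if url points to the same host as base_url."""
    return urlparse(url).netloc == urlparse(base_url).netloc
```

A spider would typically fetch each internal link in turn, record the ones that fail to load, and repeat the extraction on every page it has not seen yet.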
Requirements
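After cloning the repository, you need to prepare an environment to run the tool. The easiest way is a Python virtual environment; activate it before installing the dependencies so they are installed into the virtual environment:
$ cd <local_folder>/
$ python -m venv venv/
$ source venv/bin/activate
$ python -m pip install -r requirements.txt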
Configuration
In config.json you can define the following items:
- watchdog_file: if you run the tool in the background and want to stop it properly (not using kill), just send an exit message into the watchdog file: echo exit > watchdog.fifo
- timer_runtime: maximum runtime limit in seconds
- log_dir: logging folder
- logging_interval: frequency of dumping log files
- workers: number of workers (background processes) to run. If set to 0, the count is derived from the number of cores (number_of_cores - 1)
- starting_point: base URL where the spider starts
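The options above can be sketched in Python as follows. This is only an illustration: the key names come from the list above, but the values in `EXAMPLE_CONFIG` are made up, the unit of `logging_interval` is not stated in this README, and the helpers `load_config` and `resolve_workers` are hypothetical rather than functions of the tool. Falling back to one worker on a single-core machine is also an assumption.

```python
import json
import os

# Illustrative values only; adjust them to your setup.
EXAMPLE_CONFIG = {
    "watchdog_file": "watchdog.fifo",
    "timer_runtime": 3600,        # maximum runtime in seconds
    "log_dir": "log",
    "logging_interval": 60,       # unit assumed, not documented here
    "workers": 0,                 # 0 = derive from the number of cores
    "starting_point": "https://docs.otc.t-systems.com",
}


def load_config(path="config.json"):
    """Read the JSON configuration file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def resolve_workers(configured, cpu_count=None):
    """Return the effective worker count; 0 means number_of_cores - 1."""
    if configured > 0:
        return configured
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # Assumption: never go below one worker, even on a single-core machine.
    return max(1, cores - 1)
```

With `workers` set to 0 on an 8-core machine, `resolve_workers(0, cpu_count=8)` yields 7 workers, which leaves one core free for the main process.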
How to run
There are two ways to run the tool:
In foreground
$ source venv/bin/activate
$ python main.py
In background
$ source venv/bin/activate
$ nohup python main.py > log/hc_spider.log 2> log/hc_spider.err <&- &
If you are running the tool in the background, you can stop the execution with $ echo exit > <watchdog_file>
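How a process might react to that exit message can be sketched as below. This is a minimal illustration on a POSIX system, assuming the watchdog file is a named pipe (FIFO) and that shutdown is signalled through a `threading.Event`; the function names `watch_for_exit` and `request_exit` are hypothetical, and the tool's actual implementation may differ.

```python
import os
import threading


def watch_for_exit(fifo_path, stop_event):
    """Block on the FIFO and set stop_event once a line reading 'exit' arrives."""
    if not os.path.exists(fifo_path):
        os.mkfifo(fifo_path)
    while not stop_event.is_set():
        # open() blocks until a writer connects; the loop ends when the writer closes.
        with open(fifo_path) as fifo:
            for line in fifo:
                if line.strip() == "exit":
                    stop_event.set()
                    return


def request_exit(fifo_path):
    """Python equivalent of: echo exit > <watchdog_file>"""
    with open(fifo_path, "w") as fifo:
        fifo.write("exit\n")
```

The watcher would usually run in a background thread while the main loop checks `stop_event.is_set()` between units of work, so the process can flush its logs and shut down cleanly instead of being killed.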
Description