WebPalm is a command-line tool that traverses a website and generates a tree of its webpages and their links. It recursively follows each link found on a page and continues until the requested depth has been explored. Beyond generating a site map, WebPalm can extract data from the body of each page using regular expressions and save the results to a file, which is useful for web scraping or pulling out specific information.
This tool is intended for legal purposes only; you are responsible for your own actions.
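To make the recursion concrete, here is a minimal sketch of the idea in Go. It is not WebPalm's actual implementation: link extraction uses a naive regex, and the `crawl` function, its depth handling, and the example URL are illustrative assumptions.

```go
package main

// A minimal sketch of the recursive-crawl idea, not WebPalm's actual
// implementation: fetch a page, pull out links with a naive regex,
// and recurse until the requested depth is exhausted.

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
)

var hrefRe = regexp.MustCompile(`href="(https?://[^"]+)"`)

// crawl prints url, then visits every link found in its body,
// going at most `level` links deep and skipping visited pages.
func crawl(url string, level int, seen map[string]bool, indent string) {
	if level < 0 || seen[url] {
		return
	}
	seen[url] = true
	fmt.Println(indent + url)

	resp, err := http.Get(url)
	if err != nil {
		return
	}
	body, _ := io.ReadAll(resp.Body)
	resp.Body.Close()

	for _, m := range hrefRe.FindAllStringSubmatch(string(body), -1) {
		crawl(m[1], level-1, seen, indent+"  ")
	}
}

func main() {
	// Equivalent in spirit to: webpalm -u https://example.com -l1
	crawl("https://example.com", 1, map[string]bool{}, "")
}
```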
- Generate a palm tree struct of web URLs
- Dump data from page bodies using regular expressions
- Multi-threading and parallelism
- Export the web tree to JSON, XML, or TXT
- Fast and easy to use
- Colorized output and error handling
```bash
git clone https://github.com/Malwarize/webpalm.git
cd webpalm
go build -o webpalm && ./webpalm
```
Alternatively, you can download a prebuilt binary from the Releases page:
```bash
wget https://github.com/Malwarize/webpalm/releases/download/v0.0.1/webpalm_x.x.x_os_arch.tar.gz
tar -xvf webpalm_x.x.x_os_arch.tar.gz
cd webpalm
./webpalm
```
Or install it directly with Go:

```bash
go install github.com/Malwarize/webpalm/v2@latest
```
```text
$ webpalm -h
Flags:
  -d, --delay int               delay (ms) between each request / ex: -d 200
  -x, --exclude-code ints       status codes to exclude / ex: -x 404,500
  -h, --help                    help for webpalm
  -i, --include strings         include only domains / ex: -i google.com,facebook.com
  -l, --level int               level of palming / ex: -l2
  -o, --output string           file to export the result (f.json, f.xml, f.txt) / ex: -o result.json
  -p, --proxy string            proxy to use / ex: -p http://proxy.com:8080
      --regexes stringToString  regexes to match in each page / ex: --regexes comments="\<\!--.*?-->" (default [])
  -t, --timeout int             timeout in seconds / ex: -t 10 (default 10)
  -u, --url string              target url / ex: -u https://google.com
  -a, --user-agent string       user agent to use / ex: -a chrome, firefox, safari, ie, edge, opera, android, ios, custom
  -v, --version                 version for webpalm
  -w, --worker int              number of workers for multi-threading / ex: -w 10
```
```bash
webpalm -u https://google.com -l1
# or
webpalm -u https://google.com -l1 -w 3  # 3 workers (multi-threading)
```
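For intuition, here is a small, self-contained Go sketch of the bounded-worker pattern that a flag like `-w` suggests; the URLs and structure are illustrative assumptions, not WebPalm's code:

```go
package main

// A hedged sketch of the bounded-worker idea behind -w: a fixed
// number of goroutines pull URLs from a channel and fetch them
// concurrently.

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	urls := []string{"https://example.com", "https://example.org", "https://example.net"}
	jobs := make(chan string)
	var wg sync.WaitGroup

	const workers = 3 // analogous to -w 3
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				resp, err := http.Get(u)
				if err != nil {
					fmt.Println(u, "error:", err)
					continue
				}
				resp.Body.Close()
				fmt.Println(u, resp.StatusCode)
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
}
```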
```bash
webpalm -u https://google.com -l1 -x 404,500
```
```bash
webpalm -u https://google.com -l1 --regexes comments="\<\!--.*?-->" -o result.json
```
This will dump the HTML comments found in the body of each page into result.json.
```bash
webpalm -u https://google.com -l1 --regexes comments="\<\!--.*?-->",emails="([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"
```
This will dump both the comments and the email addresses found in the body of each page.
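Conceptually, the regex dump boils down to running each named pattern over every page body and collecting the matches. A minimal Go illustration follows; the hard-coded `body` and the pattern map are assumptions made for the sake of a runnable example:

```go
package main

// Illustrative sketch of what --regexes does conceptually: run each
// named pattern over a page body and collect the matches.

import (
	"fmt"
	"regexp"
)

func main() {
	// Stand-in for a fetched page body.
	body := `<!-- TODO: remove debug --> contact: admin@example.com`

	patterns := map[string]string{
		"comments": `<!--.*?-->`,
		"emails":   `[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+`,
	}

	for name, p := range patterns {
		re := regexp.MustCompile(p)
		fmt.Println(name, "=>", re.FindAllString(body, -1))
	}
}
```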
```bash
webpalm -u https://google.com -l3 -o result.xml
webpalm -u https://google.com -l2 -o result.txt
```
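For a sense of what a JSON export of a web tree can look like, here is a hypothetical sketch; the `Node` struct and its field names are assumptions, not WebPalm's actual output schema:

```go
package main

// Hypothetical sketch of how a crawled web tree could map to JSON:
// each node holds a URL and its child pages.

import (
	"encoding/json"
	"fmt"
)

type Node struct {
	URL      string `json:"url"`
	Children []Node `json:"children,omitempty"`
}

func main() {
	tree := Node{
		URL: "https://google.com",
		Children: []Node{
			{URL: "https://google.com/about"},
			{URL: "https://google.com/search"},
		},
	}
	out, _ := json.MarshalIndent(tree, "", "  ")
	fmt.Println(string(out))
}
```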
```bash
webpalm -u https://google.com -l2 -i google.com,facebook.com
```
This will crawl only URLs that contain google.com or facebook.com.
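A hedged sketch of what such include-filtering amounts to: keep a URL only if its host matches one of the allowed domains. The `included` helper here is hypothetical, not WebPalm's code:

```go
package main

// Hypothetical include-domain filter: a URL passes only if its host
// contains one of the allowed domains.

import (
	"fmt"
	"net/url"
	"strings"
)

func included(raw string, domains []string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	for _, d := range domains {
		if strings.Contains(u.Host, d) {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"google.com", "facebook.com"}
	fmt.Println(included("https://mail.google.com/x", allowed)) // true
	fmt.Println(included("https://example.org/", allowed))      // false
}
```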
```bash
webpalm -u https://google.com -l2 -w 100
```
| Regex | Pattern |
|---|---|
| emails | `([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)` |
| comments | `\<\!--.*?-->` |
| tokens | `[a-zA-Z0-9]{32}` |
| password | `\bpassword\b.{0,10}` |
Don't forget to escape the regexes when your shell requires it.
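For example, single quotes keep Bash from interpreting the pattern at all (inside double quotes, `!` can trigger history expansion, which is why the examples above escape it):

```bash
webpalm -u https://google.com -l1 --regexes comments='<!--.*?-->'
```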
You can run the unit tests to gain more confidence in enhancements or changes to the code by running `go test -v ./...`.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
You can also contact me on Discord: `xorbit`.