Scanning the web efficiently
Introduction#
In recent years, I've been taking part in more and more Capture The Flag (CTF) competitions. One of the most common tasks in these competitions is to find hidden flags on a web server, which usually involves scanning the server for hidden files and directories. This is a long and tedious process, and I felt that the existing tools were not efficient enough.
I decided to write my own tool to scan the web efficiently, and I called it rwalk.
I find it very important to have a good understanding of the inner workings of the tools we use, so this article will explain both the process of writing the tool and the features it offers.
Existing tools#
Before starting to write my own tool, let's take a look at the existing ones and see what they offer.
ffuf#
ffuf is a fast web fuzzer written in Go. It is based on a `FUZZ` keyword that is replaced by the words in a wordlist; it can thus be used to scan for parameters, headers, and directories.
Dirsearch#
dirsearch is a simple command line tool designed to brute force directories and files in web servers.
Writing the tool#
I chose Rust as the programming language for this tool because it's my favorite at the moment and has a great ecosystem for writing command line tools, with libraries like clap and reqwest.
The first functionality I needed was to scan a target recursively. This was achieved by using a simple tree structure.
The algorithm is quite simple (a sketch in code follows the list):
- We retrieve the nodes at the current depth. (The first time, it's the root node `/`.)
- We iterate over the wordlist and make a request to the server for each word.
- If the request is successful, we add the new node to the tree, with the previous node as its parent.
- We go to the next depth.
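In Rust-flavored pseudocode, the loop looks something like this (a simplified sketch, not rwalk's actual code; `is_hit` stands in for the real HTTP request):

```rust
/// Scan one depth level at a time, breadth-first.
/// `is_hit` stands in for the actual HTTP request logic.
fn scan(root: &str, wordlist: &[String], max_depth: usize, is_hit: impl Fn(&str) -> bool) {
    // Depth 0 contains only the root node `/`.
    let mut current_depth = vec![root.trim_end_matches('/').to_string()];

    for _ in 0..max_depth {
        let mut next_depth = Vec::new();
        for parent in &current_depth {
            for word in wordlist {
                let candidate = format!("{parent}/{word}");
                // A successful response makes `candidate` a child of `parent`.
                if is_hit(&candidate) {
                    println!("found: {candidate}");
                    next_depth.push(candidate);
                }
            }
        }
        // Descend one level: the nodes we just found become the new parents.
        current_depth = next_depth;
    }
}
```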
This single-threaded approach was already quite efficient, but I wanted to make it even faster with multi-threading.
Such an approach requires a bit of refactoring, but Rust makes it easy with the tokio library.
Chunks of the wordlist are distributed to different threads, and the results are collected at the end.
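A simplified sketch of that fan-out with tokio and reqwest (illustrative structure, not rwalk's actual code):

```rust
use std::sync::{Arc, Mutex};
use tokio::sync::mpsc;

/// Scan one depth level by splitting the wordlist into chunks,
/// spawning one task per chunk, and waiting for completion signals.
async fn scan_depth(parent: String, wordlist: Vec<String>, n_tasks: usize) -> Vec<String> {
    // The tree (reduced to a flat list here) is shared behind an Arc<Mutex<T>>,
    // so every worker can push into it directly.
    let found = Arc::new(Mutex::new(Vec::new()));
    let (tx, mut rx) = mpsc::channel::<()>(n_tasks);

    let chunk_size = wordlist.len().div_ceil(n_tasks).max(1);
    for chunk in wordlist.chunks(chunk_size) {
        let (tx, found, parent, chunk) =
            (tx.clone(), Arc::clone(&found), parent.clone(), chunk.to_vec());
        tokio::spawn(async move {
            for word in chunk {
                let url = format!("{parent}/{word}");
                if reqwest::get(url.as_str()).await.is_ok_and(|r| r.status().is_success()) {
                    found.lock().unwrap().push(url);
                }
            }
            tx.send(()).await.ok(); // signal the main task: this worker is done
        });
    }
    drop(tx); // keep only the workers' clones alive so `rx` can terminate

    // Wait until every worker has reported in.
    while rx.recv().await.is_some() {}
    found.lock().unwrap().clone()
}
```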
Notice that the `tx` + `rx` channel pattern is used to signal the main thread when a worker has finished its work.
We do not need to send any results back, as the tree is already wrapped in an `Arc<Mutex<T>>` and can be accessed from any thread.
Adding more features#
Filtering#
One of the most important features of a web scanner is to be able to filter the responses.
Basic filters include:
- Status code
- Response body
- Response size
- Response body hash
- Response headers
- Response time
I added those to rwalk and made them configurable via the command line.
The `check_range()` and `parse_range()` functions are used to parse and check a specific type of filter: ranges.
| Format | Description |
| --- | --- |
| `5` | Exactly 5 |
| `5-10` | Between 5 and 10 (inclusive) |
| `5,10` | Exactly 5 or 10 |
| `>5` | Greater than 5 |
| `<5` | Less than 5 |
| `5,10,15` | Exactly 5, 10, or 15 |
| `>5,10,15` | Greater than 5, or exactly 10 or 15 |
| `5-10,15-20` | Between 5 and 10, or between 15 and 20 (inclusive) |
This allows for a lot of flexibility when filtering the responses, as most response attributes reduce to a numeric comparison (greater than, less than, equal, etc.).
For example, let's say we want to keep only the responses that took less than 100 milliseconds: that's the range `<100` applied to the response time.
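Here is a simplified sketch of how such range expressions can be parsed and checked (illustrative only; rwalk's actual `parse_range()` and `check_range()` differ):

```rust
/// One comparison; a full range filter is a comma-separated list of these.
enum Range {
    Exact(usize),
    Between(usize, usize), // inclusive on both ends
    Greater(usize),
    Less(usize),
}

/// Parse a single part, e.g. "5", "5-10", ">5" or "<5".
fn parse_range(s: &str) -> Option<Range> {
    if let Some(n) = s.strip_prefix('>') {
        Some(Range::Greater(n.parse().ok()?))
    } else if let Some(n) = s.strip_prefix('<') {
        Some(Range::Less(n.parse().ok()?))
    } else if let Some((lo, hi)) = s.split_once('-') {
        Some(Range::Between(lo.parse().ok()?, hi.parse().ok()?))
    } else {
        Some(Range::Exact(s.parse().ok()?))
    }
}

/// A value passes the filter if it matches any comma-separated part.
fn check_range(filter: &str, value: usize) -> bool {
    filter.split(',').filter_map(parse_range).any(|r| match r {
        Range::Exact(n) => value == n,
        Range::Between(lo, hi) => (lo..=hi).contains(&value),
        Range::Greater(n) => value > n,
        Range::Less(n) => value < n,
    })
}

// check_range("<100", response_time_ms) keeps sub-100 ms responses;
// check_range("5-10,15-20", v) matches 5..=10 and 15..=20.
```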
We also needed a way to negate the filters, which is why the `!` prefix exists: it applies an XOR between the filter's output and the negation flag:
| Filter Output | Negated | Result |
| --- | --- | --- |
| true | false | true |
| false | false | false |
| true | true | false |
| false | true | true |
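In code, the whole negation logic is a single XOR:

```rust
/// `negated` is true when the filter expression was prefixed with `!`.
fn apply_negation(filter_output: bool, negated: bool) -> bool {
    filter_output ^ negated
}
```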
The default behavior is to `and` all the filters together, but `or` behavior can be enabled with the `--or` flag.
Most modern API responses are JSON, and we might want to filter them based on their content.
We need to be able to easily access the JSON fields and compare them to a value.
A simple path-based format is used to filter a JSON response: you address a field (or array index) inside the document and compare it to a value. For example, you can keep only the responses that contain `deadbeef` as the first item of the `data` array.
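Under the hood, this boils down to walking the parsed document down to a leaf and comparing it. A sketch with serde_json (the dot-separated path here is illustrative, not necessarily rwalk's exact syntax):

```rust
use serde_json::Value;

/// Sketch of a JSON filter: follow `path` (dot-separated keys and array
/// indices) into the parsed body and compare the leaf against `expected`.
fn json_matches(body: &str, path: &str, expected: &str) -> bool {
    let Ok(doc) = serde_json::from_str::<Value>(body) else {
        return false; // not JSON: the filter cannot match
    };
    // serde_json pointers are `/`-separated, so "data.0" becomes "/data/0".
    let pointer = format!("/{}", path.replace('.', "/"));
    match doc.pointer(&pointer) {
        Some(Value::String(s)) => s == expected,
        Some(other) => other.to_string() == expected,
        None => false,
    }
}

// json_matches(r#"{"data":["deadbeef"]}"#, "data.0", "deadbeef") == true
```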
Example#
Let's test out these filters with ffuf.me/cd/no404!

**Note:** See the [install section](#test-it-out-yourself) to install the tool and try it out yourself.

We start out with a basic scan, and we get back a bunch of 200s: this target answers 200 OK for every path, so filtering on status codes alone gets us nowhere.

Let's make use of the `--show` flag, which allows us to display additional information about each request, such as the body hash of each response. Since every bogus path returns the same error page, nearly all responses share the same hash.

We can now try our luck and search for a response whose hash does not match `1c3af61e88de6618c0fabbe48edf5ed9`.

And voilà! If you are curious, you can also see the content of the response with the `--show body` option.
Output#
You might want to save the results of your scan to a file, to be able to analyze them with another tool or to share them with someone else.
The available formats are:
- JSON
- CSV
- Markdown
- Plain text
But as I was using a tree structure to search for the paths, I thought it would be a good idea to print a nice tree at the end of each scan.
The library ptree was perfect for this.
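ptree makes this almost trivial; a minimal example of the kind of tree rwalk prints (the paths are made up):

```rust
use ptree::{print_tree, TreeBuilder};

fn main() -> std::io::Result<()> {
    // Build the result tree by hand; rwalk grows it during the scan.
    let tree = TreeBuilder::new("/".to_string())
        .begin_child("admin".to_string())
        .add_empty_child("login".to_string())
        .end_child()
        .add_empty_child("robots.txt".to_string())
        .build();
    // Pretty-print it to stdout with box-drawing characters.
    print_tree(&tree)
}
```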
Permutations#
In its current state, rwalk can only append words at the end of a path. But what if we needed to fuzz a path like `/admin/.../login/.../`?
This is why I added a new scanning mode called `classic`, which allows for both regular and permutation scanning.
The separation between these modes is due to the fact that they rely on different algorithms.
The permutation scanning mode is based on the itertools crate's `.permutations()` method.
We first compute the number of tokens in the URL, then generate all the `n`-permutations of the wordlist and replace the tokens with each permutation.
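A sketch of that idea with itertools (the `FUZZ` token name is borrowed from ffuf for illustration; it is not necessarily rwalk's placeholder):

```rust
use itertools::Itertools;

/// For a URL template with `n` tokens, try every ordered choice
/// of `n` distinct words from the wordlist.
fn permutation_urls(template: &str, wordlist: &[&str]) -> Vec<String> {
    let n = template.matches("FUZZ").count();
    wordlist
        .iter()
        .permutations(n)
        .map(|perm| {
            // Replace the i-th token with the i-th word of the permutation.
            perm.into_iter().fold(template.to_string(), |url, word| {
                url.replacen("FUZZ", word, 1)
            })
        })
        .collect()
}

// permutation_urls("/admin/FUZZ/login/FUZZ/", &["a", "b"])
//   -> ["/admin/a/login/b/", "/admin/b/login/a/"]
```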
Wordlists#
When you need to quickly scan a target, you might not want to use a huge wordlist.
rwalk comes with a bunch of filters and transformations to help you generate a wordlist on the fly.
Filters#
Let's say you need to scan a target for PHP files: you can use the `--filter` flag to keep only the words that end with `.php`. Or you might want to keep only 4-letter words.
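Conceptually, each wordlist filter is just a predicate applied to every word; a sketch of the two examples above (not rwalk's actual code):

```rust
/// Keep only PHP files.
fn keep_php(words: &[String]) -> Vec<String> {
    words.iter().filter(|w| w.ends_with(".php")).cloned().collect()
}

/// Keep only 4-letter words.
fn keep_four_letters(words: &[String]) -> Vec<String> {
    words.iter().filter(|w| w.chars().count() == 4).cloned().collect()
}
```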
Transformations#
You might want to transform the words in the wordlist to lowercase, uppercase, or capitalize them.
Or perhaps you want to replace each instance of `a` with `@`.
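Transformations work the same way, except that each word is mapped to a new one instead of being dropped; a sketch (again, not rwalk's actual code):

```rust
/// Lowercase every word, then swap each `a` for `@`.
fn transform(words: Vec<String>) -> Vec<String> {
    words
        .into_iter()
        .map(|w| w.to_lowercase())
        .map(|w| w.replace('a', "@"))
        .collect()
}
```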
These two features combined allow for a lot of flexibility when generating a granular wordlist.
You can try this feature with ffuf.me/cd/ext/logs: use a transformation that appends the `.log` suffix to every word, and rwalk will scan the target for files with this extension.
Throttling#
When scanning a target, you sometimes need to throttle the requests to avoid being blocked by the server.
Let's take a concrete example with ffuf.me/cd/rate.
We start with a basic scan, which yields an empty result: nothing has been matched. Let's take a look at the response codes by overriding the default filter.
We can see that we are being rate-limited by the server, so we need to throttle the requests. The number of requests per second is equal to the throttle value times the number of threads: with 110 threads and a throttle of 10, that would be 1100 requests per second. Let's go with 10 threads and a throttle of 5, i.e. 50 requests per second.
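In other words, each worker pauses for 1/throttle seconds between its own requests; a quick sketch of the arithmetic (illustrative, not rwalk's code):

```rust
use std::time::Duration;

/// Overall rate = throttle (requests per second per thread) * threads.
fn requests_per_second(threads: u64, throttle: u64) -> u64 {
    threads * throttle
}

/// Each worker waits 1/throttle seconds between requests:
/// with a throttle of 5, that is a 200 ms pause per worker.
fn delay_between_requests(throttle: u32) -> Duration {
    Duration::from_secs_f64(1.0 / f64::from(throttle))
}
```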
After a few minutes, we get our results.
Final result#
We now have a tool that is able to scan a target efficiently, approaching 2000 requests per second on a regular WiFi connection.
Test it out yourself#
You can find the source code on GitHub.
The tool can be installed with either brew or cargo (see the GitHub README for the exact commands).
I like to use ffuf.me to test out the tool, as it's a great target for web scanning.
A good wordlist to start with is common.txt.
*Recursive scan of ffuf.me*

*Classic scan of ffuf.me*

I hope you find this tool as useful as I do.
Feel free to open an issue or join the Discord if you have any questions or suggestions!