Arachnid is a fast and powerful web scraping framework for Crystal. It provides an easy-to-use DSL for scraping webpages and processing all of the things you might come across.
Arachnid has a ton of configuration options which can be passed to the methods listed below in [Crawling](#crawling) and to the constructor for `Arachnid::Agent`. They are as follows:
- **read_timeout** - Read timeout
- **connect_timeout** - Connect timeout
- **max_redirects** - Maximum amount of redirects to follow
- **do_not_track** - Sets the DNT header
- **default_headers** - Default HTTP headers to use for all hosts
- **host_header** - HTTP host header to use
- **host_headers** - HTTP headers to use for specific hosts
- **user_agent** - User agent for the crawler
- **referer** - Referer to use
- **fetch_delay** - Delay in between fetching resources
- **queue** - Preload the queue with URLs
- **history** - Links that should not be visited
- **limit** - Maximum number of resources to visit
- **max_depth** - Maximum crawl depth
- **filter_options** - Passed to [`initialize_filters`]()
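As a sketch, a few of these options might be passed to the agent's constructor as keyword arguments. The keyword-argument style and the values chosen here are assumptions; the option names come from the list above.

```crystal
require "arachnid"

# A minimal sketch of configuring an agent. Keyword arguments are
# assumed to match the option names listed above.
agent = Arachnid::Agent.new(
  user_agent: "MyCrawler/1.0",
  fetch_delay: 1,      # wait between fetching resources
  max_depth: 3,        # don't follow links more than 3 levels deep
  limit: 100,          # stop after 100 resources
  do_not_track: true   # send the DNT header
)
```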
There are also a few class properties on `Arachnid` itself which are used as the defaults, unless overridden.
- **do_not_track**
- **max_redirects**
- **connect_timeout**
- **read_timeout**
- **user_agent**
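Since these are class properties, the defaults can be set once and picked up by every agent created afterwards. A minimal sketch (the setter names follow from the property names above; the values are arbitrary):

```crystal
require "arachnid"

# Set crawl-wide defaults once. Any agent created after this point
# uses these values unless it overrides them itself.
Arachnid.user_agent = "MyCrawler/1.0"
Arachnid.connect_timeout = 10
Arachnid.read_timeout = 30
Arachnid.do_not_track = true
```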
### Crawling
Arachnid provides three interfaces to use for crawling:

`start_at` is what you want to use if you're going to be doing a full crawl of multiple sites. It doesn't filter any URLs by default and will scan every link it encounters.
All of these methods can also take a block instead of a pattern, where the block returns `true` or `false`. The only difference between `links` and `urls` in this case is the block argument: `links` receives a `String` and `urls` a `URI`. Honestly, I'll probably get rid of `links` soon and just make it `urls`.

`exts` looks at the extension, if it exists, and filters based on that.
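Putting that together, a full crawl might look something like the sketch below. `start_at` and `every_url` come from this document; the module-level call style and the block-yielding-an-agent form are assumptions.

```crystal
require "arachnid"

# Full crawl beginning at a single URL. `start_at` doesn't filter
# any URLs by default, so every link encountered gets scanned.
Arachnid.start_at("https://crystal-lang.org") do |agent|
  agent.every_url do |url|
    puts url
  end
end
```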
### Events
Every crawled "page" is referred to as a resource, since sometimes it will be HTML/XML, sometimes JavaScript or CSS, and sometimes images, videos, zip files, etc. Every time a resource is scanned, one of several events is called. They are:
#### `every_url(&block : URI ->)`
Pass each URL from each resource visited to the given block.
#### `every_failed_url(&block : URI ->)`
Pass each URL that could not be requested to the given block.
#### `every_url_like(pattern, &block : URI ->)`
Pass every URL that the agent visits, and matches a given pattern, to a given block.
#### `urls_like(pattern, &block : URI ->)`
Same as `every_url_like`
#### `all_headers(&block : HTTP::Headers ->)`
Pass the headers from every response the agent receives to a given block.
#### `every_resource(&block : Resource ->)`
Pass every resource that the agent visits to a given block.
#### `every_ok_page(&block : Resource ->)`
Pass every OK resource that the agent visits to a given block.
#### `every_redirect_page(&block : Resource ->)`
Pass every Redirect resource that the agent visits to a given block.
#### `every_timedout_page(&block : Resource ->)`
Pass every Timeout resource that the agent visits to a given block.
Pass every resource with a matching content type to the given block.
#### `every_link(&block : URI, URI ->)`
Pass every origin and destination URI of each link to a given block.
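Several of these hooks can be registered on the same crawl. A sketch combining a few of the events above (the `start_at`-with-block form is an assumption; the event names come from the list above):

```crystal
require "arachnid"

Arachnid.start_at("https://example.com") do |agent|
  # Log every URL that couldn't be requested.
  agent.every_failed_url do |url|
    puts "failed: #{url}"
  end

  # Record the link graph: origin URI -> destination URI.
  agent.every_link do |origin, dest|
    puts "#{origin} -> #{dest}"
  end
end
```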
### Content Types
Every resource has an associated content type, and the `Resource` class itself provides several convenience methods to check it. You can find all of them [here]().
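Inside an event handler those checks might look like the sketch below. The predicate names (`html?`, `image?`) are assumptions modeled on the content types mentioned above; see the linked docs for the actual list.

```crystal
agent.every_resource do |resource|
  # Hypothetical predicate names; check the Resource docs for
  # the methods the shard actually provides.
  if resource.html?
    puts "HTML page: #{resource.url}"
  elsif resource.image?
    puts "image: #{resource.url}"
  end
end
```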
### Parsing HTML
Every HTML/XML resource has full access to the suite of methods provided by [Crystagiri](), allowing you to more easily search by CSS selector.
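For instance, extracting links by selector might look like this sketch. `every_ok_page` comes from the events above; the `css` method is Crystagiri's, and its delegation from the resource is assumed here.

```crystal
agent.every_ok_page do |resource|
  # Search by CSS selector; Crystagiri yields a tag wrapping an
  # XML::Node, so attributes are read off `tag.node`.
  resource.css("a") do |tag|
    puts tag.node["href"]?
  end
end
```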