<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="generator" content="Crystal Docs 0.29.0">
<link href="css/style.css" rel="stylesheet" type="text/css">
<script type="text/javascript" src="js/doc.js"></script>
<script type="text/javascript">
CrystalDoc.base_path = "";
</script>
<meta id="repository-name" content="github.com/watzon/arachnid">
<title>README - github.com/watzon/arachnid</title>
</head>
<body>
<div class="sidebar">
<div class="sidebar-header">
<div class="search-box">
<input type="search" class="search-input" placeholder="Search..." spellcheck="false" aria-label="Search">
</div>
<div class="repository-links">
<a href="index.html">README</a>
</div>
</div>
<div class="search-results" class="hidden">
<ul class="search-list"></ul>
</div>
<div class="types-list">
<ul>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid" data-name="arachnid">
<a href="Arachnid.html">Arachnid</a>
<ul>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Agent" data-name="arachnid::agent">
<a href="Arachnid/Agent.html">Agent</a>
<ul>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Agent/Actions" data-name="arachnid::agent::actions">
<a href="Arachnid/Agent/Actions.html">Actions</a>
<ul>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Agent/Actions/Action" data-name="arachnid::agent::actions::action">
<a href="Arachnid/Agent/Actions/Action.html">Action</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Agent/Actions/Paused" data-name="arachnid::agent::actions::paused">
<a href="Arachnid/Agent/Actions/Paused.html">Paused</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Agent/Actions/RuntimeError" data-name="arachnid::agent::actions::runtimeerror">
<a href="Arachnid/Agent/Actions/RuntimeError.html">RuntimeError</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Agent/Actions/SkipLink" data-name="arachnid::agent::actions::skiplink">
<a href="Arachnid/Agent/Actions/SkipLink.html">SkipLink</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Agent/Actions/SkipResource" data-name="arachnid::agent::actions::skipresource">
<a href="Arachnid/Agent/Actions/SkipResource.html">SkipResource</a>
</li>
</ul>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Agent/Queue" data-name="arachnid::agent::queue">
<a href="Arachnid/Agent/Queue.html">Queue</a>
</li>
</ul>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/AuthCredential" data-name="arachnid::authcredential">
<a href="Arachnid/AuthCredential.html">AuthCredential</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/AuthStore" data-name="arachnid::authstore">
<a href="Arachnid/AuthStore.html">AuthStore</a>
</li>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Cli" data-name="arachnid::cli">
<a href="Arachnid/Cli.html">Cli</a>
<ul>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Action" data-name="arachnid::cli::action">
<a href="Arachnid/Cli/Action.html">Action</a>
</li>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library" data-name="arachnid::cli::command_main_command_of_clim_library">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library.html">Command_Main_command_of_clim_library</a>
<ul>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap" data-name="arachnid::cli::command_main_command_of_clim_library::command_sitemap">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap.html">Command_Sitemap</a>
<ul>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap" data-name="arachnid::cli::command_main_command_of_clim_library::command_sitemap::options_sitemap">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap.html">Options_Sitemap</a>
<ul>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_help" data-name="arachnid::cli::command_main_command_of_clim_library::command_sitemap::options_sitemap::option_help">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_help.html">Option_help</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_json" data-name="arachnid::cli::command_main_command_of_clim_library::command_sitemap::options_sitemap::option_json">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_json.html">Option_json</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_output" data-name="arachnid::cli::command_main_command_of_clim_library::command_sitemap::options_sitemap::option_output">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_output.html">Option_output</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_xml" data-name="arachnid::cli::command_main_command_of_clim_library::command_sitemap::options_sitemap::option_xml">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_xml.html">Option_xml</a>
</li>
</ul>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/RunProc" data-name="arachnid::cli::command_main_command_of_clim_library::command_sitemap::runproc">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/RunProc.html">RunProc</a>
</li>
</ul>
</li>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize" data-name="arachnid::cli::command_main_command_of_clim_library::command_summarize">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize.html">Command_Summarize</a>
<ul>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize" data-name="arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize.html">Options_Summarize</a>
<ul>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_codes" data-name="arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_codes">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_codes.html">Option_codes</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_elinks" data-name="arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_elinks">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_elinks.html">Option_elinks</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_help" data-name="arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_help">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_help.html">Option_help</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_ilinks" data-name="arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_ilinks">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_ilinks.html">Option_ilinks</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_limit" data-name="arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_limit">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_limit.html">Option_limit</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_output" data-name="arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_output">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_output.html">Option_output</a>
</li>
</ul>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/RunProc" data-name="arachnid::cli::command_main_command_of_clim_library::command_summarize::runproc">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/RunProc.html">RunProc</a>
</li>
</ul>
</li>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library" data-name="arachnid::cli::command_main_command_of_clim_library::options_main_command_of_clim_library">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library.html">Options_Main_command_of_clim_library</a>
<ul>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library/Option_help" data-name="arachnid::cli::command_main_command_of_clim_library::options_main_command_of_clim_library::option_help">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library/Option_help.html">Option_help</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library/Option_version" data-name="arachnid::cli::command_main_command_of_clim_library::options_main_command_of_clim_library::option_version">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library/Option_version.html">Option_version</a>
</li>
</ul>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/RunProc" data-name="arachnid::cli::command_main_command_of_clim_library::runproc">
<a href="Arachnid/Cli/Command_Main_command_of_clim_library/RunProc.html">RunProc</a>
</li>
</ul>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Count" data-name="arachnid::cli::count">
<a href="Arachnid/Cli/Count.html">Count</a>
</li>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Cli/Sitemap" data-name="arachnid::cli::sitemap">
<a href="Arachnid/Cli/Sitemap.html">Sitemap</a>
<ul>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Sitemap/LastMod" data-name="arachnid::cli::sitemap::lastmod">
<a href="Arachnid/Cli/Sitemap/LastMod.html">LastMod</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Cli/Sitemap/PageMap" data-name="arachnid::cli::sitemap::pagemap">
<a href="Arachnid/Cli/Sitemap/PageMap.html">PageMap</a>
</li>
</ul>
</li>
</ul>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/CookieJar" data-name="arachnid::cookiejar">
<a href="Arachnid/CookieJar.html">CookieJar</a>
</li>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Document" data-name="arachnid::document">
<a href="Arachnid/Document.html">Document</a>
<ul>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Document/HTML" data-name="arachnid::document::html">
<a href="Arachnid/Document/HTML.html">HTML</a>
<ul>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Document/HTML/Tag" data-name="arachnid::document::html::tag">
<a href="Arachnid/Document/HTML/Tag.html">Tag</a>
</li>
</ul>
</li>
</ul>
</li>
<li class="parent " data-id="github.com/watzon/arachnid/Arachnid/Resource" data-name="arachnid::resource">
<a href="Arachnid/Resource.html">Resource</a>
<ul>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Resource/ContentTypes" data-name="arachnid::resource::contenttypes">
<a href="Arachnid/Resource/ContentTypes.html">ContentTypes</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Resource/Cookies" data-name="arachnid::resource::cookies">
<a href="Arachnid/Resource/Cookies.html">Cookies</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Resource/HTML" data-name="arachnid::resource::html">
<a href="Arachnid/Resource/HTML.html">HTML</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Resource/StatusCodes" data-name="arachnid::resource::statuscodes">
<a href="Arachnid/Resource/StatusCodes.html">StatusCodes</a>
</li>
</ul>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/Rules" data-name="arachnid::rules(t)">
<a href="Arachnid/Rules.html">Rules</a>
</li>
<li class=" " data-id="github.com/watzon/arachnid/Arachnid/SessionCache" data-name="arachnid::sessioncache">
<a href="Arachnid/SessionCache.html">SessionCache</a>
</li>
</ul>
</li>
<li class=" " data-id="github.com/watzon/arachnid/URI" data-name="uri">
<a href="URI.html">URI</a>
</li>
</ul>
</div>
</div>
<div class="main-content">
<h1>Arachnid</h1>
<p>Arachnid is a fast and powerful web scraping framework for Crystal. It provides an easy-to-use DSL for crawling web pages and processing the resources you come across along the way.</p>
<ul><li><a href="#Arachnid" target="_blank">Arachnid</a></li><ul><li><a href="#Installation" target="_blank">Installation</a></li><li><a href="#Examples" target="_blank">Examples</a></li><li><a href="#Usage" target="_blank">Usage</a></li><li><a href="#Configuration" target="_blank">Configuration</a></li><li><a href="#Crawling" target="_blank">Crawling</a></li><li><a href="#Arachnidstartaturl-options-block--Agent" target="_blank"><code>Arachnid.start_at(url, **options, &block : Agent ->)</code></a></li><li><a href="#Arachnidsiteurl-options-block--Agent" target="_blank"><code>Arachnid.site(url, **options, &block : Agent ->)</code></a></li><li><a href="#Arachnidhostname-options-block--Agent" target="_blank"><code>Arachnid.host(name, **options, &block : Agent ->)</code></a></li><li><a href="#Crawling-Rules" target="_blank">Crawling Rules</a></li><li><a href="#Events" target="_blank">Events</a></li><li><a href="#everyurlblock--URI" target="_blank"><code>every_url(&block : URI ->)</code></a></li><li><a href="#everyfailedurlblock--URI" target="_blank"><code>every_failed_url(&block : URI ->)</code></a></li><li><a href="#everyurllikepattern-block--URI" target="_blank"><code>every_url_like(pattern, &block : URI ->)</code></a></li><li><a href="#urlslikepattern-block--URI" target="_blank"><code>urls_like(pattern, &block : URI ->)</code></a></li><li><a href="#allheadersblock--HTTPHeaders" target="_blank"><code>all_headers(&block : HTTP::Headers ->)</code></a></li><li><a href="#everyresourceblock--Resource" target="_blank"><code>every_resource(&block : Resource ->)</code></a></li><li><a href="#everyokpageblock--Resource" target="_blank"><code>every_ok_page(&block : Resource ->)</code></a></li><li><a href="#everyredirectpageblock--Resource" target="_blank"><code>every_redirect_page(&block : Resource ->)</code></a></li><li><a href="#everytimedoutpageblock--Resource" target="_blank"><code>every_timedout_page(&block : Resource ->)</code></a></li><li><a 
href="#everybadrequestpageblock--Resource" target="_blank"><code>every_bad_request_page(&block : Resource ->)</code></a></li><li><a href="#def-everyunauthorizedpageblock--Resource" target="_blank"><code>every_unauthorized_page(&block : Resource ->)</code></a></li><li><a href="#everyforbiddenpageblock--Resource" target="_blank"><code>every_forbidden_page(&block : Resource ->)</code></a></li><li><a href="#everymissingpageblock--Resource" target="_blank"><code>every_missing_page(&block : Resource ->)</code></a></li><li><a href="#everyinternalservererrorpageblock--Resource" target="_blank"><code>every_internal_server_error_page(&block : Resource ->)</code></a></li><li><a href="#everytxtpageblock--Resource" target="_blank"><code>every_txt_page(&block : Resource ->)</code></a></li><li><a href="#everyhtmlpageblock--Resource" target="_blank"><code>every_html_page(&block : Resource ->)</code></a></li><li><a href="#everyxmlpageblock--Resource" target="_blank"><code>every_xml_page(&block : Resource ->)</code></a></li><li><a href="#everyxslpageblock--Resource" target="_blank"><code>every_xsl_page(&block : Resource ->)</code></a></li><li><a href="#everydocblock--DocumentHTML--XMLNode" target="_blank"><code>every_doc(&block : Document::HTML | XML::Node ->)</code></a></li><li><a href="#everyhtmldocblock--DocumentHTML--XMLNode" target="_blank"><code>every_html_doc(&block : Document::HTML | XML::Node ->)</code></a></li><li><a href="#everyxmldocblock--XMLNode" target="_blank"><code>every_xml_doc(&block : XML::Node ->)</code></a></li><li><a href="#everyxsldocblock--XMLNode" target="_blank"><code>every_xsl_doc(&block : XML::Node ->)</code></a></li><li><a href="#everyrssdocblock--XMLNode" target="_blank"><code>every_rss_doc(&block : XML::Node ->)</code></a></li><li><a href="#everyatomdocblock--XMLNode" target="_blank"><code>every_atom_doc(&block : XML::Node ->)</code></a></li><li><a href="#everyjavascriptblock--Resource" target="_blank"><code>every_javascript(&block : Resource 
->)</code></a></li><li><a href="#everycssblock--Resource" target="_blank"><code>every_css(&block : Resource ->)</code></a></li><li><a href="#everyrssblock--Resource" target="_blank"><code>every_rss(&block : Resource ->)</code></a></li><li><a href="#everyatomblock--Resource" target="_blank"><code>every_atom(&block : Resource ->)</code></a></li><li><a href="#everymswordblock--Resource" target="_blank"><code>every_ms_word(&block : Resource ->)</code></a></li><li><a href="#everypdfblock--Resource" target="_blank"><code>every_pdf(&block : Resource ->)</code></a></li><li><a href="#everyzipblock--Resource" target="_blank"><code>every_zip(&block : Resource ->)</code></a></li><li><a href="#everyimageblock--Resource" target="_blank"><code>every_image(&block : Resource ->)</code></a></li><li><a href="#everycontenttypecontenttype--String--Regex-block--Resource" target="_blank"><code>every_content_type(content_type : String | Regex, &block : Resource ->)</code></a></li><li><a href="#everylinkblock--URI-URI" target="_blank"><code>every_link(&block : URI, URI ->)</code></a></li><li><a href="#Content-Types" target="_blank">Content Types</a></li><li><a href="#Parsing-HTML" target="_blank">Parsing HTML</a></li><li><a href="#Contributing" target="_blank">Contributing</a></li><li><a href="#Contributors" target="_blank">Contributors</a></li></ul></ul>
<h2>Installation</h2>
<ol><li>Add the dependency to your <code>shard.yml</code>:
<pre><code class="language-yaml">dependencies:
  arachnid:
    github: watzon/arachnid
    version: ~&gt; 0.1.0</code></pre>
</li>
<li>Run <code>shards install</code></li></ol>
<h2>Examples</h2>
<p>Arachnid provides an easy-to-use, powerful DSL for scraping websites.</p>
<pre><code class="language-crystal"><span class="k">require</span> <span class="s">&quot;arachnid&quot;</span>
<span class="k">require</span> <span class="s">&quot;json&quot;</span>

<span class="c"># Let&#39;s build a sitemap of crystal-lang.org</span>
<span class="c"># Links will be a hash of url to resource title</span>
links <span class="o">=</span> {} <span class="k">of</span> <span class="t">String</span> => <span class="t">String</span>

<span class="c"># Visit a particular host, in this case `crystal-lang.org`. This will</span>
<span class="c"># not match on subdomains.</span>
<span class="t">Arachnid</span>.host(<span class="s">&quot;https://crystal-lang.org&quot;</span>) <span class="k">do</span> <span class="o">|</span>spider<span class="o">|</span>
  <span class="c"># Ignore the API section. It&#39;s a little big.</span>
  spider.ignore_urls_like(<span class="s">/\/(api)\//</span>)

  spider.every_html_page <span class="k">do</span> <span class="o">|</span>page<span class="o">|</span>
    puts <span class="s">&quot;Visiting </span><span class="i">#{</span>page.url.to_s<span class="i">}</span><span class="s">&quot;</span>

    <span class="c"># Ignore redirects for our sitemap</span>
    <span class="k">unless</span> page.redirect?
      <span class="c"># Add the url of every visited page to our sitemap</span>
      links[page.url.to_s] <span class="o">=</span> page.title.to_s.strip
    <span class="k">end</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="t">File</span>.write(<span class="s">&quot;crystal-lang.org-sitemap.json&quot;</span>, links.to_pretty_json)</code></pre>
<p>Want to scan external links as well?</p>
<pre><code class="language-crystal"><span class="c"># To make things interesting, this time let&#39;s download</span>
<span class="c"># every image we find.</span>
<span class="t">Arachnid</span>.start_at(<span class="s">&quot;https://crystal-lang.org&quot;</span>) <span class="k">do</span> <span class="o">|</span>spider<span class="o">|</span>
  <span class="c"># Set a base path to store all the images at</span>
  base_image_dir <span class="o">=</span> <span class="t">File</span>.expand_path(<span class="s">&quot;~/Pictures/arachnid&quot;</span>)
  <span class="t">Dir</span>.mkdir_p(base_image_dir)

  <span class="c"># You could also use `every_image`. This allows us to</span>
  <span class="c"># track the crawler though.</span>
  spider.every_resource <span class="k">do</span> <span class="o">|</span>resource<span class="o">|</span>
    puts <span class="s">&quot;Scanning </span><span class="i">#{</span>resource.url.to_s<span class="i">}</span><span class="s">&quot;</span>

    <span class="k">if</span> resource.image?
      <span class="c"># Since we&#39;re going to be saving a lot of images</span>
      <span class="c"># let&#39;s spawn a new fiber for each one. This</span>
      <span class="c"># makes things so much faster.</span>
      spawn <span class="k">do</span>
        <span class="c"># Output directory for images for this host</span>
        directory <span class="o">=</span> <span class="t">File</span>.join(base_image_dir, resource.url.host.to_s)
        <span class="t">Dir</span>.mkdir_p(directory)

        <span class="c"># The name of the image</span>
        filename <span class="o">=</span> <span class="t">File</span>.basename(resource.url.path)

        <span class="c"># Save the image using the body of the resource</span>
        puts <span class="s">&quot;Saving </span><span class="i">#{</span>filename<span class="i">}</span><span class="s"> to </span><span class="i">#{</span>directory<span class="i">}</span><span class="s">&quot;</span>
        <span class="t">File</span>.write(<span class="t">File</span>.join(directory, filename), resource.body)
      <span class="k">end</span>
    <span class="k">end</span>
  <span class="k">end</span>
<span class="k">end</span></code></pre>
<h2>Usage</h2>
<h3>Configuration</h3>
<p>Arachnid has a ton of configuration options, which can be passed to the methods listed below in <a href="#crawling" target="_blank">Crawling</a> and to the constructor for <code><a href="Arachnid/Agent.html">Arachnid::Agent</a></code>. They are as follows:</p>
<ul><li><strong>read_timeout</strong> - Read timeout</li><li><strong>connect_timeout</strong> - Connect timeout</li><li><strong>max_redirects</strong> - Maximum number of redirects to follow</li><li><strong>do_not_track</strong> - Sets the DNT header</li><li><strong>default_headers</strong> - Default HTTP headers to use for all hosts</li><li><strong>host_header</strong> - HTTP host header to use</li><li><strong>host_headers</strong> - HTTP headers to use for specific hosts</li><li><strong>user_agent</strong> - User agent for the crawler</li><li><strong>referer</strong> - Referer to use</li><li><strong>fetch_delay</strong> - Delay between fetching resources</li><li><strong>queue</strong> - URLs to preload the queue with</li><li><strong>history</strong> - Links that should not be visited</li><li><strong>limit</strong> - Maximum number of resources to visit</li><li><strong>max_depth</strong> - Maximum crawl depth</li></ul>
<p>There are also a few class properties on <code><a href="Arachnid.html">Arachnid</a></code> itself, which are used as the defaults unless overridden.</p>
<ul><li><strong>do_not_track</strong></li><li><strong>max_redirects</strong></li><li><strong>connect_timeout</strong></li><li><strong>read_timeout</strong></li><li><strong>user_agent</strong></li></ul>
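<p>As a rough sketch of how these options might be supplied: the option and property names below come from the lists above, and <code>Arachnid.host</code> and <code>every_url</code> are used the way the rest of this README shows them. The value types (plain integers, a string user agent) and the existence of class-level setters are assumptions, so check <code><a href="Arachnid/Agent.html">Arachnid::Agent</a></code> before relying on them.</p>
<pre><code class="language-crystal">require "arachnid"

# Assumption: the class properties have setters and act as global defaults.
Arachnid.user_agent = "MyCrawler/1.0"
Arachnid.max_redirects = 5

# Assumption: per-crawl options are passed as named arguments and accept integers.
Arachnid.host("https://crystal-lang.org", limit: 25, max_depth: 2) do |spider|
  spider.every_url do |url|
    puts "Found #{url}"
  end
end</code></pre>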
<h3>Crawling</h3>
<p>Arachnid provides three interfaces for crawling:</p>
<h4><code>Arachnid.start_at(url, **options, &block : Agent ->)</code></h4>
<p><code>start_at</code> is what you want to use if you're going to be doing a full crawl of multiple sites. It doesn't filter any URLs by default and will scan every link it encounters.</p>
<h4><code>Arachnid.site(url, **options, &block : Agent ->)</code></h4>
<p><code>site</code> constrains the crawl to a specific site. "Site" in this case is defined as all paths within a domain and its subdomains.</p>
<h4><code>Arachnid.host(name, **options, &block : Agent ->)</code></h4>
<p><code>host</code> is similar to <code>site</code>, but stays within the given domain and does not crawl its subdomains.</p>
<p><em>Maybe <code>site</code> and <code>host</code> should be swapped? I don't know what is more intuitive.</em></p>
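<p>To make the difference between the two scopes concrete, here is a small sketch that uses only Crystal's standard library. The helpers <code>same_host?</code> and <code>same_site?</code> are hypothetical names made up for this illustration; they are not part of Arachnid and do not claim to mirror how it matches URLs internally.</p>
<pre><code class="language-crystal">require "uri"

# Hypothetical helpers, not part of Arachnid. They only illustrate the
# host-versus-site distinction described above.

# True when the URL's host is exactly `name` (the `host` scope).
def same_host?(url : String, name : String) : Bool
  URI.parse(url).host == name
end

# True when the URL's host is `domain` or one of its subdomains
# (the `site` scope).
def same_site?(url : String, domain : String) : Bool
  host = URI.parse(url).host
  return false if host.nil?
  host == domain || host.ends_with?(".#{domain}")
end

puts same_host?("https://crystal-lang.org/docs", "crystal-lang.org")   # => true
puts same_host?("https://forum.crystal-lang.org/", "crystal-lang.org") # => false
puts same_site?("https://forum.crystal-lang.org/", "crystal-lang.org") # => true
puts same_site?("https://example.com/", "crystal-lang.org")            # => false</code></pre>
<p>Under this reading, a crawl started with <code>host</code> would skip <code>forum.crystal-lang.org</code>, while one started with <code>site</code> would include it.</p>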
<h3>Crawling Rules</h3>
<p>Arachnid has the concept of <strong>filters</strong> for the purpose of filtering urls before visiting them. They are as follows:</p>
<ul><li><strong>hosts</strong></li><ul><li><a href="https://watzon.github.io/arachnid/Arachnid/Agent.html#visit_hosts_like%28pattern%29-instance-method" target="_blank">visit_hosts_like(pattern : String | Regex)</a></li><li><a href="https://watzon.github.io/arachnid/Arachnid/Agent.html#ignore_hosts_like%28pattern%29-instance-method" target="_blank">ignore_hosts_like(pattern : String | Regex)</a></li></ul><li><strong>ports</strong></li><ul><li><a href="https://watzon.github.io/arachnid/Arachnid/Agent.html#visit_ports_like%28pattern%29-instance-method" target="_blank">visit_ports_like(pattern : String | Regex)</a></li><li><a href="https://watzon.github.io/arachnid/Arachnid/Agent.html#ignore_ports_like%28pattern%29-instance-method" target="_blank">ignore_ports_like(pattern : String | Regex)</a></li></ul><li><strong>links</strong></li><ul><li><a href="https://watzon.github.io/arachnid/Arachnid/Agent.html#visit_links_like%28pattern%29-instance-method" target="_blank">visit_links_like(pattern : String | Regex)</a></li><li><a href="https://watzon.github.io/arachnid/Arachnid/Agent.html#ignore_links_like%28pattern%29-instance-method" target="_blank">ignore_links_like(pattern : String | Regex)</a></li></ul><li><strong>urls</strong></li><ul><li><a href="https://watzon.github.io/arachnid/Arachnid/Agent.html#visit_urls_like%28pattern%29-instance-method" target="_blank">visit_urls_like(pattern : String | Regex)</a></li><li><a href="https://watzon.github.io/arachnid/Arachnid/Agent.html#ignore_urls_like%28pattern%29-instance-method" target="_blank">ignore_urls_like(pattern : String | Regex)</a></li></ul><li><strong>exts</strong></li><ul><li><a href="https://watzon.github.io/arachnid/Arachnid/Agent.html#visit_exts_like%28pattern%29-instance-method" target="_blank">visit_exts_like(pattern : String | Regex)</a></li><li><a href="https://watzon.github.io/arachnid/Arachnid/Agent.html#ignore_exts_like%28pattern%29-instance-method" target="_blank">ignore_exts_like(pattern : String | Regex)</a></li></ul></ul>
<p>Each of these methods can also take a block instead of a pattern, where the block returns true or false. The only difference between <code>links</code> and <code>urls</code> is the block argument: <code>links</code> receives a <code>String</code> and <code>urls</code> a <code><a href="URI.html">URI</a></code>. Honestly I'll probably get rid of <code>links</code> soon and just make it <code>urls</code>.</p>
<p><code>exts</code> looks at the extension, if it exists, and filters based on that.</p>
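<p>A minimal sketch of the pattern and block forms side by side. The URL and patterns are made up for illustration, and how the filters interact when several are combined is not covered here.</p>
<pre><code class="language-crystal">require "arachnid"

Arachnid.site("https://crystal-lang.org") do |spider|
  # Pattern form: skip any url containing /api
  spider.ignore_urls_like(/\/api/)

  # Pattern form: skip common image extensions
  spider.ignore_exts_like(/png|jpe?g|gif/)

  # Block form: `urls` filters receive a URI and return true or false
  spider.ignore_urls_like { |uri| uri.to_s.includes?("/blog/") }

  # Block form: `links` filters receive a String instead
  spider.ignore_links_like { |link| link.ends_with?(".pdf") }

  spider.every_url { |url| puts url }
end</code></pre>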
<h3>Events</h3>
<p>Every crawled "page" is referred to as a resource, since sometimes they will be html/xml, sometimes javascript or css, and sometimes images, videos, zip files, etc. Every time a resource is scanned one of several events is called. They are:</p>
<h4><code>every_url(&block : <a href="URI.html">URI</a> ->)</code></h4>
<p>Pass each URL from each resource visited to the given block.</p>
<h4><code>every_failed_url(&block : <a href="URI.html">URI</a> ->)</code></h4>
<p>Pass each URL that could not be requested to the given block.</p>
<h4><code>every_url_like(pattern, &block : <a href="URI.html">URI</a> ->)</code></h4>
<p>Pass every URL that the agent visits, and matches a given pattern, to a given block.</p>
<h4><code>urls_like(pattern, &block : <a href="URI.html">URI</a> ->)</code></h4>
<p>Same as <code>every_url_like</code>.</p>
<h4><code>all_headers(&block : HTTP::Headers ->)</code></h4>
<p>Pass the headers from every response the agent receives to a given block.</p>
<h4><code>every_resource(&block : Resource ->)</code></h4>
<p>Pass every resource that the agent visits to a given block.</p>
<h4><code>every_ok_page(&block : Resource ->)</code></h4>
<p>Pass every OK resource that the agent visits to a given block.</p>
<h4><code>every_redirect_page(&block : Resource ->)</code></h4>
<p>Pass every Redirect resource that the agent visits to a given block.</p>
<h4><code>every_timedout_page(&block : Resource ->)</code></h4>
<p>Pass every Timeout resource that the agent visits to a given block.</p>
<h4><code>every_bad_request_page(&block : Resource ->)</code></h4>
<p>Pass every Bad Request resource that the agent visits to a given block.</p>
<h4><code>every_unauthorized_page(&block : Resource ->)</code></h4>
<p>Pass every Unauthorized resource that the agent visits to a given block.</p>
<h4><code>every_forbidden_page(&block : Resource ->)</code></h4>
<p>Pass every Forbidden resource that the agent visits to a given block.</p>
<h4><code>every_missing_page(&block : Resource ->)</code></h4>
<p>Pass every Missing resource that the agent visits to a given block.</p>
<h4><code>every_internal_server_error_page(&block : Resource ->)</code></h4>
<p>Pass every Internal Server Error resource that the agent visits to a given block.</p>
<h4><code>every_txt_page(&block : Resource ->)</code></h4>
<p>Pass every Plain Text resource that the agent visits to a given block.</p>
<h4><code>every_html_page(&block : Resource ->)</code></h4>
<p>Pass every HTML resource that the agent visits to a given block.</p>
<h4><code>every_xml_page(&block : Resource ->)</code></h4>
<p>Pass every XML resource that the agent visits to a given block.</p>
<h4><code>every_xsl_page(&block : Resource ->)</code></h4>
<p>Pass every XML Stylesheet (XSL) resource that the agent visits to a given block.</p>
<h4><code>every_doc(&block : Document::HTML | XML::Node ->)</code></h4>
<p>Pass every HTML or XML document that the agent parses to a given block.</p>
<h4><code>every_html_doc(&block : Document::HTML | XML::Node ->)</code></h4>
<p>Pass every HTML document that the agent parses to a given block.</p>
<h4><code>every_xml_doc(&block : XML::Node ->)</code></h4>
<p>Pass every XML document that the agent parses to a given block.</p>
<h4><code>every_xsl_doc(&block : XML::Node ->)</code></h4>
<p>Pass every XML Stylesheet (XSL) that the agent parses to a given block.</p>
<h4><code>every_rss_doc(&block : XML::Node ->)</code></h4>
<p>Pass every RSS document that the agent parses to a given block.</p>
<h4><code>every_atom_doc(&block : XML::Node ->)</code></h4>
<p>Pass every Atom document that the agent parses to a given block.</p>
<h4><code>every_javascript(&block : Resource ->)</code></h4>
<p>Pass every JavaScript resource that the agent visits to a given block.</p>
<h4><code>every_css(&block : Resource ->)</code></h4>
<p>Pass every CSS resource that the agent visits to a given block.</p>
<h4><code>every_rss(&block : Resource ->)</code></h4>
<p>Pass every RSS feed that the agent visits to a given block.</p>
<h4><code>every_atom(&block : Resource ->)</code></h4>
<p>Pass every Atom feed that the agent visits to a given block.</p>
<h4><code>every_ms_word(&block : Resource ->)</code></h4>
<p>Pass every MS Word resource that the agent visits to a given block.</p>
<h4><code>every_pdf(&block : Resource ->)</code></h4>
<p>Pass every PDF resource that the agent visits to a given block.</p>
<h4><code>every_zip(&block : Resource ->)</code></h4>
<p>Pass every ZIP resource that the agent visits to a given block.</p>
<h4><code>every_image(&block : Resource ->)</code></h4>
<p>Passes every image resource to the given block.</p>
<h4><code>every_content_type(content_type : String | Regex, &block : Resource ->)</code></h4>
<p>Passes every resource with a matching content type to the given block.</p>
<h4><code>every_link(&block : <a href="URI.html">URI</a>, <a href="URI.html">URI</a> ->)</code></h4>
<p>Passes every origin and destination URI of each link to a given block.</p>
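<p>To show how a few of these events might be combined in one crawl, here is a short sketch. The URL is a placeholder, and <code>page.url</code> is assumed to be available on <code>Resource</code>.</p>
<pre><code class="language-crystal">require "arachnid"

Arachnid.site("https://crystal-lang.org") do |spider|
  # Called for every HTML resource the agent visits
  spider.every_html_page do |page|
    puts "HTML page: #{page.url}"
  end

  # Called for every URL that could not be requested
  spider.every_failed_url do |url|
    puts "Failed: #{url}"
  end

  # Called with the origin and destination URI of each link
  spider.every_link do |origin, destination|
    puts "#{origin} -&gt; #{destination}"
  end
end</code></pre>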
<h3>Content Types</h3>
<p>Every resource has an associated content type and the <code>Resource</code> class itself provides several easy methods to check it. You can find all of them <a href="https://watzon.github.io/arachnid/Arachnid/Resource/ContentTypes.html" target="_blank">here</a>.</p>
<h3>Parsing HTML</h3>
<p>Every HTML/XML resource has full access to the suite of methods provided by <a href="https://github.com/madeindjs/Crystagiri/" target="_blank">Crystagiri</a>, which makes it easier to search by CSS selector.</p>
<h2>Contributing</h2>
<ol><li>Fork it (<a href="https://github.com/watzon/arachnid/fork" target="_blank">https://github.com/watzon/arachnid/fork</a>)</li><li>Create your feature branch (<code>git checkout -b my-new-feature</code>)</li><li>Commit your changes (<code>git commit -am 'Add some feature'</code>)</li><li>Push to the branch (<code>git push origin my-new-feature</code>)</li><li>Create a new Pull Request</li></ol>
<h2>Contributors</h2>
<ul><li><a href="https://github.com/watzon" target="_blank">Chris Watson</a> - creator and maintainer</li></ul>
</div>
</body>
</html>