2019-06-27 03:25:07 +00:00
<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta http-equiv = "X-UA-Compatible" content = "IE=edge" >
< meta name = "generator" content = "Crystal Docs 0.29.0" >
< link href = "css/style.css" rel = "stylesheet" type = "text/css" >
< script type = "text/javascript" src = "js/doc.js" > < / script >
< script type = "text/javascript" >
CrystalDoc.base_path = "";
< / script >
< meta id = "repository-name" content = "github.com/watzon/arachnid" >
< title > README - github.com/watzon/arachnid< / title >
< / head >
< body >
< div class = "sidebar" >
< div class = "sidebar-header" >
< div class = "search-box" >
< input type = "search" class = "search-input" placeholder = "Search..." spellcheck = "false" aria-label = "Search" >
< / div >
< div class = "repository-links" >
< a href = "index.html" > README< / a >
< / div >
< / div >
< div class = "search-results" class = "hidden" >
< ul class = "search-list" > < / ul >
< / div >
< div class = "types-list" >
< ul >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid" data-name = "arachnid" >
< a href = "Arachnid.html" > Arachnid< / a >
< ul >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Agent" data-name = "arachnid::agent" >
< a href = "Arachnid/Agent.html" > Agent< / a >
< ul >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Agent/Actions" data-name = "arachnid::agent::actions" >
< a href = "Arachnid/Agent/Actions.html" > Actions< / a >
< ul >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Agent/Actions/Action" data-name = "arachnid::agent::actions::action" >
< a href = "Arachnid/Agent/Actions/Action.html" > Action< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Agent/Actions/Paused" data-name = "arachnid::agent::actions::paused" >
< a href = "Arachnid/Agent/Actions/Paused.html" > Paused< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Agent/Actions/RuntimeError" data-name = "arachnid::agent::actions::runtimeerror" >
< a href = "Arachnid/Agent/Actions/RuntimeError.html" > RuntimeError< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Agent/Actions/SkipLink" data-name = "arachnid::agent::actions::skiplink" >
< a href = "Arachnid/Agent/Actions/SkipLink.html" > SkipLink< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Agent/Actions/SkipResource" data-name = "arachnid::agent::actions::skipresource" >
< a href = "Arachnid/Agent/Actions/SkipResource.html" > SkipResource< / a >
< / li >
< / ul >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Agent/Queue" data-name = "arachnid::agent::queue" >
< a href = "Arachnid/Agent/Queue.html" > Queue< / a >
< / li >
< / ul >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/AuthCredential" data-name = "arachnid::authcredential" >
< a href = "Arachnid/AuthCredential.html" > AuthCredential< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/AuthStore" data-name = "arachnid::authstore" >
< a href = "Arachnid/AuthStore.html" > AuthStore< / a >
< / li >
2019-06-30 23:30:15 +00:00
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Cli" data-name = "arachnid::cli" >
< a href = "Arachnid/Cli.html" > Cli< / a >
< ul >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Action" data-name = "arachnid::cli::action" >
< a href = "Arachnid/Cli/Action.html" > Action< / a >
< / li >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library" data-name = "arachnid::cli::command_main_command_of_clim_library" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library.html" > Command_Main_command_of_clim_library< / a >
< ul >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap" data-name = "arachnid::cli::command_main_command_of_clim_library::command_sitemap" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap.html" > Command_Sitemap< / a >
< ul >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap" data-name = "arachnid::cli::command_main_command_of_clim_library::command_sitemap::options_sitemap" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap.html" > Options_Sitemap< / a >
< ul >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_help" data-name = "arachnid::cli::command_main_command_of_clim_library::command_sitemap::options_sitemap::option_help" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_help.html" > Option_help< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_json" data-name = "arachnid::cli::command_main_command_of_clim_library::command_sitemap::options_sitemap::option_json" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_json.html" > Option_json< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_output" data-name = "arachnid::cli::command_main_command_of_clim_library::command_sitemap::options_sitemap::option_output" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_output.html" > Option_output< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_xml" data-name = "arachnid::cli::command_main_command_of_clim_library::command_sitemap::options_sitemap::option_xml" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/Options_Sitemap/Option_xml.html" > Option_xml< / a >
< / li >
< / ul >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/RunProc" data-name = "arachnid::cli::command_main_command_of_clim_library::command_sitemap::runproc" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Sitemap/RunProc.html" > RunProc< / a >
< / li >
< / ul >
< / li >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize" data-name = "arachnid::cli::command_main_command_of_clim_library::command_summarize" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize.html" > Command_Summarize< / a >
< ul >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize" data-name = "arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize.html" > Options_Summarize< / a >
< ul >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_codes" data-name = "arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_codes" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_codes.html" > Option_codes< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_elinks" data-name = "arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_elinks" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_elinks.html" > Option_elinks< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_help" data-name = "arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_help" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_help.html" > Option_help< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_ilinks" data-name = "arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_ilinks" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_ilinks.html" > Option_ilinks< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_limit" data-name = "arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_limit" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_limit.html" > Option_limit< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_output" data-name = "arachnid::cli::command_main_command_of_clim_library::command_summarize::options_summarize::option_output" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/Options_Summarize/Option_output.html" > Option_output< / a >
< / li >
< / ul >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/RunProc" data-name = "arachnid::cli::command_main_command_of_clim_library::command_summarize::runproc" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Command_Summarize/RunProc.html" > RunProc< / a >
< / li >
< / ul >
< / li >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library" data-name = "arachnid::cli::command_main_command_of_clim_library::options_main_command_of_clim_library" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library.html" > Options_Main_command_of_clim_library< / a >
< ul >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library/Option_help" data-name = "arachnid::cli::command_main_command_of_clim_library::options_main_command_of_clim_library::option_help" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library/Option_help.html" > Option_help< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library/Option_version" data-name = "arachnid::cli::command_main_command_of_clim_library::options_main_command_of_clim_library::option_version" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/Options_Main_command_of_clim_library/Option_version.html" > Option_version< / a >
< / li >
< / ul >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Command_Main_command_of_clim_library/RunProc" data-name = "arachnid::cli::command_main_command_of_clim_library::runproc" >
< a href = "Arachnid/Cli/Command_Main_command_of_clim_library/RunProc.html" > RunProc< / a >
< / li >
< / ul >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Count" data-name = "arachnid::cli::count" >
< a href = "Arachnid/Cli/Count.html" > Count< / a >
< / li >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Sitemap" data-name = "arachnid::cli::sitemap" >
< a href = "Arachnid/Cli/Sitemap.html" > Sitemap< / a >
< ul >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Sitemap/LastMod" data-name = "arachnid::cli::sitemap::lastmod" >
< a href = "Arachnid/Cli/Sitemap/LastMod.html" > LastMod< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Cli/Sitemap/PageMap" data-name = "arachnid::cli::sitemap::pagemap" >
< a href = "Arachnid/Cli/Sitemap/PageMap.html" > PageMap< / a >
< / li >
< / ul >
< / li >
< / ul >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/CookieJar" data-name = "arachnid::cookiejar" >
< a href = "Arachnid/CookieJar.html" > CookieJar< / a >
< / li >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Document" data-name = "arachnid::document" >
< a href = "Arachnid/Document.html" > Document< / a >
< ul >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Document/HTML" data-name = "arachnid::document::html" >
< a href = "Arachnid/Document/HTML.html" > HTML< / a >
< ul >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Document/HTML/Tag" data-name = "arachnid::document::html::tag" >
< a href = "Arachnid/Document/HTML/Tag.html" > Tag< / a >
< / li >
< / ul >
< / li >
< / ul >
< / li >
< li class = "parent " data-id = "github.com/watzon/arachnid/Arachnid/Resource" data-name = "arachnid::resource" >
< a href = "Arachnid/Resource.html" > Resource< / a >
< ul >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Resource/ContentTypes" data-name = "arachnid::resource::contenttypes" >
< a href = "Arachnid/Resource/ContentTypes.html" > ContentTypes< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Resource/Cookies" data-name = "arachnid::resource::cookies" >
< a href = "Arachnid/Resource/Cookies.html" > Cookies< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Resource/HTML" data-name = "arachnid::resource::html" >
< a href = "Arachnid/Resource/HTML.html" > HTML< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Resource/StatusCodes" data-name = "arachnid::resource::statuscodes" >
< a href = "Arachnid/Resource/StatusCodes.html" > StatusCodes< / a >
< / li >
< / ul >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/Rules" data-name = "arachnid::rules(t)" >
< a href = "Arachnid/Rules.html" > Rules< / a >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/Arachnid/SessionCache" data-name = "arachnid::sessioncache" >
< a href = "Arachnid/SessionCache.html" > SessionCache< / a >
< / li >
< / ul >
< / li >
< li class = " " data-id = "github.com/watzon/arachnid/URI" data-name = "uri" >
< a href = "URI.html" > URI< / a >
< / li >
< / ul >
< / div >
< / div >
< div class = "main-content" >
< h1 > Arachnid< / h1 >
< p > Arachnid is a fast and powerful web scraping framework for Crystal. It provides an easy-to-use DSL for scraping webpages and processing whatever you come across along the way.< / p >
< ul > < li > < a href = "#Arachnid" target = "_blank" > Arachnid< / a > < / li > < ul > < li > < a href = "#Installation" target = "_blank" > Installation< / a > < / li > < li > < a href = "#Examples" target = "_blank" > Examples< / a > < / li > < li > < a href = "#Usage" target = "_blank" > Usage< / a > < / li > < li > < a href = "#Configuration" target = "_blank" > Configuration< / a > < / li > < li > < a href = "#Crawling" target = "_blank" > Crawling< / a > < / li > < li > < a href = "#Arachnidstartaturl-options-block--Agent" target = "_blank" > < code > Arachnid#start_at(url, **options, & block : Agent ->)< / code > < / a > < / li > < li > < a href = "#Arachnidsiteurl-options-block--Agent" target = "_blank" > < code > Arachnid#site(url, **options, & block : Agent ->)< / code > < / a > < / li > < li > < a href = "#Arachnidhostname-options-block--Agent" target = "_blank" > < code > Arachnid#host(name, **options, & block : Agent ->)< / code > < / a > < / li > < li > < a href = "#Crawling-Rules" target = "_blank" > Crawling Rules< / a > < / li > < li > < a href = "#Events" target = "_blank" > Events< / a > < / li > < li > < a href = "#everyurlblock--URI" target = "_blank" > < code > every_url(& block : URI ->)< / code > < / a > < / li > < li > < a href = "#everyfailedurlblock--URI" target = "_blank" > < code > every_failed_url(& block : URI ->)< / code > < / a > < / li > < li > < a href = "#everyurllikepattern-block--URI" target = "_blank" > < code > every_url_like(pattern, & block : URI ->)< / code > < / a > < / li > < li > < a href = "#urlslikepattern-block--URI" target = "_blank" > < code > urls_like(pattern, & block : URI ->)< / code > < / a > < / li > < li > < a href = "#allheadersblock--HTTPHeaders" target = "_blank" > < code > all_headers(& block : HTTP::Headers)< / code > < / a > < / li > < li > < a href = "#everyresourceblock--Resource" target = "_blank" > < code > every_resource(& block : Resource ->)< / code > < / a > < / li > < li > < a href = 
"#everyokpageblock--Resource" target = "_blank" > < code > every_ok_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everyredirectpageblock--Resource" target = "_blank" > < code > every_redirect_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everytimedoutpageblock--Resource" target = "_blank" > < code > every_timedout_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everybadrequestpageblock--Resource" target = "_blank" > < code > every_bad_request_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#def-everyunauthorizedpageblock--Resource" target = "_blank" > < code > def every_unauthorized_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everyforbiddenpageblock--Resource" target = "_blank" > < code > every_forbidden_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everymissingpageblock--Resource" target = "_blank" > < code > every_missing_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everyinternalservererrorpageblock--Resource" target = "_blank" > < code > every_internal_server_error_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everytxtpageblock--Resource" target = "_blank" > < code > every_txt_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everyhtmlpageblock--Resource" target = "_blank" > < code > every_html_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everyxmlpageblock--Resource" target = "_blank" > < code > every_xml_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everyxslpageblock--Resource" target = "_blank" > < code > every_xsl_page(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everydocblock--DocumentHTML--XMLNode" target = "_blank" > < code > every_doc(& block : Document::HTML | XML::Node ->)< / code > < / a > < / li > < li > < a href = 
"#everyhtmldocblock--DocumentHTML--XMLNode" target = "_blank" > < code > every_html_doc(& block : Document::HTML | XML::Node ->)< / code > < / a > < / li > < li > < a href = "#everyxmldocblock--XMLNode" target = "_blank" > < code > every_xml_doc(& block : XML::Node ->)< / code > < / a > < / li > < li > < a href = "#everyxsldocblock--XMLNode" target = "_blank" > < code > every_xsl_doc(& block : XML::Node ->)< / code > < / a > < / li > < li > < a href = "#everyrssdocblock--XMLNode" target = "_blank" > < code > every_rss_doc(& block : XML::Node ->)< / code > < / a > < / li > < li > < a href = "#everyatomdocblock--XMLNode" target = "_blank" > < code > every_atom_doc(& block : XML::Node ->)< / code > < / a > < / li > < li > < a href = "#everyjavascriptblock--Resource" target = "_blank" > < code > every_javascript(& block : Resource ->)< / code > < / a > < / li > < li > < a href = "#everycssblock--Resource" target = "_blank" > < code > every_css(& block : Resource ->)< / code > < / a > < / li > < li > < a
< h2 > Installation< / h2 >
< ol > < li > Add the dependency to your < code > shard.yml< / code > :
< pre > < code class = "language-yaml" > dependencies:
  arachnid:
    github: watzon/arachnid
    version: ~> 0.1.0< / code > < / pre > < / li >
< li > Run < code > shards install< / code > < / li > < / ol >
< h2 > Examples< / h2 >
< p > Arachnid provides an easy-to-use, powerful DSL for scraping websites.< / p >
< pre > < code class = "language-crystal" > < span class = "k" > require< / span > < span class = "s" > "arachnid"< / span >
< span class = "k" > require< / span > < span class = "s" > "json"< / span >
< span class = "c" > # Let's build a sitemap of crystal-lang.org< / span >
< span class = "c" > # Links will be a hash of url to resource title< / span >
links < span class = "o" > =< / span > {} < span class = "k" > of< / span > < span class = "t" > String< / span > => < span class = "t" > String< / span >
< span class = "c" > # Visit a particular host, in this case `crystal-lang.org`. This will< / span >
< span class = "c" > # not match on subdomains.< / span >
< span class = "t" > Arachnid< / span > .host(< span class = "s" > "https://crystal-lang.org"< / span > ) < span class = "k" > do< / span > < span class = "o" > |< / span > spider< span class = "o" > |< / span >
  < span class = "c" > # Ignore the API section. It's a little big.< / span >
  spider.ignore_urls_like(< span class = "s" > /\/(api)\//< / span > )
  spider.every_html_page < span class = "k" > do< / span > < span class = "o" > |< / span > page< span class = "o" > |< / span >
    puts < span class = "s" > "Visiting < / span > < span class = "i" > #{< / span > page.url.to_s< span class = "i" > }< / span > < span class = "s" > "< / span >
    < span class = "c" > # Ignore redirects for our sitemap< / span >
    < span class = "k" > unless< / span > page.redirect?
      < span class = "c" > # Add the url of every visited page to our sitemap< / span >
      links[page.url.to_s] < span class = "o" > =< / span > page.title.to_s.strip
    < span class = "k" > end< / span >
  < span class = "k" > end< / span >
< span class = "k" > end< / span >
< span class = "t" > File< / span > .write(< span class = "s" > "crystal-lang.org-sitemap.json"< / span > , links.to_pretty_json)< / code > < / pre >
< p > Want to scan external links as well?< / p >
< pre > < code class = "language-crystal" > < span class = "c" > # To make things interesting, this time let's download< / span >
< span class = "c" > # every image we find.< / span >
< span class = "t" > Arachnid< / span > .start_at(< span class = "s" > "https://crystal-lang.org"< / span > ) < span class = "k" > do< / span > < span class = "o" > |< / span > spider< span class = "o" > |< / span >
  < span class = "c" > # Set a base path to store all the images at< / span >
  base_image_dir < span class = "o" > =< / span > < span class = "t" > File< / span > .expand_path(< span class = "s" > "~/Pictures/arachnid"< / span > )
  < span class = "t" > Dir< / span > .mkdir_p(base_image_dir)
  < span class = "c" > # You could also use `every_image`. Using `every_resource`< / span >
  < span class = "c" > # lets us track the crawler's progress, though.< / span >
  spider.every_resource < span class = "k" > do< / span > < span class = "o" > |< / span > resource< span class = "o" > |< / span >
    puts < span class = "s" > "Scanning < / span > < span class = "i" > #{< / span > resource.url.to_s< span class = "i" > }< / span > < span class = "s" > "< / span >
    < span class = "k" > if< / span > resource.image?
      < span class = "c" > # Since we're going to be saving a lot of images< / span >
      < span class = "c" > # let's spawn a new fiber for each one. This< / span >
      < span class = "c" > # makes things so much faster.< / span >
      spawn < span class = "k" > do< / span >
        < span class = "c" > # Output directory for images for this host< / span >
        directory < span class = "o" > =< / span > < span class = "t" > File< / span > .join(base_image_dir, resource.url.host.to_s)
        < span class = "t" > Dir< / span > .mkdir_p(directory)
        < span class = "c" > # The name of the image< / span >
        filename < span class = "o" > =< / span > < span class = "t" > File< / span > .basename(resource.url.path)
        < span class = "c" > # Save the image using the body of the resource< / span >
        puts < span class = "s" > "Saving < / span > < span class = "i" > #{< / span > filename< span class = "i" > }< / span > < span class = "s" > to < / span > < span class = "i" > #{< / span > directory< span class = "i" > }< / span > < span class = "s" > "< / span >
        < span class = "t" > File< / span > .write(< span class = "t" > File< / span > .join(directory, filename), resource.body)
      < span class = "k" > end< / span >
    < span class = "k" > end< / span >
  < span class = "k" > end< / span >
< span class = "k" > end< / span > < / code > < / pre >
< h2 > Usage< / h2 >
< h3 > Configuration< / h3 >
< p > Arachnid has many configuration options, which can be passed to the methods listed below in < a href = "#crawling" target = "_blank" > Crawling< / a > and to the constructor for < code > < a href = "Arachnid/Agent.html" > Arachnid::Agent< / a > < / code > . They are as follows:< / p >
< ul > < li > < strong > read_timeout< / strong > - Read timeout< / li > < li > < strong > connect_timeout< / strong > - Connect timeout< / li > < li > < strong > max_redirects< / strong > - Maximum amount of redirects to follow< / li > < li > < strong > do_not_track< / strong > - Sets the DNT header< / li > < li > < strong > default_headers< / strong > - Default HTTP headers to use for all hosts< / li > < li > < strong > host_header< / strong > - HTTP host header to use< / li > < li > < strong > host_headers< / strong > - HTTP headers to use for specific hosts< / li > < li > < strong > user_agent< / strong > - sets the user agent for the crawler< / li > < li > < strong > referer< / strong > - Referer to use< / li > < li > < strong > fetch_delay< / strong > - Delay in between fetching resources< / li > < li > < strong > queue< / strong > - Preload the queue with urls< / li > < li > < strong > history< / strong > - Links that should not be visited< / li > < li > < strong > limit< / strong > - Maximum number of resources to visit< / li > < li > < strong > max_depth< / strong > - Maximum crawl depth< / li > < / ul >
< p > There are also a few class properties on < code > < a href = "Arachnid.html" > Arachnid< / a > < / code > itself which are used as the defaults unless overridden.< / p >
< ul > < li > < strong > do_not_track< / strong > < / li > < li > < strong > max_redirects< / strong > < / li > < li > < strong > connect_timeout< / strong > < / li > < li > < strong > read_timeout< / strong > < / li > < li > < strong > user_agent< / strong > < / li > < / ul >
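< p > As a rough sketch of how these options fit together, the snippet below sets two class-level defaults and then passes per-crawl options to < code > start_at< / code > . The URL, the user agent string, and the values are made up for illustration, and it assumes that < code > max_redirects< / code > , < code > limit< / code > and < code > max_depth< / code > take integers, which the list above does not state.< / p >

```crystal
require "arachnid"

# Class-level defaults, used by every crawl unless overridden.
Arachnid.user_agent = "my-crawler/0.1" # hypothetical user agent string
Arachnid.max_redirects = 5             # assumed to be an integer count

# Per-crawl options are passed as named arguments (**options).
Arachnid.start_at("https://example.com", limit: 100, max_depth: 2) do |spider|
  spider.every_url do |url|
    puts url
  end
end
```

< p > Options given to a crawl method apply to that crawl only, while the class properties act as fallbacks for anything left unset.< / p >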
< h3 > Crawling< / h3 >
< p > Arachnid provides 3 interfaces to use for crawling:< / p >
< h4 > < code > Arachnid#start_at(url, **options, & block : Agent ->)< / code > < / h4 >
< p > < code > start_at< / code > is what you want to use if you're going to be doing a full crawl of multiple sites. It doesn't filter any urls by default and will scan every link it encounters.< / p >
< h4 > < code > Arachnid#site(url, **options, & block : Agent ->)< / code > < / h4 >
< p > < code > site< / code > constrains the crawl to a specific site. "Site" in this case is defined as all paths within a domain and its subdomains.< / p >
< h4 > < code > Arachnid#host(name, **options, & block : Agent ->)< / code > < / h4 >
< p > < code > host< / code > is similar to site, but stays within the domain, not crawling subdomains.< / p >
< p > < em > Maybe < code > site< / code > and < code > host< / code > should be swapped? I don't know what is more intuitive.< / em > < / p >
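< p > To make the difference between the entry points concrete, here is a minimal sketch that runs a < code > site< / code > crawl followed by a < code > start_at< / code > crawl. The URL is a placeholder, and < code > limit< / code > is assumed to take an integer; it is only there to keep the unfiltered crawl small.< / p >

```crystal
require "arachnid"

# Stay on example.com and its subdomains.
Arachnid.site("https://example.com") do |spider|
  spider.every_url { |url| puts "site: #{url}" }
end

# Follow every link encountered, wherever it leads. The `limit` option
# (see Configuration) caps the number of resources visited.
Arachnid.start_at("https://example.com", limit: 25) do |spider|
  spider.every_url { |url| puts "start_at: #{url}" }
end
```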
< h3 > Crawling Rules< / h3 >
< p > Arachnid has the concept of < strong > filters< / strong > for the purpose of filtering urls before visiting them. They are as follows:< / p >
< ul > < li > < strong > hosts< / strong > < / li > < ul > < li > < a href = "https://watzon.github.io/arachnid/Arachnid/Agent.html#visit_hosts_like%28pattern%29-instance-method" target = "_blank" > visit_hosts_like(pattern : String | Regex)< / a > < / li > < li > < a href = "https://watzon.github.io/arachnid/Arachnid/Agent.html#ignore_hosts_like%28pattern%29-instance-method" target = "_blank" > ignore_hosts_like(pattern : String | Regex)< / a > < / li > < / ul > < li > < strong > ports< / strong > < / li > < ul > < li > < a href = "https://watzon.github.io/arachnid/Arachnid/Agent.html#visit_ports_like%28pattern%29-instance-method" target = "_blank" > visit_ports_like(pattern : String | Regex)< / a > < / li > < li > < a href = "https://watzon.github.io/arachnid/Arachnid/Agent.html#ignore_ports_like%28pattern%29-instance-method" target = "_blank" > ignore_ports_like(pattern : String | Regex)< / a > < / li > < / ul > < li > < strong > links< / strong > < / li > < ul > < li > < a href = "https://watzon.github.io/arachnid/Arachnid/Agent.html#visit_links_like%28pattern%29-instance-method" target = "_blank" > visit_links_like(pattern : String | Regex)< / a > < / li > < li > < a href = "https://watzon.github.io/arachnid/Arachnid/Agent.html#ignore_links_like%28pattern%29-instance-method" target = "_blank" > ignore_links_like(pattern : String | Regex)< / a > < / li > < / ul > < li > < strong > urls< / strong > < / li > < ul > < li > < a href = "https://watzon.github.io/arachnid/Arachnid/Agent.html#visit_urls_like%28pattern%29-instance-method" target = "_blank" > visit_urls_like(pattern : String | Regex)< / a > < / li > < li > < a href = "https://watzon.github.io/arachnid/Arachnid/Agent.html#ignore_urls_like%28pattern%29-instance-method" target = "_blank" > ignore_urls_like(pattern : String | Regex)< / a > < / li > < / ul > < li > < strong > exts< / strong > < / li > < ul > < li > < a href = "https://watzon.github.io/arachnid/Arachnid/Agent.html#visit_exts_like%28pattern%29-instance-method" target = "_blank" > visit_exts_like(pattern : String | Regex)< / a > < / li > < li > < a href = "https://watzon.github.io/arachnid/Arachnid/Agent.html#ignore_exts_like%28pattern%29-instance-method" target = "_blank" > ignore_exts_like(pattern : String | Regex)< / a > < / li > < / ul > < / ul >
< p > All of these methods have the ability to also take a block instead of a pattern, where the block returns true or false. The only difference between < code > links< / code > and < code > urls< / code > in this case is with the block argument. < code > links< / code > receives a < code > String< / code > and < code > urls< / code > a < code > < a href = "URI.html" > URI< / a > < / code > . Honestly I'll probably get rid of < code > links< / code > soon and just make it < code > urls< / code > .< / p >
< p > < code > exts< / code > looks at the extension, if it exists, and filters based on that.< / p >
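< p > The two filter styles described above can be sketched as follows: one filter takes a < code > Regex< / code > pattern and the other takes a block that receives a < code > URI< / code > and returns true or false. The URL, the patterns, and the < code > api_path?< / code > helper are hypothetical and only serve the example.< / p >

```crystal
require "arachnid"
require "uri"

# Hypothetical helper: true for any URL whose path starts with /api.
def api_path?(uri : URI) : Bool
  uri.path.to_s.starts_with?("/api")
end

Arachnid.site("https://example.com") do |spider|
  # Pattern form: skip hosts that match a Regex.
  spider.ignore_hosts_like(/^ads\./)

  # Block form: `urls` filters receive a URI and return true or false.
  spider.ignore_urls_like { |uri| api_path?(uri) }

  spider.every_url { |url| puts url }
end
```

< p > Keeping the predicate in a small method of its own makes it easy to check against sample URLs without running a crawl.< / p >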
< h3 > Events< / h3 >
< p > Every crawled "page" is referred to as a resource, since sometimes they will be HTML/XML, sometimes JavaScript or CSS, and sometimes images, videos, zip files, etc. Every time a resource is scanned, one of several events is called. They are:< / p >
< h4 > < code > every_url(& block : < a href = "URI.html" > URI< / a > ->)< / code > < / h4 >
< p > Pass each URL from each resource visited to the given block.< / p >
< h4 > < code > every_failed_url(& block : < a href = "URI.html" > URI< / a > ->)< / code > < / h4 >
< p > Pass each URL that could not be requested to the given block.< / p >
< h4 > < code > every_url_like(pattern, & block : < a href = "URI.html" > URI< / a > ->)< / code > < / h4 >
< p > Pass every URL that the agent visits, and matches a given pattern, to a given block.< / p >
< h4 > < code > urls_like(pattern, & block : < a href = "URI.html" > URI< / a > ->)< / code > < / h4 >
< p > Same as < code > every_url_like< / code > .< / p >
< h4 > < code > all_headers(& block : HTTP::Headers ->)< / code > < / h4 >
< p > Pass the headers from every response the agent receives to a given block.< / p >
< h4 > < code > every_resource(& block : Resource ->)< / code > < / h4 >
< p > Pass every resource that the agent visits to a given block.< / p >
< h4 > < code > every_ok_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every OK resource that the agent visits to a given block.< / p >
< h4 > < code > every_redirect_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every Redirect resource that the agent visits to a given block.< / p >
< h4 > < code > every_timedout_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every Timeout resource that the agent visits to a given block.< / p >
< h4 > < code > every_bad_request_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every Bad Request resource that the agent visits to a given block.< / p >
< h4 > < code > every_unauthorized_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every Unauthorized resource that the agent visits to a given block.< / p >
< h4 > < code > every_forbidden_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every Forbidden resource that the agent visits to a given block.< / p >
< h4 > < code > every_missing_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every Missing resource that the agent visits to a given block.< / p >
< h4 > < code > every_internal_server_error_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every Internal Server Error resource that the agent visits to a given block.< / p >
< h4 > < code > every_txt_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every Plain Text resource that the agent visits to a given block.< / p >
< h4 > < code > every_html_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every HTML resource that the agent visits to a given block.< / p >
< h4 > < code > every_xml_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every XML resource that the agent visits to a given block.< / p >
< h4 > < code > every_xsl_page(& block : Resource ->)< / code > < / h4 >
< p > Pass every XML Stylesheet (XSL) resource that the agent visits to a given block.< / p >
< h4 > < code > every_doc(& block : Document::HTML | XML::Node ->)< / code > < / h4 >
< p > Pass every HTML or XML document that the agent parses to a given block.< / p >
< h4 > < code > every_html_doc(& block : Document::HTML | XML::Node ->)< / code > < / h4 >
< p > Pass every HTML document that the agent parses to a given block.< / p >
< h4 > < code > every_xml_doc(& block : XML::Node ->)< / code > < / h4 >
< p > Pass every XML document that the agent parses to a given block.< / p >
< h4 > < code > every_xsl_doc(& block : XML::Node ->)< / code > < / h4 >
< p > Pass every XML Stylesheet (XSL) that the agent parses to a given block.< / p >
< h4 > < code > every_rss_doc(& block : XML::Node ->)< / code > < / h4 >
< p > Pass every RSS document that the agent parses to a given block.< / p >
< h4 > < code > every_atom_doc(& block : XML::Node ->)< / code > < / h4 >
< p > Pass every Atom document that the agent parses to a given block.< / p >
< h4 > < code > every_javascript(& block : Resource ->)< / code > < / h4 >
< p > Pass every JavaScript resource that the agent visits to a given block.< / p >
< h4 > < code > every_css(& block : Resource ->)< / code > < / h4 >
< p > Pass every CSS resource that the agent visits to a given block.< / p >
< h4 > < code > every_rss(& block : Resource ->)< / code > < / h4 >
< p > Pass every RSS feed that the agent visits to a given block.< / p >
< h4 > < code > every_atom(& block : Resource ->)< / code > < / h4 >
< p > Pass every Atom feed that the agent visits to a given block.< / p >
< h4 > < code > every_ms_word(& block : Resource ->)< / code > < / h4 >
< p > Pass every MS Word resource that the agent visits to a given block.< / p >
< h4 > < code > every_pdf(& block : Resource ->)< / code > < / h4 >
< p > Pass every PDF resource that the agent visits to a given block.< / p >
< h4 > < code > every_zip(& block : Resource ->)< / code > < / h4 >
< p > Pass every ZIP resource that the agent visits to a given block.< / p >
< h4 > < code > every_image(& block : Resource ->)< / code > < / h4 >
< p > Passes every image resource to the given block.< / p >
< h4 > < code > every_content_type(content_type : String | Regex, & block : Resource ->)< / code > < / h4 >
< p > Passes every resource with a matching content type to the given block.< / p >
< h4 > < code > every_link(& block : < a href = "URI.html" > URI< / a > , < a href = "URI.html" > URI< / a > ->)< / code > < / h4 >
< p > Passes every origin and destination URI of each link to a given block.< / p >
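< p > As an illustration of how several of these events can be combined, the sketch below counts HTML pages, records failed URLs, and prints each link. The URL and the < code > limit< / code > value are placeholders, and the final summary assumes the crawl has finished by the time the < code > site< / code > call returns.< / p >

```crystal
require "arachnid"

failed = [] of String # URLs that could not be requested
html_pages = 0        # number of HTML resources visited

Arachnid.site("https://example.com", limit: 50) do |spider|
  spider.every_html_page do |page|
    html_pages += 1
  end

  spider.every_failed_url do |url|
    failed << url.to_s
  end

  # Receives the origin and destination URI of each link.
  spider.every_link do |origin, dest|
    puts "#{origin} -> #{dest}"
  end
end

puts "HTML pages: #{html_pages}, failed URLs: #{failed.size}"
```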
< h3 > Content Types< / h3 >
< p > Every resource has an associated content type and the < code > Resource< / code > class itself provides several easy methods to check it. You can find all of them < a href = "https://watzon.github.io/arachnid/Arachnid/Resource/ContentTypes.html" target = "_blank" > here< / a > .< / p >
< h3 > Parsing HTML< / h3 >
< p > Every HTML/XML resource has full access to the suite of methods provided by < a href = "https://github.com/madeindjs/Crystagiri/" target = "_blank" > Crystagiri< / a > , allowing you to more easily search by CSS selector.< / p >
< h2 > Contributing< / h2 >
< ol > < li > Fork it (< a href = "https://github.com/watzon/arachnid/fork" target = "_blank" > https://github.com/watzon/arachnid/fork< / a > )< / li > < li > Create your feature branch (< code > git checkout -b my-new-feature< / code > )< / li > < li > Commit your changes (< code > git commit -am 'Add some feature'< / code > )< / li > < li > Push to the branch (< code > git push origin my-new-feature< / code > )< / li > < li > Create a new Pull Request< / li > < / ol >
< h2 > Contributors< / h2 >
< ul > < li > < a href = "https://github.com/watzon" target = "_blank" > Chris Watson< / a > - creator and maintainer< / li > < / ul >
< / div >
< / body >
< / html >