diff --git a/README.md b/README.md index 69bd7f0..48e70b5 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ -# arachnid +# Arachnid -TODO: Write a description here +Arachnid is a fast, soon-to-be multithreading-capable web crawler for Crystal. It recently underwent a full rewrite for Crystal 0.35.1, so see the documentation below for updated usage instructions. ## Installation @@ -9,26 +9,152 @@ TODO: Write a description here ```yaml dependencies: arachnid: - github: your-github-user/arachnid + github: watzon/arachnid ``` 2. Run `shards install` ## Usage +First, of course, you need to require arachnid in your project: + ```crystal require "arachnid" ``` -TODO: Write usage instructions here +### The Agent -## Development +`Agent` is the class that does all the heavy lifting and will be the main one you interact with. To create a new `Agent`, use `Agent.new`. -TODO: Write development instructions here +```crystal +agent = Arachnid::Agent.new +``` + +The initialize method takes a bunch of optional parameters: + +#### `:client` + +You can, if you wish, supply your own `HTTP::Client` subclass (the class itself, not an instance) to the `Agent`. This can be useful if you want to use a proxy, provided the proxy client extends `HTTP::Client`. + +#### `:user_agent` + +The user agent to be added to every request header. You can override this on a per-host basis with either `:host_headers` or `:default_headers`. + +#### `:default_headers` + +The default headers to be used in every request. + +#### `:host_headers` + +Headers to be applied on a per-host basis. This is a hash `String (host name) => HTTP::Headers`. + +#### `:queue` + +The `Arachnid::Queue` instance to use for storing links waiting to be processed. The default is a `Queue::Memory` (which is the only one for now), but you can easily implement your own `Queue` using whatever you want as a backend. + +#### `:stop_on_empty` + +Whether or not to stop running when the queue is empty. This is true by default. 
If it's set to false, the loop will continue when the queue empties, so be sure you have a way to keep adding items to the queue. + +#### `:follow_redirects` + +Whether or not to follow redirects (add them to the queue). + +### Starting the Agent + +There are four ways to start your Agent once it's been created. Here are some examples: + +#### `#start_at` + +`#start_at` starts the Agent running on a particular URL. It adds a single URL to the queue and starts there. + +```crystal +agent.start_at("https://crystal-lang.org") do + # ... +end +``` + +#### `#site` + +`#site` starts the agent running at the given URL and adds a rule that keeps the agent restricted to the given site. This allows the agent to scan the given domain and any subdomains. For instance: + +```crystal +agent.site("https://crystal-lang.org") do + # ... +end +``` + +The above will match `crystal-lang.org` and `forum.crystal-lang.org`, but not `github.com/crystal-lang` or any other site not within the `*.crystal-lang.org` space. + +#### `#host` + +`#host` is like `#site`, but with the added restriction of remaining on the exact host given. Subdomains are not included. + +```crystal +agent.host("crystal-lang.org") do + # ... +end +``` + +#### `#start` + +Provided you already have URIs in the queue ready to be scanned, you can also just use `#start` to start the Agent running. + +```crystal +agent.enqueue("https://crystal-lang.org") +agent.enqueue("https://kemalcr.com") +agent.start +``` + +### Filters + +URIs can be filtered before being enqueued. There are two kinds of filters, accept and reject. Accept filters can be used to ensure that a URI matches before being enqueued. Reject filters do the opposite, keeping URIs from being enqueued if they _do_ match. 
+ +For instance: + +```crystal +# This will filter out all sites where the host is not "crystal-lang.org" +agent.accept_filter { |uri| uri.host == "crystal-lang.org" } +``` + +If you want to ignore certain parts of the above filter: + +```crystal +# This will ignore paths starting with "/api" +agent.reject_filter { |uri| uri.path.to_s.starts_with?("/api") } +``` + +The `#site` and `#host` methods add a default accept filter in order to keep things in the given site or host. + +### Resources + +All the above is useless if you can't do anything with the scanned resources, which is why we have the `Resource` class. Every scanned resource is converted into a `Resource` (or subclass) based on the content type. For instance, `text/html` becomes a `Resource::HTML` which is parsed using [kostya/myhtml](https://github.com/kostya/myhtml) for extra speed. + +Each resource has an associated `Agent#on_` method so you can do something when one of those resources is scanned: + +```crystal +agent.on_html do |page| + puts typeof(page) + # => Arachnid::Resource::HTML + + puts page.title + # => The Title of the Page +end +``` + +Currently we have: + +- `#on_html` +- `#on_image` +- `#on_script` +- `#on_stylesheet` +- `#on_xml` + +There is also `#on_resource` which is called for every resource, including ones that don't match the above types. Resources all include, at minimum the URI at which the resource was found, and the response (`HTTP::Client::Response`) instance. ## Contributing -1. Fork it () +1. Fork it () 2. Create your feature branch (`git checkout -b my-new-feature`) 3. Commit your changes (`git commit -am 'Add some feature'`) 4. 
Push to the branch (`git push origin my-new-feature`) @@ -36,4 +162,4 @@ TODO: Write development instructions here ## Contributors -- [your-name-here](https://github.com/your-github-user) - creator and maintainer +- [your-name-here](https://github.com/watzon) - creator and maintainer diff --git a/shard.yml b/shard.yml index 858eac0..8d4a558 100644 --- a/shard.yml +++ b/shard.yml @@ -8,6 +8,9 @@ dependencies: pool: github: watzon/pool branch: master + myhtml: + github: kostya/myhtml + branch: master crystal: 0.35.0 diff --git a/src/arachnid.cr b/src/arachnid.cr index 1541e22..bbda509 100644 --- a/src/arachnid.cr +++ b/src/arachnid.cr @@ -1,5 +1,11 @@ +require "http/client" + +require "myhtml" +require "pool/connection" + +require "./ext/*" require "./arachnid/*" module Arachnid - + include Logger end diff --git a/src/arachnid/agent.cr b/src/arachnid/agent.cr index 73ec7ac..44e1aa1 100644 --- a/src/arachnid/agent.cr +++ b/src/arachnid/agent.cr @@ -1,15 +1,193 @@ +require "./resource" + module Arachnid class Agent DEFAULT_USER_AGENT = "Arachnid #{Arachnid::VERSION} for Crystal #{Crystal::VERSION}" getter request_handler : RequestHandler + getter accept_filters : Array(Proc(URI, Bool)) + + getter reject_filters : Array(Proc(URI, Bool)) + + getter? running : Bool + + property queue : Queue + + property default_headers : HTTP::Headers + + property host_headers : Hash(String, HTTP::Headers) + + property? stop_on_empty : Bool + + property? follow_redirects : Bool + def initialize(client : (HTTP::Client.class)? 
= nil, - request_headers = HTTP::Headers.new, - user_agent = DEFAULT_USER_AGENT) + user_agent = DEFAULT_USER_AGENT, + default_headers = HTTP::Headers.new, + host_headers = {} of String => HTTP::Headers, + queue = Queue::Memory.new, + stop_on_empty = true, + follow_redirects = true) client ||= HTTP::Client - request_headers["User-Agent"] ||= user_agent - @request_handler = RequestHandler.new(client, request_headers) + @request_handler = RequestHandler.new(client) + @queue = queue.is_a?(Array) ? Queue::Memory.new : queue + + @user_agent = user_agent + @default_headers = default_headers + @host_headers = host_headers + @stop_on_empty = stop_on_empty + @follow_redirects = follow_redirects + + @accept_filters = [] of Proc(URI, Bool) + @reject_filters = [] of Proc(URI, Bool) + + @running = false + end + + def start_at(uri, &block) + uri = ensure_scheme(uri) + enqueue(uri, force: true) + with self yield self + start + end + + def site(site, &block) + uri = ensure_scheme(site) + enqueue(uri, force: true) + accept_filter { |u| u.host.to_s.ends_with?(uri.host.to_s) } + with self yield self + start + end + + def host(host, &block) + uri = ensure_scheme(host) + enqueue(uri, force: true) + accept_filter { |u| u.host == uri.host } + with self yield self + start + end + + def stop + @running = false + end + + def start + @running = true + + while @running + break if stop_on_empty? && @queue.empty? + unless @queue.empty? 
+ next_uri = @queue.dequeue + + Log.debug { "Scanning #{next_uri.to_s}" } + + headers = build_headers_for(next_uri) + response = @request_handler.request(:get, next_uri, headers: headers) + resource = Resource.from_content_type(next_uri, response) + + @resource_handlers.each do |handler| + handler.call(resource) + end + + # Call the registered handlers for this resource + {% begin %} + case resource + {% for subclass in Arachnid::Resource.subclasses %} + {% resname = subclass.name.split("::").last.downcase.underscore.id %} + when {{ subclass.id }} + @{{ resname }}_handlers.each do |handler| + handler.call(resource) + end + {% end %} + end + {% end %} + + # If the resource has an each_url method let's pull out + # its URLs and enqueue all of them. + if resource.responds_to?(:each_url) + resource.each_url do |uri| + enqueue(uri) + end + end + + # Follow redirects (only when follow_redirects is enabled) + if follow_redirects? && (300..399).includes?(response.status_code) + if location = response.headers["Location"]? + uri = next_uri.resolve(location) + @queue.enqueue(uri) + end + end + end + end + end + + def enqueue(uri, force = false) + uri = ensure_scheme(uri) + if force + @queue.enqueue(uri) + elsif !@queue.includes?(uri) && filter(uri) + @queue.enqueue(uri) + end + end + + def accept_filter(&block : URI -> Bool) + @accept_filters << block + end + + def reject_filter(&block : URI -> Bool) + @reject_filters << block + end + + @resource_handlers = [] of Proc(Resource, Nil) + def on_resource(&block : Resource ->) + @resource_handlers << block + end + + {% for subclass in Arachnid::Resource.subclasses %} + {% resname = subclass.name.split("::").last.downcase.underscore.id %} + @{{ resname }}_handlers = [] of Proc({{ subclass.id }}, Nil) + + # Create a handler for when a {{ subclass.id }} resource is found. The + # resource will be passed to the block. 
+ def on_{{ resname }}(&block : {{ subclass.id }} ->) + @{{ resname }}_handlers << block + end + {% end %} + + private def build_headers_for(uri) + headers = @default_headers.dup + headers["User-Agent"] ||= @user_agent + + if host_headers = @host_headers[uri.host.to_s]? + headers.merge!(host_headers) + end + + # TODO: Authorization and Cookies + + headers + end + + private def filter(uri) + return true if @accept_filters.empty? && @reject_filters.empty? + return false if @reject_filters.any?(&.call(uri)) + return false unless @accept_filters.empty? || @accept_filters.any?(&.call(uri)) + true + end + + private def ensure_scheme(uri : URI | String) + if uri.is_a?(URI) + if uri.scheme.nil? || uri.scheme.to_s.empty? + uri.scheme = "http" + end + else + if !uri.starts_with?("http") + uri = "http://#{uri}" + end + uri = URI.parse(uri) + end + + uri.as(URI) end end end diff --git a/src/arachnid/http_client.cr b/src/arachnid/http_client.cr deleted file mode 100644 index 25c5b43..0000000 --- a/src/arachnid/http_client.cr +++ /dev/null @@ -1,7 +0,0 @@ -require "http/client" - -module Arachnid - module HTTPClient - abstract def exec(method : String, path, headers : HTTP::Headers? = nil, body : HTTP::Client::BodyType = nil) : HTTP::Client::Response - end -end diff --git a/src/arachnid/logger.cr b/src/arachnid/logger.cr new file mode 100644 index 0000000..a5db11f --- /dev/null +++ b/src/arachnid/logger.cr @@ -0,0 +1,10 @@ +module Arachnid + module Logger + macro included + {% begin %} + {% tname = @type.name.stringify.split("::").map(&.underscore).join(".") %} + Log = ::Log.for({{ tname }}) + {% end %} + end + end +end diff --git a/src/arachnid/queue.cr b/src/arachnid/queue.cr new file mode 100644 index 0000000..b8bbfc6 --- /dev/null +++ b/src/arachnid/queue.cr @@ -0,0 +1,22 @@ +module Arachnid + # Abstract base class for URL queues. 
Within Arachnid itself `Queue` implementations + # strive to be thread safe, and custom implementations should as well. + abstract class Queue + # Add a new URL to the queue. + abstract def enqueue(uri : URI) + + # Remove a URL from the queue. + abstract def dequeue : URI + + # Check if the queue is empty. + abstract def empty? : Bool + + # Check if a URL has been enqueued. + abstract def includes?(uri : URI) : Bool + + # Clear the queue, removing all items. + abstract def clear + end +end + +require "./queue/*" diff --git a/src/arachnid/queue/memory.cr b/src/arachnid/queue/memory.cr new file mode 100644 index 0000000..b439535 --- /dev/null +++ b/src/arachnid/queue/memory.cr @@ -0,0 +1,48 @@ +module Arachnid + abstract class Queue + # A basic, thread safe, queue implementation that stores all URLs in memory. + class Memory < Queue + @queue : Deque(URI) + @history : Set(String) + @mutex : Mutex + + def initialize(queue = Deque(URI).new) + @queue = queue.is_a?(Deque) ? queue : Deque.new(queue.to_a) + @history = Set(String).new + @mutex = Mutex.new + end + + def enqueue(uri : URI) + @mutex.synchronize do + @history << "#{uri.host}#{uri.path}" + @queue << uri + end + end + + def dequeue : URI + @mutex.synchronize do + @queue.shift + end + end + + def empty? : Bool + @mutex.synchronize do + @queue.empty? + end + end + + def includes?(uri : URI) : Bool + @mutex.synchronize do + @history.includes?("#{uri.host}#{uri.path}") + end + end + + def clear + @mutex.synchronize do + @queue.clear + @history.clear + end + end + end + end +end diff --git a/src/arachnid/request_handler.cr b/src/arachnid/request_handler.cr index 682aa64..4325c0e 100644 --- a/src/arachnid/request_handler.cr +++ b/src/arachnid/request_handler.cr @@ -1,5 +1,3 @@ -require "pool/connection" - module Arachnid # Class for handling multiple simultanious requests for different hosts. Each host maintains it's own # dedicated pool of HTTP clients to pick from when needed, so as to keep things thread safe. 
@@ -10,9 +8,6 @@ module Arachnid # providing initializers as class variables. property base_client : HTTP::Client.class - # Any headers that should be sent on every request. - property request_headers : HTTP::Headers - # The maximum number of pools items to store per host. This will be the maximum number # of concurrent connections that any one host can have at a time. property max_pool_size : Int32 @@ -23,9 +18,8 @@ module Arachnid # The maximum amount of time to wait for a request to finish before raising an `IO::TimeoutError`. property connection_timeout : Time::Span - # A client specific TLS context instance. - # TODO: Allow this to be unique to each host. - property tls_context : HTTP::Client::TLSContext + # A map of host name to TLS context for that host + property tls_contexts : Hash(String, HTTP::Client::TLSContext) # A map of host name to connection pool. If `max_hosts` is a non-nil value, this hash will # be limited in size to that number, with older hosts being deleted to save on @@ -34,8 +28,7 @@ module Arachnid # Create a new `RequestHandler` instance. def initialize(@base_client, - @request_headers, - @tls_context : HTTP::Client::TLSContext = nil, + @tls_contexts = {} of String => HTTP::Client::TLSContext, @max_pool_size = 10, @initial_pool_size = 1, @connection_timeout = 1.second) @@ -46,7 +39,6 @@ module Arachnid # throw an `IO::TimeoutError` if a request is made and a new client isn't fetched in time. def request(method, url : String | URI, headers = nil) uri = url.is_a?(URI) ? url : URI.parse(url) - headers = headers ? 
@request_headers.merge(headers) : @request_headers pool_for(url).use do |client| client.exec(method.to_s.upcase, uri.full_path, headers: headers) end @@ -56,7 +48,7 @@ module Arachnid def pool_for(uri : URI) if host = uri.host session_pools[host] ||= ConnectionPool(HTTP::Client).new(capacity: @max_pool_size, initial: @initial_pool_size, timeout: @connection_timeout.total_seconds) do - @base_client.new(host.to_s, tls: @tls_context) + @base_client.new(host.to_s, tls: @tls_contexts[host.to_s]? || uri.scheme == "https") end else raise "Invalid URI" # TODO: Real error handling diff --git a/src/arachnid/resource.cr b/src/arachnid/resource.cr new file mode 100644 index 0000000..7888346 --- /dev/null +++ b/src/arachnid/resource.cr @@ -0,0 +1,45 @@ +require "./resource/*" +require "./resource/includes/*" + +module Arachnid + class Resource + include Cookies + include StatusCodes + include ContentTypes + + getter uri : URI + + getter response : HTTP::Client::Response + + def initialize(uri, response) + @uri = uri.is_a?(URI) ? uri : URI.parse(uri) + @response = response + end + + # Create a resource based on the Content-Type header + # of the resource. 
+ def self.from_content_type(uri, response) + headers = response.headers + case headers.fetch("Content-Type", nil) + when /html/ + return Resource::HTML.new(uri, response) + when /xml/ + return Resource::XML.new(uri, response) + when /image/ + return Resource::Image.new(uri, response) + when /stylesheet|css/ + return Resource::Stylesheet.new(uri, response) + when /javascript/ + return Resource::Script.new(uri, response) + else + Log.debug { "No resource for content type '#{headers["Content-Type"]?}'" } + return Resource.new(uri, response) + end + end + + # Save this resource to a file + def save(path) + File.write(path, @response.body) + end + end +end diff --git a/src/arachnid/resource/html.cr b/src/arachnid/resource/html.cr new file mode 100644 index 0000000..a23f194 --- /dev/null +++ b/src/arachnid/resource/html.cr @@ -0,0 +1,204 @@ +module Arachnid + class Resource + # Represents a parsed HTML page + class HTML < Resource + @parser : Myhtml::Parser + + delegate :body, :body!, :head, :head!, :root, :root!, :html, :html!, :document!, + :nodes, :css, :to_html, :to_pretty_html, :encoding, to: @parser + + def initialize(uri, response) + super(uri, response) + @parser = Myhtml::Parser.new(response.body, detect_encoding_from_meta: true) + end + + def title + titles = css("title") + if titles.size > 0 + titles.first.inner_text + else + "" + end + end + + def each_meta_redirect(&block : URI ->) + css("meta[http-equiv=\"refresh\"]").each do |tag| + if content = tag.attribute_by("content") + if (redirect = content.match(/url=(\S+)$/)) + uri = @uri.resolve(redirect[1]) + yield uri + end + end + end + end + + def meta_redirects + redirects = [] of URI + each_meta_redirect { |uri| redirects << uri } + redirects + end + + def meta_redirect? + !meta_redirects.empty? 
+ end + + def each_redirect(&block : URI ->) + redirects.each do |uri| + block.call(uri) + end + end + + def redirects + location = @response.headers.fetch("Location", nil) + locations = [location].compact.map { |l| @uri.resolve(l) } + locations + meta_redirects + end + + def each_mailto(&block : String ->) + css("a[href^=\"mailto:\"]").each do |tag| + if content = tag.attribute_by("href") + if match = content.match("mailto:(.*)") + yield match[1] + end + end + end + end + + def mailtos + mailtos = [] of String + each_mailto { |uri| mailtos << uri } + mailtos + end + + def each_link(&block : URI ->) + css("a").each do |tag| + if href = tag.attribute_by("href") + unless href.match(/^(javascript|mailto|tel)/) + uri = @uri.resolve(href) + block.call(uri) if uri.host + end + end + end + end + + def links + links = [] of URI + each_link { |uri| links << uri } + links + end + + def each_image(&block : URI ->) + css("img").each do |tag| + if src = tag.attribute_by("src") + uri = @uri.resolve(src) + yield uri + end + + if srcset = tag.attribute_by("srcset") + parts = srcset.split(",") + parts.each do |set| + url = set.split(/\s+/).first + uri = @uri.resolve(url) + yield uri + end + end + end + end + + def images + images = [] of URI + each_image { |uri| images << uri } + images + end + + def each_video(&block : URI ->) + css("video, video source").each do |tag| + if src = tag.attribute_by("src") + uri = @uri.resolve(src) + yield uri + end + end + end + + def videos + videos = [] of URI + each_video { |uri| videos << uri } + videos + end + + def each_script(&block : URI ->) + css("script").each do |tag| + if src = tag.attribute_by("src") + uri = @uri.resolve(src) + yield uri + end + end + end + + def scripts + scripts = [] of URI + each_script { |uri| scripts << uri } + scripts + end + + def each_resource(&block : URI ->) + css("link").each do |tag| + if href = tag.attribute_by("href") + uri = @uri.resolve(href) + yield uri + end + end + end + + def resources + resources = 
[] of URI + each_resource { |uri| resources << uri } + resources + end + + def each_frame(&block : URI ->) + css("frame").each do |tag| + if src = tag.attribute_by("src") + uri = @uri.resolve(src) + yield uri + end + end + end + + def frames + frames = [] of URI + each_frame { |uri| frames << uri } + frames + end + + def each_iframe(&block : URI ->) + css("iframe").each do |tag| + if src = tag.attribute_by("src") + uri = @uri.resolve(src) + yield uri + end + end + end + + def iframes + iframes = [] of URI + each_iframe { |uri| iframes << uri } + iframes + end + + def each_url(&block : URI ->) + urls.each do |uri| + yield uri + end + end + + def urls + links + redirects + images + videos + scripts + resources + frames + iframes + end + + def save(path) + File.write(path, @parser.to_pretty_html) + end + end + end +end diff --git a/src/arachnid/resource/image.cr b/src/arachnid/resource/image.cr new file mode 100644 index 0000000..90385ed --- /dev/null +++ b/src/arachnid/resource/image.cr @@ -0,0 +1,6 @@ +module Arachnid + class Resource + class Image < Resource + end + end +end diff --git a/src/arachnid/resource/includes/content_types.cr b/src/arachnid/resource/includes/content_types.cr new file mode 100644 index 0000000..a24aa75 --- /dev/null +++ b/src/arachnid/resource/includes/content_types.cr @@ -0,0 +1,162 @@ +module Arachnid + class Resource + module ContentTypes + # The Content-Type of the resource. + def content_type + @response.content_type || "" + end + + # The content types of the resource. + def content_types + types = @response.headers.get?("content-type") || [] of String + end + + # The charset included in the Content-Type. + def content_charset + content_types.each do |value| + if value.includes?(";") + value.split(";").each do |param| + param.strip! + + if param.starts_with?("charset=") + return param.split("=", 2).last + end + end + end + end + + return nil + end + + # Determines if any of the content-types of the resource include a given + # type. 
+ def is_content_type?(type : String | Regex) + content_types.any? do |value| + value = value.split(";", 2).first + + if type.is_a?(Regex) + value =~ type + else + value == type + end + end + end + + # Determines if the resource is plain-text. + def plain_text? + is_content_type?("text/plain") + end + + # ditto + def text? + plain_text? + end + + # Determines if the resource is a Directory Listing. + def directory? + is_content_type?("text/directory") + end + + # Determines if the resource is HTML document. + def html? + is_content_type?("text/html") + end + + # Determines if the resource is XML document. + def xml? + is_content_type?(/(text|application)\/xml/) + end + + # Determines if the resource is XML Stylesheet (XSL). + def xsl? + is_content_type?("text/xsl") + end + + # Determines if the resource is JavaScript. + def javascript? + is_content_type?(/(text|application)\/javascript/) + end + + # Determines if the resource is JSON. + def json? + is_content_type?("application/json") + end + + # Determines if the resource is a CSS stylesheet. + def css? + is_content_type?("text/css") + end + + # Determines if the resource is a RSS feed. + def rss? + is_content_type?(/application\/(rss\+xml|rdf\+xml)/) + end + + # Determines if the resource is an Atom feed. + def atom? + is_content_type?("application/atom+xml") + end + + # Determines if the resource is a MS Word document. + def ms_word? + is_content_type?("application/msword") + end + + # Determines if the resource is a PDF document. + def pdf? + is_content_type?("application/pdf") + end + + # Determines if the resource is a ZIP archive. + def zip? + is_content_type?("application/zip") + end + + # Determine if the resource is an image. + def image? + is_content_type?(/image\//) + end + + def png? + is_content_type?("image/png") + end + + def gif? + is_content_type?("image/gif") + end + + def jpg? + is_content_type?(/image\/(jpg|jpeg)/) + end + + def svg? + is_content_type?(/image\/svg(\+xml)?/) + end + + def video? 
+ is_content_type?(/video\/.*/) + end + + def mp4? + is_content_type?("video/mp4") + end + + def avi? + is_content_type?("video/x-msvideo") + end + + def wmv? + is_content_type?("video/x-ms-wmv") + end + + def quicktime? + is_content_type?("video/quicktime") + end + + def flash? + is_content_type?("video/flash") || + is_content_type?("application/x-shockwave-flash") + end + end + end +end diff --git a/src/arachnid/resource/includes/cookies.cr b/src/arachnid/resource/includes/cookies.cr new file mode 100644 index 0000000..69273f4 --- /dev/null +++ b/src/arachnid/resource/includes/cookies.cr @@ -0,0 +1,18 @@ +module Arachnid + class Resource + module Cookies + # Reserved names used within Cookie strings + RESERVED_COOKIE_NAMES = Regex.new("^(?:Path|Expires|Domain|Secure|HTTPOnly)$", :ignore_case) + + # The raw Cookie String sent along with the resource. + def cookie + @response.headers["Set-Cookie"]? || "" + end + + # The Cookie values sent along with the resource. + def cookies + @response.cookies + end + end + end +end diff --git a/src/arachnid/resource/includes/status_codes.cr b/src/arachnid/resource/includes/status_codes.cr new file mode 100644 index 0000000..b1026eb --- /dev/null +++ b/src/arachnid/resource/includes/status_codes.cr @@ -0,0 +1,59 @@ +module Arachnid + class Resource + module StatusCodes + # The response code from the resource. + def code + @response.status_code.to_i + end + + # Determines if the response code is `200`. + def ok? + code == 200 + end + + # Determines if the response code is `408` (Request Timeout). + def timedout? + code == 408 + end + + # Determines if the response code is `400`. + def bad_request? + code == 400 + end + + # Determines if the response code is `401`. + def unauthorized? + code == 401 + end + + # Determines if the response code is `403`. + def forbidden? + code == 403 + end + + # Determines if the response code is `404`. + def missing? + code == 404 + end + + # Determines if the response code is `500`. + def had_internal_server_error? 
+ code == 500 + end + + # Determines if the response code is `300`, `301`, `302`, `303` + # or `307`. Also checks for "soft" redirects added at the resource + # level by a meta refresh tag. + def redirect? + case code + when 300..303, 307 + true + when 200 + meta_redirect? + else + false + end + end + end + end +end diff --git a/src/arachnid/resource/script.cr b/src/arachnid/resource/script.cr new file mode 100644 index 0000000..9a910ee --- /dev/null +++ b/src/arachnid/resource/script.cr @@ -0,0 +1,6 @@ +module Arachnid + class Resource + class Script < Resource + end + end +end diff --git a/src/arachnid/resource/stylesheet.cr b/src/arachnid/resource/stylesheet.cr new file mode 100644 index 0000000..05453eb --- /dev/null +++ b/src/arachnid/resource/stylesheet.cr @@ -0,0 +1,6 @@ +module Arachnid + class Resource + class Stylesheet < Resource + end + end +end diff --git a/src/arachnid/resource/xml.cr b/src/arachnid/resource/xml.cr new file mode 100644 index 0000000..93dd8ad --- /dev/null +++ b/src/arachnid/resource/xml.cr @@ -0,0 +1,22 @@ +require "xml" + +module Arachnid + class Resource + class XML < Resource + @document : ::XML::Node + + delegate :==, :[], :[]=, :[]?, :attribute?, :attributes, :cdata, :children, :comment?, :content, + :content=, :delete, :document, :document?, :element, :encoding, :errors, :first_element_child, + :fragment?, :hash, :inner_text, :inspect, :name, :name=, :namespace, :namespace_scopes, :next, + :next_element, :next_sibling, :object_id, :parent, :previous, :previous_element, :previous_sibling, + :processing_instruction, :root, :text, :text=, :text, :to_s, :to_unsafe, :to_xml, :type, :unlink, + :version, :xml?, :xpath, :xpath_bool, :xpath_float, :xpath_node, :xpath_nodes, :xpath_string, + to: @document + + def initialize(uri, response) + super(uri, response) + @document = ::XML.parse(response.body) + end + end + end +end diff --git a/src/ext/uri.cr b/src/ext/uri.cr new file mode 100644 index 0000000..6c9138b --- /dev/null +++ 
b/src/ext/uri.cr @@ -0,0 +1,111 @@ +class URI + def split_path(path) + path.split("/") + end + + def merge_path(base, rel) + + # RFC2396, Section 5.2, 5) + # RFC2396, Section 5.2, 6) + base_path = split_path(base) + rel_path = split_path(rel) + + # RFC2396, Section 5.2, 6), a) + base_path << "" if base_path.last == ".." + while i = base_path.index("..") + base_path.delete_at(i - 1, 2) + end + + if (first = rel_path.first) && first.empty? + base_path.clear + rel_path.shift + end + + # RFC2396, Section 5.2, 6), c) + # RFC2396, Section 5.2, 6), d) + rel_path.push("") if rel_path.last == "." || rel_path.last == ".." + rel_path.delete(".") + + # RFC2396, Section 5.2, 6), e) + tmp = [] of String + rel_path.each do |x| + if x == ".." && + !(tmp.empty? || tmp.last == "..") + tmp.pop + else + tmp << x + end + end + + add_trailer_slash = !tmp.empty? + if base_path.empty? + base_path = [""] # keep '/' for root directory + elsif add_trailer_slash + base_path.pop + end + while x = tmp.shift + if x == ".." + # RFC2396, Section 4 + # a .. or . in an absolute path has no special meaning + base_path.pop if base_path.size > 1 + else + # if x == ".." + # valid absolute (but abnormal) path "/../..." + # else + # valid absolute path + # end + base_path << x + tmp.each {|t| base_path << t} + add_trailer_slash = false + break + end + end + base_path.push("") if add_trailer_slash + + return base_path.join('/') + end + + def merge(oth) + oth = URI.parse(oth) unless oth.is_a?(URI) + + if oth.absolute? + # raise BadURIError, "both URI are absolute" if absolute? + # hmm... should return oth for usability? + return oth + end + + unless self.absolute? + raise URI::Error.new("both URI are relative") + end + + base = self.dup + + authority = oth.userinfo || oth.host || oth.port + + # RFC2396, Section 5.2, 2) + if (oth.path.nil? || oth.path.empty?) 
&& !authority && !oth.query + base.fragment=(oth.fragment) if oth.fragment + return base + end + + base.query = nil + base.fragment=(nil) + + # RFC2396, Section 5.2, 4) + if !authority + base.path = merge_path(base.path, oth.path) if base.path && oth.path + else + # RFC2396, Section 5.2, 4) + base.path = oth.path if oth.path + end + + # RFC2396, Section 5.2, 7) + base.user = oth.userinfo if oth.userinfo + base.host = oth.host if oth.host + base.port = oth.port if oth.port + base.query = oth.query if oth.query + base.fragment=(oth.fragment) if oth.fragment + + return base + end +end