Semi-finished with rewrite

Chris Watson 2020-06-20 23:09:57 -06:00
parent 03f561b480
commit 604d8996ea
19 changed files with 1049 additions and 32 deletions

README.md

@ -1,6 +1,6 @@
# arachnid
# Arachnid
TODO: Write a description here
Arachnid is a fast, soon-to-be multi-threading capable web crawler for Crystal. It recently underwent a full rewrite for Crystal 0.35.1, so see the documentation below for updated usage instructions.
## Installation
@ -9,26 +9,152 @@ TODO: Write a description here
```yaml
dependencies:
arachnid:
github: your-github-user/arachnid
github: watzon/arachnid
```
2. Run `shards install`
## Usage
First, of course, you need to require arachnid in your project:
```crystal
require "arachnid"
```
TODO: Write usage instructions here
### The Agent
## Development
`Agent` is the class that does all the heavy lifting and will be the main one you interact with. To create a new `Agent`, use `Agent.new`.
TODO: Write development instructions here
```crystal
agent = Arachnid::Agent.new
```
The `initialize` method takes a number of optional parameters:
#### `:client`
You can, if you wish, supply your own `HTTP::Client` subclass to the `Agent`. This can be useful if you want to use a proxy, provided the proxy client extends `HTTP::Client`.
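For instance, a minimal sketch (the `ProxyClient` subclass here is hypothetical; any class extending `HTTP::Client` should work, since the agent instantiates it once per host):
```crystal
# Hypothetical HTTP::Client subclass, e.g. one that tunnels
# requests through a proxy.
class ProxyClient < HTTP::Client
end

agent = Arachnid::Agent.new(client: ProxyClient)
```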
#### `:user_agent`
The user agent sent with every request. You can override it globally by setting a `User-Agent` key in `:default_headers`, or on a per-host basis with `:host_headers`.
#### `:default_headers`
The default headers to be used in every request.
#### `:host_headers`
Headers to be applied on a per-host basis. This is a hash mapping host name (`String`) to `HTTP::Headers`.
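As a sketch, here is how these three options fit together (the header values and host names are just placeholders):
```crystal
agent = Arachnid::Agent.new(
  user_agent: "MyCrawler/1.0",
  # Sent with every request unless overridden.
  default_headers: HTTP::Headers{"Accept" => "text/html"},
  # Merged on top of the defaults, but only for the matching host.
  host_headers: {
    "api.example.com" => HTTP::Headers{"Authorization" => "Bearer my-token"}
  }
)
```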
#### `:queue`
The `Arachnid::Queue` instance to use for storing links waiting to be processed. The default is a `Queue::Memory` (the only built-in implementation for now), but you can easily implement your own `Queue` using whatever you want as a backend.
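A custom queue only needs to implement the abstract `Queue` interface. As a purely illustrative sketch (a real implementation should also be thread-safe, like the built-in `Queue::Memory`):
```crystal
class ArrayQueue < Arachnid::Queue
  def initialize
    @items = [] of URI
    @seen = Set(String).new
  end

  # Add a new URL to the queue.
  def enqueue(uri : URI)
    @seen << uri.to_s
    @items << uri
  end

  # Remove the next URL from the queue.
  def dequeue : URI
    @items.shift
  end

  def empty? : Bool
    @items.empty?
  end

  def includes?(uri : URI) : Bool
    @seen.includes?(uri.to_s)
  end

  def clear
    @items.clear
    @seen.clear
  end
end

agent = Arachnid::Agent.new(queue: ArrayQueue.new)
```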
#### `:stop_on_empty`
Whether or not to stop running when the queue is empty. This is `true` by default. If set to `false`, the agent will keep running when the queue empties, so be sure you have a way to keep adding items to the queue, or to stop the agent yourself, as in the sketch below.
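A minimal sketch of the second option (the URL and the stop condition are placeholders); `#stop` simply flips the running flag so the loop exits:
```crystal
agent = Arachnid::Agent.new(stop_on_empty: false)
agent.enqueue("https://example.com")

# Since the agent won't stop on its own, stop it manually
# once some condition is met.
agent.on_resource do |resource|
  agent.stop if resource.uri.path == "/done"
end

agent.start
```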
#### `:follow_redirects`
Whether or not to follow redirects (add them to the queue).
### Starting the Agent
There are four ways to start your Agent once it's been created. Here are some examples:
#### `#start_at`
`#start_at` starts the Agent running on a particular URL. It adds a single URL to the queue and starts there.
```crystal
agent.start_at("https://crystal-lang.org") do
# ...
end
```
#### `#site`
`#site` starts the agent running at the given URL and adds a rule that keeps the agent restricted to the given site. This allows the agent to scan the given domain and any subdomains. For instance:
```crystal
agent.site("https://crystal-lang.org") do
# ...
end
```
The above will match `crystal-lang.org` and `forum.crystal-lang.org`, but not `github.com/crystal-lang` or any other site not within the `*.crystal-lang.org` space.
#### `#host`
`#host` is like `#site`, but with the added restriction of staying on the exact host. Subdomains are not included.
```crystal
agent.host("crystal-lang.org") do
# ...
end
```
#### `#start`
Provided you already have URIs in the queue ready to be scanned, you can also just use `#start` to start the Agent running.
```crystal
agent.enqueue("https://crystal-lang.org")
agent.enqueue("https://kemalcr.com")
agent.start
```
### Filters
URIs can be filtered before being enqueued. There are two kinds of filters: accept and reject. Accept filters ensure that a URI matches before it is enqueued. Reject filters do the opposite, keeping URIs from being enqueued if they _do_ match.
For instance:
```crystal
# This will filter out all sites where the host is not "crystal-lang.org"
agent.accept_filter { |uri| uri.host == "crystal-lang.org" }
```
If you want to skip certain paths within an otherwise accepted site:
```crystal
# This will ignore paths starting with "/api"
agent.reject_filter { |uri| uri.path.to_s.starts_with?("/api") }
```
The `#site` and `#host` methods add a default accept filter in order to keep things in the given site or host.
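Because the block passed to `#site` (and `#host`) is evaluated with the agent as its receiver, you can register extra filters directly inside it, for example:
```crystal
agent.site("https://crystal-lang.org") do
  # Stay on *.crystal-lang.org (added by #site), but skip API paths.
  reject_filter { |uri| uri.path.to_s.starts_with?("/api") }
end
```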
### Resources
All the above is useless if you can't do anything with the scanned resources, which is why we have the `Resource` class. Every scanned resource is converted into a `Resource` (or subclass) based on the content type. For instance, `text/html` becomes a `Resource::HTML` which is parsed using [kostya/myhtml](https://github.com/kostya/myhtml) for extra speed.
Each resource type has an associated `Agent#on_*` method so you can do something when one of those resources is scanned:
```crystal
agent.on_html do |page|
puts typeof(page)
# => Arachnid::Resource::HTML
puts page.title
# => The Title of the Page
end
```
Currently we have:
- `#on_html`
- `#on_image`
- `#on_script`
- `#on_stylesheet`
- `#on_xml`
There is also `#on_resource`, which is called for every resource, including ones that don't match the above types. All resources include, at minimum, the URI at which the resource was found and the response (`HTTP::Client::Response`) instance.
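For example, a minimal `#on_resource` handler that logs every fetch using those two fields:
```crystal
agent.on_resource do |resource|
  puts "#{resource.response.status_code} #{resource.uri}"
end
```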
## Contributing
1. Fork it (<https://github.com/your-github-user/arachnid/fork>)
1. Fork it (<https://github.com/watzon/arachnid/fork>)
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
@ -36,4 +162,4 @@ TODO: Write development instructions here
## Contributors
- [your-name-here](https://github.com/your-github-user) - creator and maintainer
- [watzon](https://github.com/watzon) - creator and maintainer


@ -8,6 +8,9 @@ dependencies:
pool:
github: watzon/pool
branch: master
myhtml:
github: kostya/myhtml
branch: master
crystal: 0.35.0


@ -1,5 +1,11 @@
require "http/client"
require "myhtml"
require "pool/connection"
require "./ext/*"
require "./arachnid/*"
module Arachnid
include Logger
end


@ -1,15 +1,193 @@
require "./resource"
module Arachnid
class Agent
DEFAULT_USER_AGENT = "Arachnid #{Arachnid::VERSION} for Crystal #{Crystal::VERSION}"
getter request_handler : RequestHandler
getter accept_filters : Array(Proc(URI, Bool))
getter reject_filters : Array(Proc(URI, Bool))
getter? running : Bool
property queue : Queue
property default_headers : HTTP::Headers
property host_headers : Hash(String, HTTP::Headers)
property? stop_on_empty : Bool
property? follow_redirects : Bool
def initialize(client : (HTTP::Client.class)? = nil,
request_headers = HTTP::Headers.new,
user_agent = DEFAULT_USER_AGENT)
user_agent = DEFAULT_USER_AGENT,
default_headers = HTTP::Headers.new,
host_headers = {} of String => HTTP::Headers,
queue = Queue::Memory.new,
stop_on_empty = true,
follow_redirects = true)
client ||= HTTP::Client
request_headers["User-Agent"] ||= user_agent
@request_handler = RequestHandler.new(client, request_headers)
@request_handler = RequestHandler.new(client)
@queue = queue.is_a?(Array) ? Queue::Memory.new : queue
@user_agent = user_agent
@default_headers = default_headers
@host_headers = host_headers
@stop_on_empty = stop_on_empty
@follow_redirects = follow_redirects
@accept_filters = [] of Proc(URI, Bool)
@reject_filters = [] of Proc(URI, Bool)
@running = false
end
def start_at(uri, &block)
uri = ensure_scheme(uri)
enqueue(uri, force: true)
with self yield self
start
end
def site(site, &block)
uri = ensure_scheme(site)
enqueue(uri, force: true)
accept_filter { |u| u.host.to_s.ends_with?(uri.host.to_s) }
with self yield self
start
end
def host(host, &block)
uri = ensure_scheme(host)
enqueue(uri, force: true)
accept_filter { |u| u.host == uri.host }
with self yield self
start
end
def stop
@running = false
end
def start
@running = true
while @running
break if stop_on_empty? && @queue.empty?
unless @queue.empty?
next_uri = @queue.dequeue
Log.debug { "Scanning #{next_uri.to_s}" }
headers = build_headers_for(next_uri)
response = @request_handler.request(:get, next_uri, headers: headers)
resource = Resource.from_content_type(next_uri, response)
@resource_handlers.each do |handler|
handler.call(resource)
end
# Call the registered handlers for this resource
{% begin %}
case resource
{% for subclass in Arachnid::Resource.subclasses %}
{% resname = subclass.name.split("::").last.downcase.underscore.id %}
when {{ subclass.id }}
@{{ resname }}_handlers.each do |handler|
handler.call(resource)
end
{% end %}
end
{% end %}
# If the resource has an each_url method, pull out
# its URLs and enqueue all of them.
if resource.responds_to?(:each_url)
resource.each_url do |uri|
enqueue(uri)
end
end
# Check for redirects, but only follow them if enabled
if follow_redirects? && (300..399).includes?(response.status_code)
if location = response.headers["Location"]?
# Resolve the Location header against the current URI so relative
# redirects work, then run it through the normal filters.
enqueue(next_uri.resolve(location))
end
end
end
end
end
def enqueue(uri, force = false)
uri = ensure_scheme(uri)
if force
@queue.enqueue(uri)
elsif !@queue.includes?(uri) && filter(uri)
@queue.enqueue(uri)
end
end
def accept_filter(&block : URI -> Bool)
@accept_filters << block
end
def reject_filter(&block : URI -> Bool)
@reject_filters << block
end
@resource_handlers = [] of Proc(Resource, Nil)
def on_resource(&block : Resource ->)
@resource_handlers << block
end
{% for subclass in Arachnid::Resource.subclasses %}
{% resname = subclass.name.split("::").last.downcase.underscore.id %}
@{{ resname }}_handlers = [] of Proc({{ subclass.id }}, Nil)
# Create a handler for when a {{ subclass.id }} resource is found. The
# resource will be passed to the block.
def on_{{ resname }}(&block : {{ subclass.id }} ->)
@{{ resname }}_handlers << block
end
{% end %}
private def build_headers_for(uri)
headers = @default_headers.dup
headers["User-Agent"] ||= @user_agent
if host_headers = @host_headers[uri.host.to_s]?
headers.merge!(host_headers)
end
# TODO: Authorization and Cookies
headers
end
private def filter(uri)
return true if @accept_filters.empty? && @reject_filters.empty?
return false if @reject_filters.any?(&.call(uri))
return false unless @accept_filters.empty? || @accept_filters.any?(&.call(uri))
true
end
private def ensure_scheme(uri : URI | String)
if uri.is_a?(URI)
if uri.scheme.nil? || uri.scheme.to_s.empty?
uri.scheme = "http"
end
else
if !uri.starts_with?("http")
uri = "http://#{uri}"
end
uri = URI.parse(uri)
end
uri.as(URI)
end
end
end


@ -1,7 +0,0 @@
require "http/client"
module Arachnid
module HTTPClient
abstract def exec(method : String, path, headers : HTTP::Headers? = nil, body : HTTP::Client::BodyType = nil) : HTTP::Client::Response
end
end

src/arachnid/logger.cr Normal file

@ -0,0 +1,10 @@
module Arachnid
module Logger
macro included
{% begin %}
{% tname = @type.name.stringify.split("::").map(&.underscore).join(".") %}
Log = ::Log.for({{ tname }})
{% end %}
end
end
end

src/arachnid/queue.cr Normal file

@ -0,0 +1,22 @@
module Arachnid
# Abstract base class for URL queues. Within Arachnid itself `Queue` implementations
# strive to be thread safe, and custom implementations should as well.
abstract class Queue
# Add a new URL to the queue.
abstract def enqueue(uri : URI)
# Remove a URL from the queue.
abstract def dequeue : URI
# Check if the queue is empty.
abstract def empty? : Bool
# Check if a URL has been enqueued.
abstract def includes?(uri : URI) : Bool
# Clear the queue, removing all items.
abstract def clear
end
end
require "./queue/*"


@ -0,0 +1,48 @@
module Arachnid
abstract class Queue
# A basic, thread-safe queue implementation that stores all URLs in memory.
class Memory < Queue
@queue : Deque(URI)
@history : Set(String)
@mutex : Mutex
def initialize(queue = Deque(URI).new)
@queue = queue.is_a?(Deque) ? queue : Deque.new(queue.to_a)
@history = Set(String).new
@mutex = Mutex.new
end
def enqueue(uri : URI)
@mutex.synchronize do
@history << "#{uri.host}#{uri.path}"
@queue << uri
end
end
def dequeue : URI
@mutex.synchronize do
@queue.shift
end
end
def empty? : Bool
@mutex.synchronize do
@queue.empty?
end
end
def includes?(uri : URI) : Bool
@mutex.synchronize do
@history.includes?("#{uri.host}#{uri.path}")
end
end
def clear
@mutex.synchronize do
@queue.clear
@history.clear
end
end
end
end
end


@ -1,5 +1,3 @@
require "pool/connection"
module Arachnid
# Class for handling multiple simultaneous requests to different hosts. Each host maintains its own
# dedicated pool of HTTP clients to pick from when needed, so as to keep things thread safe.
@ -10,9 +8,6 @@ module Arachnid
# providing initializers as class variables.
property base_client : HTTP::Client.class
# Any headers that should be sent on every request.
property request_headers : HTTP::Headers
# The maximum number of pooled clients to store per host. This will be the maximum number
# of concurrent connections that any one host can have at a time.
property max_pool_size : Int32
@ -23,9 +18,8 @@ module Arachnid
# The maximum amount of time to wait for a request to finish before raising an `IO::TimeoutError`.
property connection_timeout : Time::Span
# A client specific TLS context instance.
# TODO: Allow this to be unique to each host.
property tls_context : HTTP::Client::TLSContext
# A map of host name to TLS context for that host
property tls_contexts : Hash(String, HTTP::Client::TLSContext)
# A map of host name to connection pool. If `max_hosts` is a non-nil value, this hash will
# be limited in size to that number, with older hosts being deleted to save on
@ -34,8 +28,7 @@ module Arachnid
# Create a new `RequestHandler` instance.
def initialize(@base_client,
@request_headers,
@tls_context : HTTP::Client::TLSContext = nil,
@tls_contexts = {} of String => HTTP::Client::TLSContext,
@max_pool_size = 10,
@initial_pool_size = 1,
@connection_timeout = 1.second)
@ -46,7 +39,6 @@ module Arachnid
# throw an `IO::TimeoutError` if a request is made and a new client isn't fetched in time.
def request(method, url : String | URI, headers = nil)
uri = url.is_a?(URI) ? url : URI.parse(url)
headers = headers ? @request_headers.merge(headers) : @request_headers
pool_for(url).use do |client|
client.exec(method.to_s.upcase, uri.full_path, headers: headers)
end
@ -56,7 +48,7 @@ module Arachnid
def pool_for(uri : URI)
if host = uri.host
session_pools[host] ||= ConnectionPool(HTTP::Client).new(capacity: @max_pool_size, initial: @initial_pool_size, timeout: @connection_timeout.total_seconds) do
@base_client.new(host.to_s, tls: @tls_context)
@base_client.new(host.to_s, tls: @tls_contexts[host.to_s]? || uri.scheme == "https")
end
else
raise "Invalid URI" # TODO: Real error handling

src/arachnid/resource.cr Normal file

@ -0,0 +1,45 @@
require "./resource/*"
require "./resource/includes/*"
module Arachnid
class Resource
include Cookies
include StatusCodes
include ContentTypes
getter uri : URI
getter response : HTTP::Client::Response
def initialize(uri, response)
@uri = uri.is_a?(URI) ? uri : URI.parse(uri)
@response = response
end
# Create a resource based on the Content-Type header
# of the resource.
def self.from_content_type(uri, response)
headers = response.headers
case headers.fetch("Content-Type", nil)
when /html/
return Resource::HTML.new(uri, response)
when /xml/
return Resource::XML.new(uri, response)
when /image/
return Resource::Image.new(uri, response)
when /stylesheet|css/
return Resource::Stylesheet.new(uri, response)
when /javascript/
return Resource::Script.new(uri, response)
else
Log.debug { "No resource for content type '#{headers["Content-Type"]?}'" }
return Resource.new(uri, response)
end
end
# Save this resource to a file
def save(path)
File.write(path, @response.body)
end
end
end


@ -0,0 +1,204 @@
module Arachnid
class Resource
# Represents a parsed HTML page
class HTML < Resource
@parser : Myhtml::Parser
delegate :body, :body!, :head, :head!, :root, :root!, :html, :html!, :document!,
:nodes, :css, :to_html, :to_pretty_html, :encoding, to: @parser
def initialize(uri, response)
super(uri, response)
@parser = Myhtml::Parser.new(response.body, detect_encoding_from_meta: true)
end
def title
titles = css("title")
if titles.size > 0
titles.first.inner_text
else
""
end
end
def each_meta_redirect(&block : URI ->)
css("meta[http-equiv=\"refresh\"]").each do |tag|
if content = tag.attribute_by("content")
if (redirect = content.match(/url=(\S+)$/))
uri = @uri.resolve(redirect[1])
yield uri
end
end
end
end
def meta_redirects
redirects = [] of URI
each_meta_redirect { |uri| redirects << uri }
redirects
end
def meta_redirect?
!meta_redirects.empty?
end
def each_redirect(&block : URI ->)
redirects.each do |uri|
block.call(uri)
end
end
def redirects
location = @response.headers.fetch("Location", nil)
locations = [location].compact.map { |l| @uri.resolve(l) }
locations + meta_redirects
end
def each_mailto(&block : String ->)
css("a[href^=\"mailto:\"]").each do |tag|
if content = tag.attribute_by("href")
if match = content.match("mailto:(.*)")
yield match[1]
end
end
end
end
def mailtos
mailtos = [] of String
each_mailto { |uri| mailtos << uri }
mailtos
end
def each_link(&block : URI ->)
css("a").each do |tag|
if href = tag.attribute_by("href")
unless href.match(/^(javascript|mailto|tel)/)
uri = @uri.resolve(href)
block.call(uri) if uri.host
end
end
end
end
def links
links = [] of URI
each_link { |uri| links << uri }
links
end
def each_image(&block : URI ->)
css("img").each do |tag|
if src = tag.attribute_by("src")
uri = @uri.resolve(src)
yield uri
end
if srcset = tag.attribute_by("srcset")
parts = srcset.split(",")
parts.each do |set|
url = set.split(/\s+/).first
uri = @uri.resolve(url)
yield uri
end
end
end
end
def images
images = [] of URI
each_image { |uri| images << uri }
images
end
def each_video(&block : URI ->)
css("video, video source").each do |tag|
if src = tag.attribute_by("src")
uri = @uri.resolve(src)
yield uri
end
end
end
def videos
videos = [] of URI
each_video { |uri| videos << uri }
videos
end
def each_script(&block : URI ->)
css("script").each do |tag|
if src = tag.attribute_by("src")
uri = @uri.resolve(src)
yield uri
end
end
end
def scripts
scripts = [] of URI
each_script { |uri| scripts << uri }
scripts
end
def each_resource(&block : URI ->)
css("link").each do |tag|
if href = tag.attribute_by("href")
uri = @uri.resolve(href)
yield uri
end
end
end
def resources
resources = [] of URI
each_resource { |uri| resources << uri }
resources
end
def each_frame(&block : URI ->)
css("frame").each do |tag|
if src = tag.attribute_by("src")
uri = @uri.resolve(src)
yield uri
end
end
end
def frames
frames = [] of URI
each_frame { |uri| frames << uri }
frames
end
def each_iframe(&block : URI ->)
css("iframe").each do |tag|
if src = tag.attribute_by("src")
uri = @uri.resolve(src)
yield uri
end
end
end
def iframes
iframes = [] of URI
each_iframe { |uri| iframes << uri }
iframes
end
def each_url(&block : URI ->)
urls.each do |uri|
yield uri
end
end
def urls
links + redirects + images + videos + scripts + resources + frames + iframes
end
def save(path)
File.write(path, @parser.to_pretty_html)
end
end
end
end


@ -0,0 +1,6 @@
module Arachnid
class Resource
class Image < Resource
end
end
end


@ -0,0 +1,162 @@
module Arachnid
class Resource
module ContentTypes
# The Content-Type of the resource.
def content_type
@response.content_type || ""
end
# The content types of the resource.
def content_types
@response.headers.get?("content-type") || [] of String
end
# The charset included in the Content-Type.
def content_charset
content_types.each do |value|
if value.includes?(";")
value.split(";").each do |param|
param = param.strip
if param.starts_with?("charset=")
return param.split("=", 2).last
end
end
end
end
return nil
end
# Determines if any of the content-types of the resource include a given
# type.
def is_content_type?(type : String | Regex)
content_types.any? do |value|
value = value.split(";", 2).first
if type.is_a?(Regex)
value =~ type
else
value == type
end
end
end
# Determines if the resource is plain-text.
def plain_text?
is_content_type?("text/plain")
end
# ditto
def text?
plain_text?
end
# Determines if the resource is a Directory Listing.
def directory?
is_content_type?("text/directory")
end
# Determines if the resource is an HTML document.
def html?
is_content_type?("text/html")
end
# Determines if the resource is an XML document.
def xml?
is_content_type?(/(text|application)\/xml/)
end
# Determines if the resource is an XML Stylesheet (XSL).
def xsl?
is_content_type?("text/xsl")
end
# Determines if the resource is JavaScript.
def javascript?
is_content_type?(/(text|application)\/javascript/)
end
# Determines if the resource is JSON.
def json?
is_content_type?("application/json")
end
# Determines if the resource is a CSS stylesheet.
def css?
is_content_type?("text/css")
end
# Determines if the resource is an RSS feed.
def rss?
is_content_type?(/application\/(rss\+xml|rdf\+xml)/)
end
# Determines if the resource is an Atom feed.
def atom?
is_content_type?("application/atom+xml")
end
# Determines if the resource is a MS Word document.
def ms_word?
is_content_type?("application/msword")
end
# Determines if the resource is a PDF document.
def pdf?
is_content_type?("application/pdf")
end
# Determines if the resource is a ZIP archive.
def zip?
is_content_type?("application/zip")
end
# Determine if the resource is an image.
def image?
is_content_type?(/image\//)
end
def png?
is_content_type?("image/png")
end
def gif?
is_content_type?("image/gif")
end
def jpg?
is_content_type?(/image\/(jpg|jpeg)/)
end
def svg?
is_content_type?(/image\/svg(\+xml)?/)
end
def video?
is_content_type?(/video\/.*/)
end
def mp4?
is_content_type?("video/mp4")
end
def avi?
is_content_type?("video/x-msvideo")
end
def wmv?
is_content_type?("video/x-ms-wmv")
end
def quicktime?
is_content_type?("video/quicktime")
end
def flash?
is_content_type?("video/flash") ||
is_content_type?("application/x-shockwave-flash")
end
end
end
end


@ -0,0 +1,18 @@
module Arachnid
class Resource
module Cookies
# Reserved names used within Cookie strings
RESERVED_COOKIE_NAMES = Regex.new("^(?:Path|Expires|Domain|Secure|HTTPOnly)$", :ignore_case)
# The raw Cookie String sent along with the resource.
def cookie
@response.headers["Set-Cookie"]? || ""
end
# The Cookie values sent along with the resource.
def cookies
@response.cookies
end
end
end
end


@ -0,0 +1,59 @@
module Arachnid
class Resource
module StatusCodes
# The response code from the resource.
def code
@response.status_code.to_i
end
# Determines if the response code is `200`.
def ok?
code == 200
end
# Determines if the response code is `308`.
def timedout?
code == 308
end
# Determines if the response code is `400`.
def bad_request?
code == 400
end
# Determines if the response code is `401`.
def unauthorized?
code == 401
end
# Determines if the response code is `403`.
def forbidden?
code == 403
end
# Determines if the response code is `404`.
def missing?
code == 404
end
# Determines if the response code is `500`.
def had_internal_server_error?
code == 500
end
# Determines if the response code is `300`, `301`, `302`, `303`
# or `307`. Also checks for "soft" redirects added at the resource
# level by a meta refresh tag.
def redirect?
case code
when 300..303, 307
true
when 200
meta_redirect?
else
false
end
end
end
end
end


@ -0,0 +1,6 @@
module Arachnid
class Resource
class Script < Resource
end
end
end


@ -0,0 +1,6 @@
module Arachnid
class Resource
class Stylesheet < Resource
end
end
end


@ -0,0 +1,22 @@
require "xml"
module Arachnid
class Resource
class XML < Resource
@document : ::XML::Node
delegate :==, :[], :[]=, :[]?, :attribute?, :attributes, :cdata, :children, :comment?, :content,
:content=, :delete, :document, :document?, :element, :encoding, :errors, :first_element_child,
:fragment?, :hash, :inner_text, :inspect, :name, :name=, :namespace, :namespace_scopes, :next,
:next_element, :next_sibling, :object_id, :parent, :previous, :previous_element, :previous_sibling,
:processing_instruction, :root, :text, :text=, :text, :to_s, :to_unsafe, :to_xml, :type, :unlink,
:version, :xml?, :xpath, :xpath_bool, :xpath_float, :xpath_node, :xpath_nodes, :xpath_string,
to: @document
def initialize(uri, response)
super(uri, response)
@document = ::XML.parse(response.body)
end
end
end
end

src/ext/uri.cr Normal file

@ -0,0 +1,111 @@
class URI
def split_path(path)
path.split("/")
end
def merge_path(base, rel)
# RFC2396, Section 5.2, 5)
# RFC2396, Section 5.2, 6)
base_path = split_path(base)
rel_path = split_path(rel)
# RFC2396, Section 5.2, 6), a)
base_path << "" if base_path.last == ".."
while i = base_path.index("..")
base_path.delete_at(i - 1, 2) # drop the ".." and the segment before it
end
if (first = rel_path.first) && first.empty?
base_path.clear
rel_path.shift
end
# RFC2396, Section 5.2, 6), c)
# RFC2396, Section 5.2, 6), d)
rel_path.push("") if rel_path.last == '.' || rel_path.last == ".."
rel_path.delete('.')
# RFC2396, Section 5.2, 6), e)
tmp = [] of String
rel_path.each do |x|
if x == ".." &&
!(tmp.empty? || tmp.last == "..")
tmp.pop
else
tmp << x
end
end
add_trailer_slash = !tmp.empty?
if base_path.empty?
base_path = [""] # keep '/' for root directory
elsif add_trailer_slash
base_path.pop
end
while x = tmp.shift?
if x == ".."
# RFC2396, Section 4
# a .. or . in an absolute path has no special meaning
base_path.pop if base_path.size > 1
else
# if x == ".."
# valid absolute (but abnormal) path "/../..."
# else
# valid absolute path
# end
base_path << x
tmp.each {|t| base_path << t}
add_trailer_slash = false
break
end
end
base_path.push("") if add_trailer_slash
return base_path.join('/')
end
def merge(oth)
oth = URI.parse(oth) unless oth.is_a?(URI)
if oth.absolute?
# raise BadURIError, "both URI are absolute" if absolute?
# hmm... should return oth for usability?
return oth
end
unless self.absolute?
raise URI::Error.new("both URIs are relative")
end
base = self.dup
authority = oth.userinfo || oth.host || oth.port
# RFC2396, Section 5.2, 2)
if (oth.path.nil? || oth.path.empty?) && !authority && !oth.query
base.fragment = oth.fragment if oth.fragment
return base
end
base.query = nil
base.fragment = nil
# RFC2396, Section 5.2, 4)
if !authority
base.path = merge_path(base.path, oth.path) if base.path && oth.path
else
# RFC2396, Section 5.2, 4)
base.path = oth.path if oth.path
end
# RFC2396, Section 5.2, 7)
base.user = oth.userinfo if oth.userinfo
base.host = oth.host if oth.host
base.port = oth.port if oth.port
base.query = oth.query if oth.query
base.fragment = oth.fragment if oth.fragment
return base
end
end