# Arachnid

Arachnid is a fast and powerful web scraping framework for Crystal. It provides an easy-to-use DSL for scraping webpages and processing whatever you come across along the way.

## Installation

1. Add the dependency to your `shard.yml`:

   ```yaml
   dependencies:
     arachnid:
       github: watzon/arachnid
   ```

2. Run `shards install`

## Usage

Arachnid provides an easy-to-use, powerful DSL for scraping websites. The first example crawls a single host and writes a JSON sitemap of the pages it visits.

```crystal
require "arachnid"
require "json"

# Let's build a sitemap of crystal-lang.org
# Links will be a hash of url to page title
links = {} of String => String

# Visit a particular host, in this case `crystal-lang.org`. This will
# not match on subdomains.
Arachnid.host("https://crystal-lang.org") do |spider|
  # Ignore the API section. It's a little big.
  spider.ignore_urls_like(/.*\/api.*/)

  spider.every_page do |page|
    puts "Visiting #{page.url}"

    # Ignore redirects for our sitemap
    unless page.redirect?
      # Add the url of every visited page to our sitemap
      links[page.url.to_s] = page.title.to_s.strip
    end
  end
end

File.write("crystal-lang.org-sitemap.json", links.to_pretty_json)
```

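To see which URLs a pattern like `/.*\/api.*/` would catch, it can help to try it on a few strings first. The sketch below uses only the Crystal standard library; the sample URLs are made up for illustration, and it assumes (not confirmed here) that `ignore_urls_like` tests the pattern against the full URL string.

```crystal
# The same pattern passed to `ignore_urls_like` in the example above.
ignore_pattern = /.*\/api.*/

# Hypothetical sample URLs, used only to illustrate the pattern.
urls = [
  "https://crystal-lang.org/api/1.0.0/String.html",
  "https://crystal-lang.org/community/",
  "https://crystal-lang.org/reference/",
]

urls.each do |url|
  # Regex#matches? returns true when the pattern matches anywhere in the string.
  status = ignore_pattern.matches?(url) ? "ignored" : "visited"
  puts "#{status}: #{url}"
end
```

Only the first URL is reported as ignored. Note that the pattern matches any URL containing `/api`, so a path such as `/apiary` would be skipped too; a stricter pattern may be worth considering if that matters.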
Want to scan external links as well? Use `Arachnid.start_at` instead of `Arachnid.host`:

```crystal
require "arachnid"

# To make things interesting, this time let's download
# every image we find.
Arachnid.start_at("https://crystal-lang.org") do |spider|
  # Set a base path to store all the images at. Recent Crystal versions
  # only expand `~` when `home: true` is passed.
  base_image_dir = File.expand_path("~/Pictures/arachnid", home: true)
  Dir.mkdir_p(base_image_dir)

  spider.every_page do |page|
    puts "Scanning #{page.url}"

    if page.image?
      # Since we're going to be saving a lot of images
      # let's spawn a new fiber for each one. This
      # makes things so much faster.
      spawn do
        # Output directory for images for this host
        directory = File.join(base_image_dir, page.url.host.to_s)
        Dir.mkdir_p(directory)

        # The name of the image
        filename = File.basename(page.url.path)

        # Save the image using the body of the page
        puts "Saving #{filename} to #{directory}"
        File.write(File.join(directory, filename), page.body)
      end
    end
  end
end
```

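The output path in the example above is derived from the page URL. Here is a minimal standalone sketch of that derivation, assuming `page.url` behaves like a standard library `URI` (the example's use of `host` and `path` suggests so, but this is an assumption). The image URL and the temporary base directory are hypothetical.

```crystal
require "uri"

# Hypothetical image URL and base directory, chosen only for illustration.
uri = URI.parse("https://crystal-lang.org/assets/logo.png")
base_image_dir = File.join(Dir.tempdir, "arachnid")

# One subdirectory per host, as in the example above.
directory = File.join(base_image_dir, uri.host.to_s)

# The last path segment becomes the file name.
filename = File.basename(uri.path)

puts "host:     #{uri.host}"
puts "filename: #{filename}"
puts "target:   #{File.join(directory, filename)}"
```

Grouping files by host keeps images with the same name on different sites from overwriting each other.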
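A Crystal program exits when its main fiber finishes, without waiting for fibers started with `spawn`. Whether Arachnid waits for fibers spawned inside its callbacks is not stated here, so if some saves seem to go missing, one option is to wait for them explicitly. The sketch below shows the general pattern with a `Channel`; the file names and the short `sleep` are hypothetical stand-ins for the real image writes.

```crystal
# Hypothetical stand-ins for image downloads: each job just sleeps briefly.
jobs = ["a.png", "b.png", "c.png"]
done = Channel(String).new

jobs.each do |name|
  spawn do
    sleep 10.milliseconds # pretend to write a file
    done.send(name)       # report completion to the main fiber
  end
end

# Block until every spawned fiber has reported back.
finished = [] of String
jobs.size.times { finished << done.receive }

puts "Saved #{finished.size} files"
```

In a real crawl the number of spawned fibers is not known in advance, so a counter incremented before each `spawn` would be needed in place of `jobs.size`.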
More documentation will be coming soon!

## Contributing

1. Fork it (<https://github.com/watzon/arachnid/fork>)
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request

## Contributors

- [Chris Watson](https://github.com/watzon) - creator and maintainer