{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping craigslist\n", "## Overview\n", "In this notebook, I'll show you how to make a simple query on Craigslist using some nifty python modules. You can take advantage of all the structure data that exists on webpages to collect interesting datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import pandas as pd\n", "from bs4 import BeautifulSoup as bs4\n", "%pylab inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to figure out how to submit a query to Craigslist. As with many websites, one way you can do this is simply by constructing the proper URL and sending it to Craigslist. Here's a sample URL that is returned after manually typing in a search to Craigslist:\n", "> `http://sfbay.craigslist.org/search/eby/apa?bedrooms=1&pets_cat=1&pets_dog=1&is_furnished=1`\n", "\n", "This is actually two separate things. The first tells craigslist what kind of thing we're searching for:\n", "\n", "> `http://sfbay.craigslist.org/search/eby/apa` says we're searching in the sfbay area (`sfbay`) for apartments (`apa`) in the east bay (`eby`).\n", "\n", "The second part contains the parameters that we pass to the search:\n", "\n", "> `?bedrooms=1&pets_cat=1&pets_dog=1&is_furnished=1` says we want 1+ bedrooms, cats allowed, dogs allowed, and furnished apartments. You can manually change these fields in order to create new queries.\n", "\n", "## Getting a single posting\n", "\n", "So, we'll use this knowledge to send some custom URLs to Craigslist. We'll do this using the `requests` python module, which is really useful for querying websites." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In internet lingo, we're posting a `get` requests to the website, which simply says that we'd like to get some information from the Craigslist website. With requests, we can easily create a dictionary that specifies parameters in the URL:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "url_base = 'http://sfbay.craigslist.org/search/eby/apa'\n", "params = dict(bedrooms=1, is_furnished=1)\n", "rsp = requests.get(url_base, params=params)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# Note that requests automatically created the right URL:\n", "print(rsp.url)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# We can access the content of the response that Craigslist sent back here:\n", "print(rsp.text[:500])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow, that's a lot of code. Remember, websites serve HTML documents, and usually your browser will automatically render this into a nice webpage for you. Since we're doing this with python, we get back the raw text. 
This is really useful, but how can we possibly parse it all?\n", "\n", "For this, we'll turn to another great package, BeautifulSoup:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# BS4 can quickly parse our text; make sure to tell it that you're giving it HTML\n", "html = bs4(rsp.text, 'html.parser')\n", "\n", "# BS makes it easy to look through a document\n", "print(html.prettify()[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "BeautifulSoup lets us quickly search through an HTML document and pull out whatever information we want.\n", "\n", "Scanning through this text, we see a common structure repeated `
<p class=\"row\">`. This seems to be the container that holds the information for a single apartment.\n",
"\n",
"In BeautifulSoup, we can quickly get all instances of this container:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# find_all will pull entries that fit your search criteria.\n",
"# Note that we have to use brackets to define the `attrs` dictionary\n",
"# Because \"class\" is a special word in python, so we need to give a string.\n",
"apts = html.find_all('p', attrs={'class': 'row'})\n",
"print(len(apts))"
]
},
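{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of how `find_all` and attribute access behave, here's the same pattern run on a tiny, made-up HTML snippet. The tags, class names, and links below are invented for illustration; they aren't Craigslist's actual markup:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# A fake, two-listing HTML snippet (illustration only)\n",
"demo = ('<p class=\"row\"><span class=\"price\">$2450</span>'\n",
"        '<a href=\"/eby/apa/123.html\">Sunny 1BR</a></p>'\n",
"        '<p class=\"row\"><span class=\"price\">$1875</span>'\n",
"        '<a href=\"/eby/apa/456.html\">Lovely yard</a></p>')\n",
"rows = bs4(demo, 'html.parser').find_all('p', attrs={'class': 'row'})\n",
"print(len(rows))  # number of listings found\n",
"print(rows[0].find('span', attrs={'class': 'price'}).text)\n",
"print(rows[0].find('a')['href'])"
]
},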
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's look inside the values of a single apartment listing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"ename": "IndexError",
"evalue": "list index out of range",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\n",
" \n",
"
\n",
"\n",
" \n",
" \n",
" price \n",
" size \n",
" brs \n",
" title \n",
" link \n",
" loc \n",
" age \n",
" \n",
" \n",
" \n",
" \n",
" time \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" 2016-11-17 09:12:00 \n",
" 2450.0 \n",
" 850.0 \n",
" 3.0 \n",
" APARTMENT 3 BR/ 1 BT \n",
" /eby/apa/5873549022.html \n",
" eby \n",
" 2.952374 \n",
" \n",
" \n",
" 2016-11-17 09:12:00 \n",
" 2005.0 \n",
" 790.0 \n",
" 1.0 \n",
" Perfect Cozy One-Bedroom Available Now! Only $... \n",
" /eby/apa/5880750933.html \n",
" eby \n",
" 2.952374 \n",
" \n",
" \n",
" 2016-11-17 09:12:00 \n",
" 1875.0 \n",
" NaN \n",
" 1.0 \n",
" FANTASTIC PLACE - Lovely Yard! \n",
" /eby/apa/5870868630.html \n",
" eby \n",
" 2.952374 \n",
" \n",
" \n",
" 2016-11-17 09:11:00 \n",
" 1525.0 \n",
" 650.0 \n",
" 1.0 \n",
" Large 1 bedroom Washer/Dryer \n",
" /eby/apa/5865407315.html \n",
" eby \n",
" 2.946333 \n",
" \n",
" \n",
" \n",
"2016-11-17 09:11:00 \n",
" 6000.0 \n",
" NaN \n",
" 3.0 \n",
" 3BR/2.5BA Home Panoramic Views (90 Skyway Lane) \n",
" /eby/apa/5880673685.html \n",
" eby \n",
" 2.946333 \n",
"