Referrer/Web Server Log Integration

Overview

The referrer data in your web server logs records where visitors to your official sites came from. This data can be extremely valuable for detecting phishing, because many phishing pages redirect users to the legitimate site of the brand they are targeting after obtaining a user's credentials.

By integrating your HTTP Referrer logs with RiskIQ, especially for high priority / high profile web applications, you can automatically use that data to generate phish events and enhance your external threat detection and management.

HTTP referrer

The HTTP referrer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. By checking the referrer, the new web page can see where the request originated.

In the most common situation this means that when a user clicks a hyperlink in a web browser, the browser sends a request to the server holding the destination webpage. The request includes the referrer field, which indicates the last page the user was on (the one where they clicked the link).

RiskIQ performs automated analysis of the referrer URLs captured in web server logs to determine whether a web page is impersonating a customer's web access functionality for credential theft.

Web server log:

A log file (or several files) that maintains a history of page requests, including the client IP address, request date/time, page requested, HTTP status code, bytes served, user agent, and referrer link.

Log formats:

There are many types of log formats, including Common Log Format, Combined Log Format, Extended Log Format, and more. These vary by web server platform, version, and configuration parameters.

Log Example:

216.219.177.29 - - [15/May/2000:23:03:36 -0800] "GET /login-page.htm HTTP/1.0" 200 3956 "http://www.mywebsite.com/user/accountpage" "Mozilla/2.0 (compatible; MSIE 4.0; SK; Windows 98)"

  • The IP address of your visitor -- 216.219.177.29

  • The date and time of the visit -- [15/May/2000:23:03:36 -0800]

  • The request line, including the file requested -- "GET /login-page.htm HTTP/1.0"

  • The HTTP status code, showing that the request was completed -- 200

  • The number of bytes that were transferred -- 3956

  • Referrer URL -- "http://www.mywebsite.com/user/accountpage"

  • Browser and operating system of the visitor -- "Mozilla/2.0 (compatible; MSIE 4.0; SK; Windows 98)"
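
The field breakdown above can be sketched in Python. This is a minimal illustration, assuming your logs use the same layout as the example entry (Combined Log Format); the pattern and the parse_log_line helper are hypothetical names, and a different log configuration would need a different pattern.

```python
import re

# Layout assumed: host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('216.219.177.29 - - [15/May/2000:23:03:36 -0800] '
        '"GET /login-page.htm HTTP/1.0" 200 3956 '
        '"http://www.mywebsite.com/user/accountpage" '
        '"Mozilla/2.0 (compatible; MSIE 4.0; SK; Windows 98)"')
fields = parse_log_line(line)
print(fields["referrer"])  # prints http://www.mywebsite.com/user/accountpage
```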

Ultimately, only one item from each log entry is needed: the referrer URL, which identifies the page the user was previously on. This information can be submitted to RiskIQ via FTP or API (see below).

Extraction of referrer URLs:

Referrer URLs can be extracted through either a SIEM or another log analysis program.
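
If you have neither tool available, a small script can do the extraction. The sketch below assumes Combined Log Format and, as an optional extra step that is not a RiskIQ requirement, skips empty referrers and referrers from your own hosts. The external_referrers name, the sample lines, and the phish.example host are hypothetical.

```python
import re
from urllib.parse import urlsplit

# In Combined Log Format the quoted referrer follows the status code and byte count.
REFERRER_RE = re.compile(r'" \d{3} (?:\d+|-) "([^"]*)"')

def external_referrers(lines, own_hosts):
    """Return unique referrer URLs, in first-seen order, whose host is not in own_hosts."""
    seen = set()
    result = []
    for line in lines:
        match = REFERRER_RE.search(line)
        if not match:
            continue
        url = match.group(1)
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL
        # Skip empty referrers ("-"), your own pages, and URLs already collected.
        if host is None or host in own_hosts or url in seen:
            continue
        seen.add(url)
        result.append(url)
    return result

sample_lines = [
    '1.2.3.4 - - [15/May/2000:23:03:36 -0800] "GET /login-page.htm HTTP/1.0" 200 3956 "http://www.mywebsite.com/user/accountpage" "Mozilla/2.0"',
    '1.2.3.5 - - [15/May/2000:23:04:10 -0800] "GET /login-page.htm HTTP/1.0" 200 3956 "http://phish.example/login" "Mozilla/2.0"',
    '1.2.3.6 - - [15/May/2000:23:05:00 -0800] "GET /login-page.htm HTTP/1.0" 200 3956 "-" "Mozilla/2.0"',
    '1.2.3.7 - - [15/May/2000:23:06:42 -0800] "GET /login-page.htm HTTP/1.0" 200 3956 "http://phish.example/login" "Mozilla/2.0"',
]
referrers = external_referrers(sample_lines, {"www.mywebsite.com"})
print(referrers)  # prints ['http://phish.example/login']
```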

Submitting Your Logs to RiskIQ

RiskIQ offers two ways to send your data: API or FTP.

Via API

The recommended method is to use the RiskIQ landing page API endpoint to bulk submit URLs from your logs for crawling (landingPage/bulk). Submissions are specified in the request body via the HTTP POST method. For each submission, a response is generated indicating whether the landing page was created and submitted, or the reason for failure.

Each URL submitted via the bulkCreate method must be unique.

No more than 1000 URLs may be submitted per call.
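
Both constraints can be handled before submission. The following is a minimal sketch, not RiskIQ-provided code: the unique_batches helper and the example.test URLs are hypothetical, and only the limit of 1000 comes from the rule above.

```python
def unique_batches(urls, batch_size=1000):
    """Yield lists of unique URLs, each no longer than batch_size."""
    unique = list(dict.fromkeys(urls))  # drops duplicates, keeps first-seen order
    for start in range(0, len(unique), batch_size):
        yield unique[start:start + batch_size]

# 3000 input URLs, of which 2500 are distinct.
urls = [f"http://example.test/page{i % 2500}" for i in range(3000)]
batches = list(unique_batches(urls))
print([len(batch) for batch in batches])  # prints [1000, 1000, 500]
```

Each batch can then be sent as one call.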

Example Submission via Curl

curl -v -H 'Content-Type: application/json' --basic -u <token:key> -X POST -d @example.json 'https://ws.riskiq.net/v1/landingPage/bulk'
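
For scripted submissions, the same call can be sketched with the Python standard library. This mirrors the curl example (HTTP basic authentication, JSON body, POST to the same endpoint); the build_bulk_request helper and the "my-token" / "my-key" values are placeholders, and the request is only built here, not sent.

```python
import base64
import json
import urllib.request

BULK_URL = "https://ws.riskiq.net/v1/landingPage/bulk"

def build_bulk_request(token, key, payload):
    """Build a POST request with a JSON body and an HTTP basic auth header."""
    credentials = base64.b64encode(f"{token}:{key}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        BULK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )

# Placeholder credentials and an empty payload, for illustration only.
request = build_bulk_request("my-token", "my-key", {"entry": []})
# To send it: urllib.request.urlopen(request)
```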

Example JSON Submission Format

{
   "entry": [
     {
       "url": <frontier url>,
       "fields": [
         {
           "name": "description",
           "value": <text string>
         },
         {
           "name": "clickThrough",
           "value": <"true" if click through is requested>
         }
       ],
       "projectName": <project name>,
       "keyword": <referrer url>
     },
     {
       "url": "http://www.testurl.com/somepage.html",
       "fields": [
         {
           "name": "description",
           "value": "Test URL"
         },
         {
           "name": "clickThrough",
           "value": "true"
         }
       ],
       "projectName": "LP - US - Desktop",
       "keyword": "http://www.referringsite.com/somepage.html"
     }
   ]
}
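
Generating this body from code avoids hand-editing mistakes such as unbalanced quotes. The sketch below uses only the field names shown in the format above; the make_entry helper is hypothetical, and leaving out the clickThrough field when click through is not requested is an assumption, since the format only shows the "true" case.

```python
import json

def make_entry(url, keyword, project_name, description, click_through=False):
    """Build one element of the "entry" list."""
    fields = [{"name": "description", "value": description}]
    if click_through:
        # Only the "true" case is documented; the field is omitted otherwise (assumption).
        fields.append({"name": "clickThrough", "value": "true"})
    return {
        "url": url,
        "fields": fields,
        "projectName": project_name,
        "keyword": keyword,
    }

payload = {
    "entry": [
        make_entry(
            url="http://www.testurl.com/somepage.html",
            keyword="http://www.referringsite.com/somepage.html",
            project_name="LP - US - Desktop",
            description="Test URL",
            click_through=True,
        )
    ]
}
print(json.dumps(payload, indent=2))
```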

Via FTP

If you would prefer to use FTP, your RiskIQ technical account manager will create an FTP username and password for your workspace. 

Using those credentials you can upload a spreadsheet of URLs. The file must be tab delimited with columns in this order:

  • URL
  • MD5 (can be blank)
  • Keyword (this is referrer)
  • Custom1 (optional; set up as the incident field for the landing page "Description". This allows filtering in searches to see landing pages from particular use cases, such as proxy logs, referrer logs, or other integrations)
  • Custom2  (repeat if more than one is necessary)
  • .....
  • ProjectID (this will be provided by your RiskIQ TAM; it controls the crawl settings such as the proxy and browser-type that will be used in the crawl)
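
The column order above can be produced with Python's csv module. This is a sketch under stated assumptions: no header row is written, lines end with a newline, and the number of custom columns is whatever you pass in. The write_upload_rows helper and the project ID "12345" are placeholders; confirm the exact file expectations with your RiskIQ TAM.

```python
import csv
import io

def write_upload_rows(rows, out):
    """Write rows of (url, md5, keyword, custom_values, project_id) as tab-delimited text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for url, md5, keyword, custom_values, project_id in rows:
        # Column order: URL, MD5 (may be blank), Keyword, Custom1..N, ProjectID
        writer.writerow([url, md5 or "", keyword, *custom_values, project_id])

buffer = io.StringIO()
write_upload_rows(
    [(
        "http://www.testurl.com/somepage.html",
        None,                                          # MD5 left blank
        "http://www.referringsite.com/somepage.html",  # referrer
        ["referrer logs"],                             # Custom1
        "12345",                                       # placeholder ProjectID
    )],
    buffer,
)
print(buffer.getvalue())
```

In practice you would pass an open file instead of the in-memory buffer and upload that file with your FTP credentials.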

Retrieval of Results

Results may be retrieved via multiple API methods or managed via the Events interface.

Via Landing Pages

The Landing Page controller in the RiskIQ API allows you to pull back all of the submitted URLs (submissions can be either API or FTP) and see the details for each one that was submitted, regardless of whether or not any phishing or other relevant content was identified there.

Landing page submission is asynchronous. When you submit a landing page, the response returns the MD5 as well as a result with some placeholder information. You then have a set of choices for how to get notified when the landing page crawl has completed and been inspected against event policies. These are:

Direct retrieval / Polling

You may query via the get endpoint to see if a landing page has completed. You can tell a landing page crawl has completed by looking at the crawlData block in the result. If the block is missing or empty, then the landing page has not completed. Within the crawlData block, there is also a crawlInspectionResults section. If the crawlData block is populated but this section is empty, the landing page crawl has completed, but the results have not yet been analyzed to create events. The crawlInspectionResults section will contain the event ID of any event created, or tell you that no event was made, either because there was a pre-existing event for this submission or because no policy violation was found.

You can do a GET request on the event ID in order to find more details on it, such as enforcement-related information if there were takedown notices issued on behalf of this event.
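
The completion check described above can be sketched as a small function applied to each polled result. It assumes the result has already been decoded from JSON into a dict and that crawlInspectionResults is nested inside crawlData, as the description suggests; the landing_page_status name and the sample values are hypothetical, so check the exact response shape against the API reference.

```python
def landing_page_status(result):
    """Classify a landing page result as 'pending', 'crawled', or 'inspected'."""
    crawl_data = result.get("crawlData")
    if not crawl_data:
        return "pending"    # block missing or empty: crawl not finished
    if not crawl_data.get("crawlInspectionResults"):
        return "crawled"    # crawl finished, policy analysis not yet done
    return "inspected"      # analysis done: event ID or a no-event reason is available

print(landing_page_status({}))                                        # prints pending
print(landing_page_status({"crawlData": {"placeholder": "value"}}))   # prints crawled
```

A polling loop would call the get endpoint at an interval and stop once the status is "inspected".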

Batch Retrieval

Another option is to retrieve all submitted landing pages in batch, using timestamps to page through recent data. You may use the crawled endpoint to receive all recently crawled landing pages. Just as with direct retrieval, the crawlData and crawlInspectionResults will tell you if the landing page submission / inspection has not yet completed.
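
Paging by timestamp amounts to walking consecutive time windows. The sketch below only generates the windows; how a window's start and end map onto the crawled endpoint's parameters is not covered here and should be taken from the API reference. The time_windows helper is a hypothetical name.

```python
from datetime import datetime, timedelta

def time_windows(start, end, step):
    """Yield consecutive (window_start, window_end) pairs covering start up to end."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end

windows = list(time_windows(
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 2, 30),
    timedelta(hours=1),
))
print(len(windows))  # prints 3
```

Saving the end of the last processed window lets the next run resume where the previous one stopped.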

Ping Back

The final option is for us to notify you either when a particular landing page crawl has completed or when it has both completed the crawl and the policy analysis to potentially generate an event (if applicable). We can send a simple HTTP GET request to a particular URL specified in the landing page submission. Our ping back request contains no information, but upon receiving it you can then use the get endpoint to retrieve the information about that landing page, including the crawlData and crawlInspectionResults.
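
Because the ping back request itself carries no information, the URL you supply has to identify the landing page. A minimal receiver might look like the sketch below, which assumes you chose to embed the MD5 as an md5 query parameter in the ping back URL. That parameter name, the /riskiq/pingback path, the port, and the handler are all hypothetical choices on your side, not RiskIQ conventions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlsplit

completed = []  # identifiers of landing pages that are ready to be fetched

def identifier_from_path(path):
    """Return the 'md5' query parameter from a request path, or None if absent."""
    values = parse_qs(urlsplit(path).query).get("md5")
    return values[0] if values else None

class PingBackHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        md5 = identifier_from_path(self.path)
        if md5:
            completed.append(md5)  # fetch details later via the get endpoint
        self.send_response(204)    # acknowledge with an empty response
        self.end_headers()

def serve(port=8080):
    """Run the receiver until interrupted."""
    HTTPServer(("", port), PingBackHandler).serve_forever()

print(identifier_from_path("/riskiq/pingback?md5=abc123"))  # prints abc123
```

Calling serve() starts the listener; each identifier collected in completed can then be looked up with the get endpoint.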


We also support a global configuration for the ping back URL based on convention, so that we can generate the URL without you having to specify it on each request. Some clients find this less troublesome than submitting the ping back URL with each request. Contact your RiskIQ representative for more information.

Via Events

Events are created only for URL submissions that meet policy / business logic criteria (such as being a phishing page or having blacklisted content), so using events allows you to see only those results, filtering out details about URLs that were inspected but not found to be dangerous. Events can be viewed either in the RiskIQ web interface (https://app.riskiq.net/) or via the RiskIQ API using the events controller.

Direct retrieval / Polling

You can do a GET request on a particular event ID to get information on it, including current status, assigned owner, priority, any tags applied, the first, most recent, and next scheduled crawls, plus information on any enforcement actions taken in relation to this event.

Batch Retrieval / Search

Both the events UI and API support a rich set of filtering capabilities including time of creation, as well as other attributes about the event, including URL, Whois and hosting details, various attributes of the page content, etc.

API POST 

Similar to the Landing Page ping back option, a real-time post option is offered for events as well. With a provided post URL, users can opt to receive a feed of incoming events in real-time. Events may be posted to different URLs based on the event-type.