Scraping Google Images in Rust

At some point in a programmer’s journey, one is bound to encounter web scraping: fetching information from a website in a non-standard fashion. Usually, we resort to web scraping when an API is non-existent, severely lacking in functionality, or simply too expensive.

Some websites are harder to scrape than others, which can be due to a variety of causes such as:

  • Dynamic content lazy-loaded with JS
  • Obfuscated HTML structures
  • IP blocking or rate limiting
  • CAPTCHAs

In this article, I’d like to explain how to scrape full-resolution images from Google in Rust, as the approach is rather different from what you’d expect. Although I could just wrap the Rust code up in a crate and publish it, I’d rather take the time to explain the general procedures to replicate within any language.

Avoiding Selenium

Initially, when I had to write image-grabbing code for ace-rs, I simply thought it was a matter of sending a GET request, selecting all <img> tags, and saving the src="[url]" attributes.

However, after some experimentation, I found that the img tags only contain a base64-encoded JPEG of the heavily compressed and resized thumbnail, rather than a link to the full-resolution source image. On the interactive site, clicking a thumbnail loads the full image via JavaScript.
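
For reference, each thumbnail entry in the raw HTML looks roughly like the following (truncated and simplified; the /9j/ prefix is just how a JPEG’s magic bytes appear in base64):

<img src="data:image/jpeg;base64,/9j/4AAQSkZJRg..." alt="...">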

After encountering this issue, I considered just hacking together Selenium code in a headless instance, but I decided to keep that as a last resort due to its disadvantages compared to direct HTTP scraping:

  • It suffers in performance, since the entire web page must be rendered and processed
  • It requires heavy dependencies such as browser drivers and adds unwanted platform-specific setup complexity

Finding the data source

So I decided to keep digging around the client-rendered HTML, network requests, and scripts. My best bet was to carefully check for any inline “data sources” that the JS relied on.

Aha! After carefully navigating through countless script tags containing miscellaneous JavaScript calls, I finally struck gold. This AF_initDataCallback function seemed to have just what I was looking for, albeit in a deeply nested format with no clear keys in the JSON.

By putting it into a JSON prettifier and probing for any consistent patterns, I noticed the target URL was always embedded in an array matching the following constraints:

  • First value (target URL) → type String
  • Second and third values (irrelevant) → type Number

One last filter I needed to apply was excluding all links that began with https://encrypted-tbn0.gstatic.com — the cached thumbnails.
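
Putting these constraints together, a matching entry in the prettified blob looks something like this (the values here are invented for illustration):

[
  "https://example.com/photos/full-resolution.jpg",
  1080,
  1920
]

Thumbnail entries follow the same three-element shape but begin with the gstatic prefix, which is how we tell them apart.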

Implementing with Rust

With the theory out of the way, let’s dig right into the code! For this article, we’ll need the following dependencies first:

[dependencies]
anyhow = "1.0.71"
tokio = { version = "1", features = ["full"] }
regex = "1.8.4"
reqwest = "0.11.18"
serde = "1.0.163"
serde_json = "1.0.96"
json5 = "0.4.1"

  • json5 is needed because the embedded JSON contains unquoted keys
  • reqwest is an easy-to-use HTTP client
  • tokio supplies the async runtime that reqwest relies on
  • serde handles all parsing and de-serialization purposes
  • anyhow allows us to use ? (error propagation) with multiple types of Error values

Let’s begin by defining an asynchronous function called fetch_images that outputs a Vec of URLs wrapped in anyhow::Result. We embed the query inside the Google Images URL, build a client with a custom user-agent (which avoids flagging their systems), and then execute the GET request to fetch the main HTML.

use anyhow::{Context, Result};
use regex::Regex;
use serde_json::Value;

async fn fetch_images(query: &str) -> Result<Vec<String>> {
    let url = format!("https://google.com/search?q={query}&tbm=isch");
    // a browser-like user-agent avoids flagging their systems
    let client = reqwest::Client::builder()
        .user_agent("Mozilla/5.0 (Linux; Android 9; SM-G960F Build/PPR1.180610.011; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/74.0.3729.157 Mobile Safari/537.36")
        .build()?;
    let content = client.get(&url).send().await?.text().await?;
    // ...
}

Now we use a regex to pinpoint the JSON data inside the AF_initDataCallback call mentioned earlier, parse the captured value with json5, and pull out the data field. Index 56 seems to be the general area where the target URLs live.

let regex = Regex::new(r"AF_initDataCallback\((\{key: 'ds:1'.*?)\);</script>")?;
if let Some(cap) = regex.captures(&content).and_then(|c| c.get(1)) {
    // the payload uses unquoted keys, so plain serde_json won't parse it
    let json: Value = json5::from_str(cap.as_str())?;
    let decoded = &json.get("data").context("no `data` key in payload")?[56]; // unorganized raw data
    // TODO: filter ...
}

Ok(vec![])

Sweet, now a huge blob of messy JSON data with mostly unwanted information is at our fingertips. All we have to do now is apply the previously mentioned filters to extract what we’re looking for.

Let’s first define a helper function filter_nested_value that takes this blob and converts it into something we can work with. How? It recursively collects only the arrays with exactly three elements, which we can then narrow down with our conditions.

// recursively collect every three-element array nested anywhere in the blob
fn filter_nested_value(value: &Value) -> Vec<&[Value]> {
    match value {
        // candidate found: exactly three elements
        Value::Array(arr) if arr.len() == 3 => vec![arr.as_slice()],
        // otherwise, keep descending into arrays and objects
        Value::Array(arr) => arr.iter().flat_map(filter_nested_value).collect(),
        Value::Object(obj) => obj.values().flat_map(filter_nested_value).collect(),
        _ => vec![],
    }
}
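
To make the helper’s behavior concrete, here’s a small test against a made-up blob (the shape is invented purely for illustration, not real Google data):

#[cfg(test)]
mod tests {
    use super::*;
    use serde_json::json;

    #[test]
    fn collects_three_element_arrays() {
        // an invented, deeply nested blob
        let blob = json!([{ "a": [["url", 1, 2], [3]] }, ["x", "y", "z"]]);
        let found = filter_nested_value(&blob);
        // both ["url", 1, 2] and ["x", "y", "z"] are collected;
        // only the first would survive the String/Number/Number filter below
        assert_eq!(found.len(), 2);
    }
}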

Back to handling the original decoded variable. This part is quite straightforward with a clever match statement using a match guard.

let urls: Vec<String> = filter_nested_value(decoded)
    .into_iter()
    .filter_map(|arr| match arr {
        [Value::String(string_val), Value::Number(_), Value::Number(_)]
            if !string_val.starts_with("https://encrypted-") =>
        {
            Some(string_val.to_string())
        }
        _ => None,
    })
    .collect();

return Ok(urls);
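
Finally, here’s a minimal main to tie everything together (the query is arbitrary, and we only print the first few results):

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let urls = fetch_images("rust crab").await?;
    // print the first five full-resolution URLs
    for url in urls.iter().take(5) {
        println!("{url}");
    }
    Ok(())
}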

Conclusion

Although perhaps not the most complicated “reverse-engineering” feat, the outlined approach to scraping images from Google is undoubtedly involved and unintuitive.

Another risk of web scraping we must acknowledge is the lack of stability. This code is not guaranteed to work years down the line if Google restructures the page. However, it remains our best (free) option, as the official Image Search API was deprecated in 2011.

Hope this article helped! Here’s some further reading if you’re interested: