In one of my grad school courses, we had a semester-long challenge to design an agent that could solve Raven’s Progressive Matrices problems – visual logic problems used to evaluate human intelligence. In these problems, the goal is to find a solution that satisfies the visual analogies in both the horizontal and vertical directions.
Our goal was to design an agent that could rival human performance on basic problems. I was particularly proud of my agent’s performance after months of work, so I’m including it here to show a slightly different type of project than the others in my portfolio.
An example 3×3 Raven’s Progressive Matrices problem where the answer is 1.
Problem Setup and Preprocessing
Using only the raw problem images as input, the agent was expected to output the number of the image it determined was the answer. To start, my agent read the images using Pillow and converted the pixel data to numpy arrays, which allowed for faster vectorized calculations throughout the logic.
After using Pillow’s numpy conversion, I simplified the 3-dimensional RGB array into a 2-dimensional array of 0s and 1s for white and black pixels. This required picking a gray threshold value to determine which pixels would be considered black and which white. I then further optimized the problem by downsampling each array by 50% in width and height, which improved runtimes by 4x.
Converting a sample circle into an RGB 3D array before flattening it to a 2D binary array.
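A minimal sketch of that preprocessing step, assuming a gray threshold of 128 and naive every-other-pixel downsampling (the exact threshold and downsampling method I used aren’t recorded here):

```python
import numpy as np
from PIL import Image

GRAY_THRESHOLD = 128  # assumed cutoff; pixels darker than this count as black

def load_binary(path, downsample=2):
    """Read an image, binarize it (1 = black, 0 = white), and downsample."""
    rgb = np.array(Image.open(path).convert("RGB"))    # H x W x 3 uint8 array
    gray = rgb.mean(axis=2)                            # collapse RGB to grayscale
    binary = (gray < GRAY_THRESHOLD).astype(np.uint8)  # dark pixels become 1
    return binary[::downsample, ::downsample]          # keep every 2nd pixel: 50% per side
```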
Transformations
The agent’s basic approach was to apply many potential transformations to the problem images and determine which variation best matched the problem. To create these transformations, I first built ten building-block transformations:
- Rotation – Turning an image by a specified number of degrees.
- Reflection – Flipping an image either horizontally or vertically.
- Overlay – Combining the black pixels of two images.
- Intersection – Creating an image with only the black pixels present in both input images.
- Difference – Creating an image with the black pixels found in only one of the input images.
- Fill – Changing a contiguous region of white pixels to black, starting from a point.
- Shift – Creating a new image by moving an input image horizontally and/or vertically.
- Center – Giving black pixels equal padding both horizontally and vertically.
- Scale – Increasing or decreasing an image in size by a specified amount.
- Duplicate half of the image – Creating an image where the left, right, top, or bottom half is copied onto the opposite half.
These building block transformations were then used alone or combined in creative ways to generate dozens of potential transformations.
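To make the building blocks concrete, here is a hedged sketch of a few of them operating on the 0/1 arrays from preprocessing. The names and implementations are my reconstruction for illustration, not the agent’s actual code:

```python
import numpy as np

def overlay(a, b):
    """Black wherever either input is black."""
    return np.maximum(a, b)

def intersection(a, b):
    """Black only where both inputs are black."""
    return np.minimum(a, b)

def difference(a, b):
    """Black pixels of a that are not black in b."""
    return a * (1 - b)

def reflect(img, axis):
    """Flip vertically (axis=0) or horizontally (axis=1)."""
    return np.flip(img, axis=axis)

def shift_img(img, dy, dx):
    """Move the image dy rows down and dx columns right; pixels pushed off
    the edge are lost, and newly exposed pixels are white (0)."""
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):min(h + dy, h), max(dx, 0):min(w + dx, w)] = \
        img[max(-dy, 0):min(h - dy, h), max(-dx, 0):min(w - dx, w)]
    return out

def center(img):
    """Re-center the black pixels with equal padding on each side."""
    ys, xs = np.nonzero(img)
    if ys.size == 0:  # blank image: nothing to center
        return img
    crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    out = np.zeros_like(img)
    y0 = (img.shape[0] - crop.shape[0]) // 2
    x0 = (img.shape[1] - crop.shape[1]) // 2
    out[y0:y0 + crop.shape[0], x0:x0 + crop.shape[1]] = crop
    return out
```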
Transformations Example
To show how the transformations can be combined to change an image, consider the example above. The Start image can be changed into the End image in 7 transformation steps:
1. Shift the starting image up until the upper circle is offscreen. Any part of the image shifted offscreen is removed, and any new area shifted onscreen is blank (white).
2. Shift the image down until the bottom circle is offscreen.
3. Shift the image right until the right circle is offscreen.
4. Shift the image left until the final circle is offscreen.
5. Center the remaining image (just the inner diamond).
6. Scale that diamond down slightly.
7. Take the difference between the centered diamond from step 5 and the scaled-down diamond from step 6. The red part of the image is the removed difference; the remaining shape is the final image.
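Expressed with the helpers sketched earlier, the chain might look like the following. The file name, shift distances, scale factor, and the `scale` helper itself are illustrative assumptions, not values from the actual agent:

```python
img = load_binary("start.png")      # hypothetical file name

img = shift_img(img, -60, 0)        # 1. shift up until the top circle is offscreen
img = shift_img(img, 60, 0)         # 2. shift down until the bottom circle is offscreen
img = shift_img(img, 0, 60)         # 3. shift right until the right circle is offscreen
img = shift_img(img, 0, -60)        # 4. shift left until the final circle is offscreen

diamond = center(img)               # 5. re-center the remaining inner diamond
smaller = scale(diamond, 0.9)       # 6. scale it down slightly (assumed helper)
end = difference(diamond, smaller)  # 7. the thin outline that remains
```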
Answer Evaluation
The agent identified the visual analogies by applying its known transformations to the problem both horizontally and vertically and evaluating which ones matched. For example, in the problem below, the agent would apply many transformations to A and B and see how well the resulting image matched C. For a transformation that appeared to match well, it would then check whether the same transformation on D and E resulted in F. If it found a transformation that matched both rows, it applied it to the final row to see whether a resulting image was a potential answer. This process was then repeated vertically to ensure the same answer satisfied both directions.
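The exact similarity metric isn’t spelled out here, but a plausible sketch is simple pixel agreement between two binary images:

```python
import numpy as np

def similarity(a, b):
    """Fraction of pixels on which two binary images agree (1.0 = identical)."""
    return float(np.mean(a == b))
```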
For example, in the above problem, the agent first tries ignoring the first column (many 3×3 problems can actually be solved as 3×2 problems). It begins applying transformations to B, some of which are shown in the first row below. Each of these transformed B images is then compared to C to see how well the transformation matches the problem, i.e., satisfies the visual analogy.
The same transformations are then applied to the second row: image E is transformed, and the resulting images are compared to F. This is shown in rows 3 and 4 below. The average of these two similarity scores measures how well a transformation satisfies the entire problem. In this case, duplicating the right half of the image produces a perfect match in both rows.
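In code, that averaging step might look like this, assuming `transforms` is a dict mapping names to single-image functions and B, C, E, and F are the binarized problem images:

```python
# Score each candidate transform by how well it maps B -> C and E -> F,
# averaging the two rows; higher means a better fit for the whole problem.
row_scores = {name: (similarity(t(B), C) + similarity(t(E), F)) / 2
              for name, t in transforms.items()}
best_name = max(row_scores, key=row_scores.get)
best_t = transforms[best_name]
```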
Finally, the same transformations are applied to the final row. Each transformation is applied to image H, and each resulting transformed H is compared to every candidate answer to see whether it matches one of them. In the example below, this is visualized by overlaying each transformed H over each answer. The similarity scores across all steps are combined to find the agent’s answer. In this problem, the first two rows had “Duplicate image right half” as the best transformation, and in the final step, that same transformation applied to H matched a candidate answer, so the agent chose it.
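A sketch of that final step, assuming `candidates` holds the binarized answer images in order:

```python
# Apply the winning transform to H and choose the candidate answer that
# matches the prediction most closely (answers are numbered from 1).
predicted = best_t(H)
answer = max(range(len(candidates)),
             key=lambda i: similarity(predicted, candidates[i])) + 1
```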
The above example is a straightforward one where the agent can cleanly find an answer. But I also added many optimizations to help the agent when its known transformations couldn’t produce an answer, such as ranking each candidate answer by its probability of being correct and making an “educated guess.”
Performance
This agent correctly answered every problem it was presented with across the B, C, D, and E sets of the Raven’s Progressive Matrices. There were additional advanced problems that the agent struggled with, since its knowledge base of transformations wasn’t sufficient for them. However, the agent’s basic architecture would allow it to tackle any visual analogy given a sufficient transformation knowledge base.