For my freshman internship at Jio, I was assigned a job that felt rather herculean to me on day one. The commercialization of the geospatial industry has led to an explosive amount of data being collected to characterize our changing planet, and one area ripe for innovation is the application of computer vision and deep learning to extract information from satellite imagery at scale. The problem statement was something along the lines of using Geographical Information Systems to extract features from satellite images in challenging locations like the Dharavi region of Mumbai. Rooftop detection has been done before, and rather extensively, as I found out from a quick literature review. However, it hadn't been done conclusively for very challenging or varied scenarios. The SpaceNet Challenge is a yearly competition in which developers try their hand at building a single solution that detects rooftops across many different landscapes (different cities, shapes of roofs, etc.), and its very existence is a testament to how annoying it can be to make one solution work everywhere. Or that's sort of what I thought when we started out.
My first order of business was to get acquainted with Computer Vision and learn Image Processing from scratch. I took a quick course on Lynda to cover the basics, and ended up learning more than I figured I'd need. This internship, while short, was also my first foray into the work of a researcher: reading paper after paper and trying to understand the solutions. As a freshman, I didn't have enough implementation experience to intuitively grasp the logistics behind each solution, so at first I chose to identify and formulate an architecture that would work, and present that in my findings.
I started by pre-processing satellite images myself. I took the filters I had learned on Lynda and mixed and matched them to understand how hard the task is with basic image processing alone, and what kinds of problems I'd encounter in the process. As one would expect, the main problems were the varied shapes of roofs and the inability of plain image filters to distinguish roofs from lakes (or to do any form of depth estimation, really). Another issue unique to this problem was that, in the satellite images, the little huts in Dharavi mostly had brown roofs that blended in with the ground in many places. It was hard enough for the naked eye to tell them apart, let alone image filters. What I did realize, however, was that a pipeline of different pre-processing techniques could do a pretty respectable job of identifying rooftops.
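To give a concrete flavour of what such a pipeline looks like, here's a minimal sketch in plain NumPy. This is my own toy illustration of the blur-threshold-contour idea, not the exact code from the internship; it also assumes bright rooftops on a darker background, which, as noted above, is exactly the assumption Dharavi's brown roofs break.

```python
import numpy as np
from collections import deque

def mean_blur(gray, k=3):
    """Naive k x k mean (box) blur with edge padding, to suppress noise."""
    pad = k // 2
    padded = np.pad(gray, pad, mode="edge")
    out = np.zeros_like(gray, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + gray.shape[0], dx:dx + gray.shape[1]]
    return out / (k * k)

def binarize(gray, t):
    """Fixed global threshold: pixel is 'roof' if brighter than t."""
    return (gray > t).astype(np.uint8)

def count_blobs(mask):
    """Label 4-connected components via BFS; each blob is a candidate roof."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    count = 0
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                count += 1
                queue = deque([(y, x)])
                seen[y, x] = True
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
    return count

def rooftop_candidates(gray, blur_k=3, t=128):
    """Blur -> threshold -> count bright blobs as candidate rooftops."""
    return count_blobs(binarize(mean_blur(gray, blur_k), t))
```

In a real pipeline you'd use OpenCV's optimized equivalents (`cv2.GaussianBlur`, `cv2.threshold`, `cv2.findContours`), but the structure, and the fragility of the fixed threshold `t`, is the same.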
There are a few problems with the pre-processing-only approach. It isn't one-size-fits-all by any means, and that's something I knew right off the bat. However, it also creates problems that I didn't particularly expect. Firstly, its effect varies greatly with the kind of landscape that's thrown at it, which means areas near water bodies, or ones with a lot of open land, produce inaccurate results; the image above, for instance, draws borders around large areas of green and a small pond as well, which it shouldn't have. Similarly, zoom levels and lighting also become important factors. If the satellite image isn't exactly dialed into the algorithm, that is, if it isn't zoomed to a similar level or is shot under different lighting conditions, the result will be filled with aberrations. What I figured, though, was that if the contouring, blurring, and thresholding levels were tuned to each image, something like this could effectively get the job done for a variety of terrains. The first thing I did to remedy that was to use Otsu's Binarisation instead of a plain binary thresholding algorithm, which ended up being a good solution in the holistic sense. For the other parameters, I couldn't find dynamic solutions at the time.
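For reference, Otsu's method chooses the threshold that maximises the between-class variance of the image's grayscale histogram, which is why it adapts to each image's lighting where a fixed threshold can't. A minimal NumPy implementation of the idea (the same thing OpenCV's `cv2.THRESH_OTSU` flag computes, written out here for clarity) looks like:

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t that maximises between-class variance
    of the grayscale histogram (Otsu's method, 8-bit images)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    mu_total = np.dot(np.arange(256), prob)  # global mean intensity
    best_t, best_var = 0, -1.0
    cum_w = 0.0   # cumulative weight of the background class
    cum_mu = 0.0  # cumulative (unnormalised) background mean
    for t in range(256):
        cum_w += prob[t]
        cum_mu += t * prob[t]
        if cum_w == 0 or cum_w == 1:
            continue  # all pixels on one side: no split to score
        mu_bg = cum_mu / cum_w
        mu_fg = (mu_total - cum_mu) / (1 - cum_w)
        var_between = cum_w * (1 - cum_w) * (mu_bg - mu_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

Pixels above the returned threshold go to one class and the rest to the other, so a strongly bimodal image (roof vs. ground) gets split cleanly regardless of overall brightness.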
This is a note I wrote very recently, as opposed to the rest of this post, which is an amalgamation of my thoughts and results from my internship report. In retrospect, as a final-year student with considerably more experience in the field, I think I focused a lot on pre-processing, which still produced pretty impressive solutions. The base problem with that approach was that it was a dead end in terms of further work: there's no particular roadmap to improvement in a pure pre-processing pipeline unless you create a contouring algorithm that takes depth and spatial information into account. I should have looked for datasets covering other cities with similar compositions; the SpaceNet datasets included imagery of some Chinese cities that could have proved useful. Another approach I was entirely oblivious to was transfer learning. Fine-tuning a ResNet pretrained on ImageNet could have been crucial to building a good convolutional neural network for this task. In fact, that is where the project has headed since my internship period. The team that took over studied my pipeline, refined it, and extended it into an AR application that displays contours on buildings and assigns labels to them.
 Maloof, M., Langley, P., Binford, T. et al. Improved Rooftop Detection in Aerial Images with Machine Learning. Machine Learning 53, 157–191 (2003). [https://doi.org/10.1023/A:1025623527461](https://doi.org/10.1023/A:1025623527461)
 B. Joshi, H. Baluyan, A. A. Hinai and W. L. Woon, "Automatic Rooftop Detection Using a Two-Stage Classification," 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, Cambridge, 2014, pp. 286-291, doi: 10.1109/UKSim.2014.89.
R. Delassus and R. Giot, "CNNs fusion for building detection in aerial images for the building detection challenge," CoRR, vol. abs/1809.10976, 2018.