Over the last half-decade, the market for vacation rentals has largely consolidated onto a handful of vacation rental websites. The days of having a single listing on one website are over, as people list their properties on multiple different websites in the hopes of attracting customers. The process of uploading the listing of each house on each of the websites can become quite tedious and repetitive. If you are listing a property, you have to walk through and get a tally of all of the different amenities in the house. You also have to make sure that you don't miss anything, or else your engagement will fall because of a sparse amenity list. The natural next step is to find a way to automate this process and make it easier for people to list their property or properties on different vacation rental websites without wasting time manually recounting and uploading the same data over and over again.
Homie is a platform that seeks to solve this problem. With a few clicks, users can upload all of the pictures within their home, and we will automatically generate a listing of all of the amenities. If users have multiple properties, they can manage the amenities within each property and easily export the amenities for each one to the popular vacation rental websites. With Homie, the time spent going through each room in each property is dramatically reduced. You only need to take pictures of each room once and upload them to our website, and we will keep track of all of that information for you.
The user interacts with the platform through a website. The front end was built on top of the Bootstrap framework using HTML, CSS and JavaScript. The user has access to 4 pages:
The website is served by an Express server running in Node.js, which we decided to use because we had prior experience building a web app with this framework. The user interacts with the server through the website and form submissions. The user can post information by registering as a new user and creating new properties, and can get information by logging in and viewing properties.
The crux of the platform is the labeling of properties' amenities from their photos. This is accomplished through a Detectron2 object detection model (more on this below). The ML model is served through a Flask app running in Python, chosen because of its ease of use and because the model has a convenient Python API. The model is initialized at server start to allow for quick inference. The server accepts requests containing image URLs, downloads the images locally, and passes the local image paths to the model. The model outputs 50 predictions, each of which contains the predicted class, the confidence score, and the bounding box on the image. Note that the same class can appear more than once in this list. We then take all the predicted classes with a confidence score above 20% and union them across all the input images. This gives us an aggregated list of labels for the property. We also annotate the input images with bounding boxes, and send their URLs back in the API response.
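Below is a minimal sketch of what this inference endpoint could look like, assuming a fine-tuned RetinaNet checkpoint and a /predict route; the class list, weight path, and route name are illustrative, not the exact code behind Homie.

```python
import cv2
import requests
from flask import Flask, request, jsonify
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

AMENITY_CLASSES = ["Bathtub", "Bed", "Billiard table"]  # ...the full 30-class list (assumed)
MODEL_WEIGHTS = "model_final.pth"                       # fine-tuned RetinaNet weights (assumed path)

# Initialize the model once at server start so each request only pays inference cost.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_50_FPN_3x.yaml"))
cfg.MODEL.RETINANET.NUM_CLASSES = len(AMENITY_CLASSES)
cfg.MODEL.WEIGHTS = MODEL_WEIGHTS
cfg.TEST.DETECTIONS_PER_IMAGE = 50   # keep the top 50 detections per image
predictor = DefaultPredictor(cfg)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    image_urls = request.json["image_urls"]
    labels = set()
    for url in image_urls:
        # Download the image locally, then run inference on the pixel array.
        local_path = "/tmp/input.jpg"
        with open(local_path, "wb") as f:
            f.write(requests.get(url).content)
        instances = predictor(cv2.imread(local_path))["instances"]
        for cls, score in zip(instances.pred_classes, instances.scores):
            if float(score) > 0.2:   # keep predictions above the 20% confidence threshold
                labels.add(AMENITY_CLASSES[int(cls)])
    # The union of labels across all images becomes the property's amenity list;
    # annotated-image generation and upload are omitted here for brevity.
    return jsonify({"amenities": sorted(labels)})
```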
For database storage we use MongoDB, chosen due to prior experience with the technology, and also because the document-oriented database model suited our data access needs better. User and property information are stored in MongoDB under 3 collections: User, Property and Misclassified. The User collection has objects that store a user's username, email, (hashed) password, and a list of properties. The properties are stored as IDs which reference objects of the Property collection. A property object has as its fields the name of the property, the amenity labels for the property (aggregated over all the photos) and the model's predicted labels, which may differ from the property's labels due to potential user updates. This collection also stores the URLs for the original photos and annotated photos (with bounding boxes around the objects). Finally, we have a collection for misclassified examples. Whenever a user updates the labels for a property, the user's labels are taken as ground truth; given the ground truth, the false positives and negatives of the model's prediction are recorded in this collection, to be used later for online training. Only the Express server interacts with the MongoDB database, and it does so through mongoose, a library built on the official MongoDB Node.js driver.
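For concreteness, here are illustrative shapes of the three kinds of documents, written as plain Python dictionaries; the field names are assumptions about the actual mongoose schemas, not the real code.

```python
# Illustrative document shapes only; the real schemas are defined in mongoose.
user_doc = {
    "username": "jane",
    "email": "jane@example.com",
    "password": "<bcrypt hash>",                      # stored hashed, never in plain text
    "properties": ["<Property ObjectId>"],            # references into the Property collection
}

property_doc = {
    "name": "Beach House",
    "labels": ["Bed", "Television"],                  # current amenity labels (user-editable)
    "predicted_labels": ["Bed"],                      # what the model originally predicted
    "photo_urls": ["https://res.cloudinary.com/..."],
    "annotated_photo_urls": ["https://res.cloudinary.com/..."],
}

misclassified_doc = {
    "property": "<Property ObjectId>",
    "false_positives": ["Fireplace"],                 # predicted by the model, rejected by the user
    "false_negatives": ["Television"],                # missed by the model, added by the user
}
```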
For image storage we use Cloudinary, a cloud image storage solution that has seamless integration with Node.js and Python, allowing easy download and upload of images. We pass image URLs in the server requests and responses, and not image files/pixel arrays, so as to minimize payload size and potential errors. We also use multer, a Node.js middleware for handling file uploads through HTML forms.
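As a rough sketch of the Python side of this round trip (the configuration values, folder name, and helper names are illustrative assumptions), the Flask service could fetch a photo by URL and push the annotated version back to Cloudinary like this:

```python
import requests
import cloudinary
import cloudinary.uploader

cloudinary.config(cloud_name="<cloud>", api_key="<key>", api_secret="<secret>")

def fetch_image(url, local_path="/tmp/input.jpg"):
    # Download the source photo from its Cloudinary URL.
    with open(local_path, "wb") as f:
        f.write(requests.get(url).content)
    return local_path

def upload_annotated(local_path):
    # Upload the bounding-box-annotated image and return its hosted URL.
    result = cloudinary.uploader.upload(local_path, folder="homie/annotated")
    return result["secure_url"]
```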
After successfully hosting our web app locally, our next stage was to deploy it so that users all over the world could access it. This was done by using Docker Compose to spin up three containers: one for the Flask app and model, one for the Node app, and one for MongoDB. Docker allowed us to wrap up all our requirements in one box, making the process of hosting our application on Google Cloud seamless. We leveraged GCP's Container Registry as well as a virtual machine to deploy our app.
The main ML component of our platform is an object detection model that is able to detect 30 different amenities within images. For the model, we took inspiration from Daniel Bourke's AirBnB Amenity Detection project. The implementation is based upon the popular RetinaNet architecture from Facebook's Detectron2 library. The model was trained on the OpenImages dataset with 35,694 images scraped from the internet that all contain at least one of the 30 amenity classes. As a starting point, we used Daniel Bourke's pretrained model for amenity detection.
The main downfall of Daniel Bourke's approach is that the dataset is not representative of the data we would receive once deployed. Most of the training images are random photos that happen to contain a table or a chair. In reality, our model will be receiving pictures of rooms similar to the pictures that are often seen on AirBnB. To account for this disparity, we decided to implement online learning, so that the model can continuously learn from the feedback it receives from its users.
Our model continually improves as people continue to use the product. We do this by storing information about which predictions were correct and which amenities we missed in our predictions. Once we aggregate a certain amount of data from our users, we create a new instance of our model with the most current weights. In order to train the model, we need the images, the classes in each image, and the bounding boxes for each image. False positives are fairly straightforward, as we just remove all of the predictions for that class. The biggest hurdle is learning from the false negatives, because we do not ask our users to draw bounding boxes for our incorrect predictions. The way that we approximate the bounding boxes is by looking at the entire list of predictions for the images of the property. We take the highest-confidence bounding box for each false-negative class across all of those images and use that as our new label. This is not a perfect method, as the bounding box isn't always correct. However, the most important part of our product is that the model predicts the right class, not necessarily the right bounding box. Once we have the new dataset, we fine-tune the model on the examples, and then overwrite the previous weights. We then reinitialize our live model using the new weights.
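A minimal sketch of how these pseudo-labels could be assembled from the stored feedback is shown below; the prediction fields and helper name are illustrative assumptions rather than the exact code.

```python
from collections import defaultdict

def build_training_labels(predictions, false_positives, false_negatives):
    """predictions: list of dicts like {"image": path, "class": name, "score": s, "box": [x1, y1, x2, y2]}
    false_positives / false_negatives: sets of class names from the user's corrections."""
    # False positives: drop every prediction whose class the user rejected.
    kept = [p for p in predictions if p["class"] not in false_positives]

    # False negatives: approximate a box by reusing the highest-confidence
    # prediction of that class across all of the property's images.
    best = {}
    for p in predictions:
        c = p["class"]
        if c in false_negatives and (c not in best or p["score"] > best[c]["score"]):
            best[c] = p
    kept.extend(best.values())

    # Group the boxes per image to form per-image annotations for fine-tuning.
    per_image = defaultdict(list)
    for p in kept:
        per_image[p["image"]].append({"bbox": p["box"], "category": p["class"]})
    return per_image
```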
Since this is an object detection problem, our main method for evaluation is average precision (AP). Average precision is formally defined as the integral of the precision curve with respect to recall, reported here on a 0-100 scale. For the optimized model, the AP was 42.788. For comparison, a vanilla RetinaNet model that was fine-tuned on the dataset for 500 epochs achieved an AP of 10.36. As one can see, Daniel Bourke's model was optimized heavily for this dataset. From here, we decided to look into AP by class.
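As a quick illustration of that definition (real benchmarks such as COCO-style AP use interpolated precision, so this is only a sketch of the underlying idea):

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under the precision-recall curve, scaled to a 0-100 range."""
    order = np.argsort(recalls)
    r, p = np.asarray(recalls)[order], np.asarray(precisions)[order]
    return 100.0 * np.trapz(p, r)   # integrate precision with respect to recall

# Toy curve: perfect precision up to recall 0.5, then a sharp drop-off.
print(average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.2]))  # ~80.0
```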
Looking at the per-class results, it is easy to recognize the disparity between classes: Billiards have an AP of 80.617 while Showers have an AP of 2.575. One possible reason for the spread in AP is that there are far fewer shower examples in the train set, or that there are so few shower examples in the test set that the numbers are skewed. We tested this theory by seeing how the number of examples per class correlates with the accuracy of its predictions.
By looking at these plots, it is not clear that adding more examples of classes with poorer performance will necessarily increase the performance of the model. To verify this theory, we decided to perform data augmentation on the lowest-scoring classes. To do this, we gathered images that contained at least one of the bottom 6 classes, flipped each image, and then applied a Gaussian blur to add noise. We then fine-tuned on this dataset to see if feeding more, tougher examples of the lower classes would increase the overall AP score. Fine-tuning for 50 epochs, we get an AP of 42.829 (an increase of 0.041). However, if we break it apart by class once more, we see that the AP scores of the lowest classes remained relatively unchanged.
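For reference, a small sketch of that augmentation step, assuming OpenCV and pixel-coordinate boxes (the function name is illustrative):

```python
import cv2

def augment(image_path, boxes):
    """boxes: list of [x1, y1, x2, y2] in pixel coordinates for one image."""
    img = cv2.imread(image_path)
    w = img.shape[1]
    flipped = cv2.flip(img, 1)                       # horizontal flip
    blurred = cv2.GaussianBlur(flipped, (5, 5), 0)   # Gaussian blur to add noise
    # Mirror the bounding boxes so they still line up with the flipped image.
    new_boxes = [[w - x2, y1, w - x1, y2] for x1, y1, x2, y2 in boxes]
    return blurred, new_boxes
```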
Of the 6 classes that we augmented and fine-tuned on, 3 saw a decrease in AP. So, the disparity between classes is most likely attributable to a deficiency in the architecture of the model and not necessarily the data that is being fed in. Another possible explanation is that the training images are much harder than those our model will most likely be serving. This is because the images found on popular vacation rental websites clearly display all of the different amenities. Our training set was not made for the purpose of amenity detection, but is merely an aggregation of images that happen to contain the different amenities.
The last point of emphasis for us, especially as it pertains to our final product, is how our model performs for specific types of rooms. To do this, we created 5 subsets of the classes: bedroom, living room, kitchen, bathroom, and outdoor, and we compared the APs between the groupings.
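One simple way to make this comparison is to average the per-class APs within each grouping; the groupings below are illustrative and do not cover all 30 classes.

```python
ROOM_GROUPS = {
    "bedroom":     ["Bed", "Pillow", "Mirror"],
    "living room": ["Couch", "Television", "Fireplace"],
    "kitchen":     ["Dishwasher", "Oven", "Refrigerator"],
    "bathroom":    ["Shower", "Bathtub", "Toilet", "Towel"],
    "outdoor":     ["Swimming pool", "Porch", "Fountain"],
}

def ap_by_room(per_class_ap):
    """per_class_ap: dict mapping class name -> AP on the test set."""
    return {room: sum(per_class_ap[c] for c in classes) / len(classes)
            for room, classes in ROOM_GROUPS.items()}
```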
We see that there actually is quite a large gap between different types of rooms: bedrooms see an AP approximately 20 points higher than bathrooms. This can be quite valuable information for the model and the platform itself. If we have prior information about what kind of room we are making a prediction for, we can adjust our model so as to maximize AP.
Qualitatively, we can find some patterns in what the model seems to be missing. To do so, it is more helpful to feed in images that are representative of what the model will see once deployed. By looking at some sample inputs of rooms, there is one clear pattern of misses by the model. Below is a sample output from the model. The model does a fair job at making the predictions. However, it is unable to predict the dishwasher. The likely explanation is that the couch partially occludes it from view. This is quite typical of pictures of rooms, as it is hard for people to fit everything in the pictures perfectly. In the wild, this can hurt the accuracy of the model, and it is an area for future improvement. This can hopefully be helped by the online learning of our model, or a new dataset that specifically contains partially occluded amenities.
We asked close friends to evaluate our system on the following questions at different points in our app iteration process.
For an earlier version of the app, we received feedback highlighting that users were not able to change the labels assigned by the model to a property. This led us to add a user update functionality. For our last version of the app, here are the summary statistics gathered from 5 respondents, along with some insightful comments we received:
We realize that model prediction speed is an avenue for improvement, which could be addressed using multithreading or GPU parallelism. Additionally, we tested whether our website can handle multiple simultaneous requests to the same routes (e.g. creating two properties from different accounts at the same time), and it handled them without significant time delays.
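A tiny version of that concurrency check could look like the following; the base URL, route, and payloads are hypothetical placeholders, not the routes of our actual server.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://<server-ip>:3000"   # placeholder address of the Express server

def create_property(cookies, name):
    # Time a single property-creation request made with a given session.
    start = time.time()
    resp = requests.post(f"{BASE_URL}/properties", cookies=cookies, data={"name": name})
    return name, resp.status_code, time.time() - start

# Fire two property-creation requests from different accounts at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = pool.map(create_property,
                       [{"session": "<user1 cookie>"}, {"session": "<user2 cookie>"}],
                       ["Beach House", "Mountain Cabin"])
    for name, status, elapsed in results:
        print(f"{name}: HTTP {status} in {elapsed:.2f}s")
```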
We go to the static IP address of our server running on GCP. The homepage gives us the option to log in or register. I'm just going to log in to my existing account, and I am taken to the user landing page, with a navigation bar displaying my username and a logout button. In the middle I can see some of the properties I have created before, with links to their individual pages. Say I want to put my house up on a property listing website; so I click Create New Property:
Here I specify the name of the property, and upload as many photos of the property as I want. Then I click Create Property and let the model work its magic!
After a short wait, I land on the property viewing page that was just created. On this page I can see all the images I uploaded, with an option to show and hide the bounding boxes for each predicted amenity. I also see on the side all the amenities our model has found, and the rest of the possible labels. There is also a textbox with a plain-text list of all the amenities, for easier copying. The amenities that our model has found are pre-checked, allowing for easier modification if need be.
If I think there's something wrong with the predicted labels, I can easily check and uncheck and click Update. The website refreshes to show me an updated property page, making the necessary changes in the database as well. Furthermore, the mislabeled classes (and the associated photos) are sent to a database as feedback for online model training.
I can also go back to my landing page and view properties I have created before.
With this project, we streamline the use of a state-of-the-art object detection model for the end user, building a platform to create and manage multiple properties. Our contribution makes the model easy to interact with from an end-user perspective, and gives it a compelling use case once fully implemented.
Even though our final product is about amenity detection, our initial plan was to build an on-device landmark detection app. We even found a TFLite model for the task that was small in size and performed quick inference. However, we discovered a few days before the MVP deadline that the model had some version inconsistencies, making its output completely unusable. We reached out to the engineers from Google who maintained the model. The engineers actually responded, acknowledged the version issue, and told us that they would fix it. Even though the model was not fixed within the time frame of our project, we were happy to have contributed to surfacing that issue. Thankfully for us, the MVP code we wrote was properly modularized and the tasks were similar enough that we were able to pivot quickly and adapt most of the code from the landmarks project to the amenities project.
We chose to have separate servers for the website and the model, because we had experience building websites with Express servers, and using ML models with Python. We realize that the entirety of the infrastructure could have been served by a single server, but we did not focus on this integration because we did not suffer from any major latency issues arising from server-to-server interactions.
There were a couple of extensions we would have worked on if we had more time and resources:
The intended use of this application, as previously alluded to, is for people to have one place to store all of the information regarding their properties. In the future, our product would seamlessly integrate with popular vacation rental websites, so that users can drastically cut the amount of time they spend doing repetitive tasks. One possible downfall and associated risk for our product is that it makes it easier for people to catfish on the popular vacation rental websites. Our final product would let people upload their properties within a few clicks because of our integration. People could then use our application to upload thousands of fake listings on different websites in a fraction of the time. This is something that we would want to prevent. One way to do this is to add a reCAPTCHA to mitigate the impact that bots have on our product.
Another way that our product could be manipulated is by gaming the online learning algorithm. Right now, there are no safeguards for the online learning component. So, in theory, users could give a lot of incorrect information about false positives and false negatives. This could throw off the algorithm drastically, as there is not yet a critical mass of users to counterbalance this effect. One way to fight this is by keeping our own internal test set with images that we consider representative of what our model would be seeing. Whenever we do our online learning, we could then evaluate on this test set and make sure that the retrained model performs above a given threshold. If it doesn't, that means that the feedback we have been given may not be entirely correct. We can also run regular statistical tests for distribution shift between our training set and the labels users provide as ground truth. Any significant differences might require further investigation.
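A sketch of that promotion check, assuming a helper that evaluates a weights file on the internal held-out set (the threshold value and function names are illustrative):

```python
import shutil

AP_THRESHOLD = 40.0   # assumed minimum acceptable AP on the internal test set

def maybe_promote(new_weights_path, live_weights_path, evaluate_ap):
    """evaluate_ap: callable returning the AP of a weights file on the held-out set."""
    ap = evaluate_ap(new_weights_path)
    if ap >= AP_THRESHOLD:
        # The retrained model clears the bar, so overwrite the live weights.
        shutil.copyfile(new_weights_path, live_weights_path)
        return True
    # Otherwise keep the old weights and flag the feedback batch for manual review.
    return False
```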
Kaan Ertas: UI, backend infrastructure, GCP deployment
Pujan Patel: Backend infrastructure, GCP deployment, ML model deployment
Max Pike: GCP deployment, ML model iteration/deployment/online learning