My Journey

About the Project
In the last eight weeks, I was part of a cross-functional team of front-end developers, back-end developers, and data scientists building an application called CitySpire. This was the culmination of all our training here at Lambda: a chance to put together what we had learned and design, build, and deploy an application from scratch.
Our main aim was to create an app that analyzes city data such as population, weather, rental rates, crime rates, parks (walk score), hospital ratings, school ratings, and many other social and economic factors that are important when deciding where to live.
We had to work together cohesively to plan the features we wanted to implement, which releases we could finish by certain deadlines, which tech stacks we were going to use, and what information was most important to users.
Planning Phase
On the first day, I met with the team of developers and data scientists I was going to work with. I was hesitant at first, letting others converse while I just observed, but within minutes I joined in and discussed several user stories and what they meant for our respective teams. For example, the user story I worked on was "As a user, I want to see average rental/house prices for a given city/zip code." From a data scientist's perspective, this user story boils down to: what is the predicted rental price of a home in a given zip code/county over the next few months?
As a team, we also decided that rental prices would be exposed as a separate endpoint, i.e., a user can hit this endpoint and get information about rental prices in their desired locality.
A good way to ensure everyone remains focused on cross-team goals and stays up to date between meetings on what others are doing and how things are progressing is to enlist the help of a project management app. Therefore, the next step was using Trello to communicate with our team every day. I would create a Trello card for every feature we wanted to implement, with the user story as the title explaining the experience we were aiming for. I divided these cards up by which team (front end, back end, or data science) would work on them, which made it easier to track what we were currently working on, our status, and what the rest of the team was doing. We held daily standups to update the team on the progress we were making with each feature and any barriers we had encountered.

The next step in completing the story/feature was to break it down into smaller tasks. The user story above can be broken down into the following tasks:
- Deciding on the type of the data.
- Deciding on the source of the data.
- Gathering the data.
- Data wrangling.
- Exploring the data using various data visualization techniques.
- Creating a model according to the user requirements.
- Creating a database best suited for the data type. For example, since this data is relational, I decided to use an AWS RDS PostgreSQL database.
- Collaborating with the web developers, both front end and back end, to decide how the user input would reach me and how my model's prediction would be delivered back to them. This discussion included deciding on the format of the data, database selection, and creating endpoints (see the sketch just after this list).
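Since the back end needed to read my predictions from a shared store, that hand-off could look roughly like the sketch below, assuming a SQLAlchemy connection to the AWS RDS PostgreSQL instance; the `DATABASE_URL` environment variable and the `rent_predictions` table name are placeholder names of my own, not the project's.

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Assumed: the RDS connection string lives in an environment variable,
# e.g. postgresql://user:password@host:5432/cityspire
engine = create_engine(os.environ["DATABASE_URL"])

predictions = pd.DataFrame({
    "zipcode": ["10001", "94103"],
    "predicted_rent": [3050.0, 2875.0],
})

# Write (or replace) the table the back end will query for rental prices.
predictions.to_sql("rent_predictions", engine, if_exists="replace", index=False)
```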
Data
Five different datasets were used to build the machine learning model. First, I used the historical values of the Zillow Rent Index; the rest of the data was obtained from data.world.

Data Preprocessing
Like many other publicly available datasets, and many privately obtained ones, the datasets used in this project contained a considerable amount of missing values.
The first step was to break down the dataset into smaller datasets, each containing values for just one year. Next, in each of the newly created datasets, the zip codes with more than three months of missing values were removed. After cleaning the missing data, I joined each year back into one extensive dataset.
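As a rough sketch of that cleanup, assuming the Zillow table is in the usual wide format with one row per zip code and one column per month (names like "2019-01"), the logic could look like this; the file name and column layout are assumptions of mine.

```python
import pandas as pd

rent = pd.read_csv("zillow_rent_index.csv")                    # assumed file name
month_cols = [c for c in rent.columns if c.startswith("20")]   # e.g. "2019-01"
years = sorted({c[:4] for c in month_cols})

cleaned_years = []
for year in years:
    cols = [c for c in month_cols if c.startswith(year)]
    year_df = rent[["zipcode"] + cols]
    # Keep only zip codes with at most three missing months in this year.
    year_df = year_df[year_df[cols].isna().sum(axis=1) <= 3]
    cleaned_years.append(year_df.set_index("zipcode"))

# Join each cleaned year back into one wide dataset, keeping the zip codes
# that survived every year's filter.
rent_clean = pd.concat(cleaned_years, axis=1, join="inner").reset_index()
```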

The next step in data preprocessing was normalization, which was performed for several reasons: first, to ensure the linearity of relationships between the explanatory variables and the target variable; second, to ensure the normality of the data distribution; and finally, to reduce the influence of outliers on the distribution.
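A small sketch of what that normalization might look like, assuming a log transform on the heavily skewed, strictly positive features followed by z-score scaling; the specific transforms and column names here are my assumptions, not the project's exact steps.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "rent_index": [1200.0, 950.0, 3100.0, 1800.0],   # toy values
    "population": [25_000, 8_000, 600_000, 90_000],
})

# Log-transform skewed, strictly positive features to pull in long tails.
df[["rent_index", "population"]] = np.log1p(df[["rent_index", "population"]])

# Standardize so every feature has zero mean and unit variance.
normalized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
```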
The next step in the data preprocessing journey was the removal of some of the noticeable outliers, which can significantly alter the distribution and the relationship between the features and the target and, as a result, reduce model performance.
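For the outlier step, one common option (and only an assumption about what was actually done here) is the 1.5 × IQR fence:

```python
import pandas as pd

df = pd.DataFrame({"rent_index": [950, 1100, 1200, 1250, 1300, 9800]})  # toy data

q1, q3 = df["rent_index"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Drop rows whose rent index falls outside the IQR fence (here, the 9800 row).
df_no_outliers = df[df["rent_index"].between(lower, upper)]
```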
The final step in the data preparation process was structuring the data in a way that would capture most aspects of the relationships between the dependent and independent variables. For analysis, all of the datasets were transformed to contain an average yearly value for each indicator. The main reason for this was the absence of monthly-level data in most of the datasets used.
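In code, that restructuring amounts to averaging the monthly values per zip code and year; a sketch with assumed long-format column names:

```python
import pandas as pd

monthly = pd.DataFrame({
    "zipcode":    ["10001", "10001", "10001", "10001"],
    "month":      ["2019-01", "2019-02", "2020-01", "2020-02"],
    "rent_index": [2900, 2950, 2700, 2650],
})

monthly["year"] = monthly["month"].str[:4]
yearly = (monthly
          .groupby(["zipcode", "year"], as_index=False)["rent_index"]
          .mean()
          .rename(columns={"rent_index": "avg_rent_index"}))
# yearly now holds one average rent value per zip code per year.
```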
Machine Learning
The dataset was split into train and test sets, and I built a linear regression model. I used two primary metrics for model performance: the coefficient of determination (R-squared) and the root mean squared error (RMSE). The first helped me see how well the model describes the relationship between the Zillow Rent Index and the explanatory variables, and the second helped me evaluate how close the model was, in absolute US dollar terms, to the actual rent values.
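A minimal sketch of that modelling step; the file name, column names, and 80/20 split are my assumptions, while the plain linear regression and the two metrics match the description above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("cityspire_features.csv")      # assumed merged dataset
X = data.drop(columns=["avg_rent_index"])         # assumed target column
y = data["avg_rent_index"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)                        # variance explained
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # error in US dollars
print(f"R^2: {r2:.3f}   RMSE: ${rmse:,.0f}")
```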
API and the endpoint
The next step was to build the API, which included creating and connecting the routes (endpoints), the functions behind them, and the documentation for usage.
I used FastAPI to create the API, and AWS Elastic Beanstalk was used to host it.
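To give a feel for the routing, here is a minimal sketch of what the rental-price endpoint could look like; the route path, the predictions file, and the column names are placeholders of my own, not the project's actual names.

```python
import pandas as pd
from fastapi import FastAPI, HTTPException

app = FastAPI(title="CitySpire DS API (sketch)")

# Assumed: a table of model predictions keyed by zip code.
predictions = pd.read_csv("rent_predictions.csv", dtype={"zipcode": str})

@app.get("/rental_price/{zipcode}")
def rental_price(zipcode: str):
    """Return the predicted monthly rent for a given zip code."""
    row = predictions.loc[predictions["zipcode"] == zipcode]
    if row.empty:
        raise HTTPException(status_code=404, detail="Zip code not found")
    return {"zipcode": zipcode,
            "predicted_rent": float(row["predicted_rent"].iloc[0])}
```

Run locally with `uvicorn main:app --reload` (the module path depends on the project layout), and FastAPI serves the interactive documentation at `/docs` automatically.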

Packages/Technologies used
- Python
- SQL
- Scikit-learn
- BeautifulSoup
- AWS RDS PostgreSQL: Relational database service.
- AWS Elastic Beanstalk: Platform as a service, hosts your API.
- Docker: Containers, for reproducible environments.
- FastAPI: Web framework. Like Flask, but faster, with automatic interactive docs.
- Plotly: Visualization library, for Python.
- Pytest: Testing framework, runs your unit tests.
For source code hosting I used GitHub, which provides version control and smooth collaboration. I used a branching workflow to maintain better version control: I created a new branch for this feature, and while pushing code to the branch I added clear, detailed commit messages to identify my changes. I then created a pull request to notify my team members that I had completed the feature and to ask them to review and merge my changes into the main branch.
Challenges I faced
Throughout this project, obstacles presented themselves as challenges I needed to overcome. I learned new tech stacks, but only after sorting out some of the barriers I ran into along the way.
- For my main feature, the Zillow price estimates, the problem was gathering the data and then understanding it. There were several CSV files, with no details about what each file contained or how they related to each other.
- Launching the app: initially, I was not able to launch the app locally because of a version-mismatch issue, since uvloop is not supported on Windows 10. I had to remove psycopg2-binary and gunicorn from the Pipfile, delete the Pipfile.lock, and then manually install psycopg2-binary and gunicorn back into the Pipfile.
- I wrote the code that helped my team web-scrape the Indeed website and gather the required data (a rough sketch of the approach appears after this list). However, I had to create several notebooks, because Indeed blocked further scraping after two cities.
- Finally, the biggest challenge I faced was hosting the API on AWS Elastic Beanstalk. For some reason, I got a 502 gateway error, and the error logs were not helpful in resolving the issue. Therefore, I had to switch to an EC2 instance to host the API.
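A rough sketch of how the requests + BeautifulSoup scraping could be set up; the Indeed URL parameters, search query, and CSS selector below are my assumptions rather than the project's actual code, since the site's markup changes often, which is also why the rate limiting bit so quickly.

```python
import requests
from bs4 import BeautifulSoup

def scrape_job_titles(city: str, state: str) -> list[str]:
    """Return job titles from the first Indeed results page for a city."""
    url = "https://www.indeed.com/jobs"
    params = {"q": "data scientist", "l": f"{city}, {state}"}   # assumed query
    headers = {"User-Agent": "Mozilla/5.0"}   # plain requests are often blocked
    response = requests.get(url, params=params, headers=headers, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # The selector is an assumption; Indeed's markup changes frequently.
    return [tag.get_text(strip=True) for tag in soup.select("h2.jobTitle")]

titles = scrape_job_titles("Austin", "TX")
```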
The Future
This application is still in its early stages. There are still a lot of features we would like to add, bugs that need to be looked into, and functionality that still needs to be worked on. This is just the beginning. This is my journal, documenting my journey and the contributions I have made so far. There is so much left to do.
I have learned so much working on this application. I was challenged every day, from learning new things like FastAPI, Pydantic, AWS RDS PostgreSQL, and AWS Elastic Beanstalk, to optimizing my code for efficiency, to collaborating with my team. One of the best things about working on projects like these as a data scientist is working with different personalities and meeting new people.
So, how does this project help me with my career goals? It has challenged me to learn new things; to plan, execute, and meet deadlines for shipped features; and to overcome barriers and still complete my tasks. As I move forward, I will continue to push myself, learn, grow, meet new challenges, and overcome them. I look forward to the future, as my journey has only begun.