The goal of this project was to develop a tool that aspiring restaurant owners can use to scout for new restaurant locations. The location for a new restaurant is often chosen without careful consideration of available data, and this tool provides a convenient way for restaurant owners to find locations that maximize the probability of success.
In order to make predictions about restaurant longevity, I needed a rich source of restaurant data that included both open and closed restaurants, as well as opening and/or closing dates. I can say with confidence that these data do not exist in a single public source. Instead, much of my project was focused on assembling data from a variety of sources.
Restaurant data came from the publicly available Yelp Dataset. This dataset includes records on approximately 54,000 open and closed restaurants in the US, Canada, and Europe. Information on cuisine type, physical location, meal attributes, the restaurant's Yelp URL, and other details is included. However, the dataset does not provide restaurant opening or closing dates.
Accordingly, I used the Internet Archive's Wayback Machine API to find the earliest archived copy of each restaurant's Yelp URL. These dates served as estimates of restaurant opening dates. However, approximately 60% of restaurants had no archived Yelp page.
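The lookup of the earliest archived copy can be sketched as follows. This is a minimal sketch, not the project's actual code: it assumes the Wayback Machine's CDX search endpoint (the original work may have used a different endpoint), and it relies on my understanding that this endpoint lists captures oldest first, so `limit=1` returns the first capture. The function names are hypothetical.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Wayback Machine CDX search API.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url):
    """Build a query asking only for the timestamp of the first capture."""
    params = {"url": page_url, "output": "json", "fl": "timestamp", "limit": 1}
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_earliest_capture(payload):
    """Return the earliest capture time from a CDX JSON payload, or None.

    With output=json the first row is a header row, so a page with no
    captures yields an empty body, an empty list, or only the header.
    """
    rows = json.loads(payload) if payload.strip() else []
    if len(rows) < 2:
        return None
    return datetime.strptime(rows[1][0], "%Y%m%d%H%M%S")


def earliest_capture(page_url, timeout=30):
    """Query the archive over the network for the first capture of page_url."""
    with urlopen(build_cdx_query(page_url), timeout=timeout) as response:
        return parse_earliest_capture(response.read().decode("utf-8"))
```

Calling `earliest_capture` with a restaurant's Yelp URL returns a `datetime`, or `None` when the page was never archived, which is the case that left roughly 60% of restaurants without an estimate.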
I therefore used Google's Custom Search API to fill in more details. This API provided archive dates for each restaurant's Yelp URL, so I could use the earliest of these dates to fill in opening-date estimates that were missing from the Wayback Machine. Google Custom Search also provided a sampling of reviews for each restaurant. For closed restaurants, I used the date of the most recent review in this sample as an estimate of the closing date.
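The two date estimates described above reduce to a fallback and a maximum. The sketch below assumes the dates have already been extracted from the API responses; the function and argument names are hypothetical.

```python
from datetime import date


def estimate_open_date(wayback_date, google_dates):
    """Prefer the Wayback estimate; otherwise use the earliest Google date."""
    if wayback_date is not None:
        return wayback_date
    return min(google_dates) if google_dates else None


def estimate_close_date(is_open, review_dates):
    """For a closed restaurant, use the most recent sampled review date."""
    if is_open or not review_dates:
        return None
    return max(review_dates)


# Example: no Wayback capture, two Google dates, three sampled reviews.
opened = estimate_open_date(None, [date(2012, 6, 1), date(2011, 9, 15)])
closed = estimate_close_date(
    False, [date(2013, 1, 5), date(2014, 8, 20), date(2014, 2, 11)]
)
```

Because the reviews are only a sample, the latest review date is a lower bound on the true closing date rather than an exact value.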
Finally, I downloaded national demographic and housing data from American FactFinder. These data were grouped by ZIP code.
The next step was to develop a machine learning model to predict restaurant success from these data. After trying a variety of classifiers, including Logistic Regression and a Ridge Classifier, I settled on a Random Forest Classifier in Scikit-learn. Categorical variables such as the restaurant attributes (alcohol served, good for groups, etc.) were converted to quantitative dummy features using the DictVectorizer and CountVectorizer in Scikit-learn. I used RandomizedSearchCV to tune the hyperparameters of the Random Forest Classifier, and achieved an accuracy of about 71% on test data. The relative importance of each feature, determined from the feature_importances_ attribute of the classifier, is illustrated in the figure below.
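The modeling steps can be sketched roughly as follows. This is an illustration on synthetic data, not the project's code: the feature names, labels, and hyperparameter grid are all made up, and the CountVectorizer step is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in records; attribute names and values are hypothetical.
records = [
    {
        "cuisine": str(rng.choice(["pizza", "thai", "mexican"])),
        "alcohol": str(rng.choice(["none", "beer_and_wine", "full_bar"])),
        "good_for_groups": int(rng.integers(0, 2)),
        "median_income": float(rng.normal(60000, 15000)),
    }
    for _ in range(300)
]
# Synthetic label loosely tied to one feature so there is something to learn.
labels = np.array(
    [int(r["median_income"] + rng.normal(0, 10000) > 60000) for r in records]
)

rec_train, rec_test, y_train, y_test = train_test_split(
    records, labels, test_size=0.25, random_state=0
)

# String-valued attributes become one-hot dummy columns; numbers pass through.
vectorizer = DictVectorizer(sparse=False)
X_train = vectorizer.fit_transform(rec_train)
X_test = vectorizer.transform(rec_test)

# Randomized search over an illustrative (assumed) hyperparameter grid.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 5, 10],
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

model = search.best_estimator_
test_accuracy = model.score(X_test, y_test)

# Pair each dummy feature with its importance, largest first.
importances = sorted(
    zip(vectorizer.feature_names_, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Fitting the vectorizer on the training records only keeps the test set out of the feature construction, which is a small design choice that makes the reported test accuracy easier to trust.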
I was not surprised by the relatively low accuracy of the model given the difficulty of the classification problem. In fact, achieving high overall accuracy is not required for my application, as I am only interested in the model's ability to predict the best restaurant locations. The model only needs to be accurate for the predicted successes that it is most confident about. Accordingly, I examined accuracy as a function of the predicted probability of success. These results are summarized in the figure below.
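One way to examine accuracy as a function of predicted probability is to bin the test-set predictions by their predicted probability of success and score each bin separately. The sketch below is an assumed approach, not necessarily the one used for the figure; the function name, bin count, and 0.5 decision threshold are illustrative choices.

```python
def accuracy_by_confidence(probabilities, outcomes, n_bins=5):
    """Group predictions by predicted probability of success and score each group.

    probabilities: predicted P(success) per restaurant, each in [0, 1]
    outcomes: 1 if the restaurant actually succeeded, else 0
    Returns (bin_low, bin_high, count, accuracy) for each non-empty bin.
    """
    correct = [0] * n_bins
    total = [0] * n_bins
    for p, y in zip(probabilities, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        predicted = 1 if p >= 0.5 else 0  # assumed decision threshold
        total[b] += 1
        correct[b] += int(predicted == y)
    return [
        (b / n_bins, (b + 1) / n_bins, total[b], correct[b] / total[b])
        for b in range(n_bins)
        if total[b] > 0
    ]


# Toy example with six predictions.
probs = [0.95, 0.9, 0.85, 0.55, 0.5, 0.1]
actual = [1, 1, 1, 0, 1, 0]
summary = accuracy_by_confidence(probs, actual)
# summary == [(0.0, 0.2, 1, 1.0), (0.4, 0.6, 2, 0.5), (0.8, 1.0, 3, 1.0)]
```

With a fitted classifier, `probabilities` would typically come from the success column of `predict_proba` on the test set. For the location-scouting use case, only the top bins matter, since those are the predictions the tool would surface as suggestions.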
As can be seen in the figure, accuracy is quite high when the predicted probability of success is high. This suggests that when the model is confident of success, its predictions are accurate. In other words, the high-confidence restaurant location suggestions provided by the model are good ones. Of course, the restaurant business in the Bay Area is a cutthroat one, and the difficulty of achieving a high probability of success in this market reflects that. Nevertheless, this tool provides a clear benefit to entrepreneurs looking to open a restaurant in the area.
All code for this project is available at https://github.com/justinmacdonald/tdi-capstone.