class: center, middle, inverse

# Two Sigma RentHop Competition

Matthew Emery [(@lstmemery)](https://github.com/lstmemery)

June 1st, 2017

---

# Winning Kaggle Competitions by KazAnova

1. Understand the Data
2. Understand the Metric
3. Cross-Validate Early!
4. Hyperparameter Tuning

.footnote[[Source](https://www.hackerearth.com/practice/machine-learning/advanced-techniques/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3/tutorial/)]

---

# Who are Two Sigma and RentHop?

- Two Sigma: AI-Heavy New York Hedge Fund
- RentHop: Smart Apartment Search (New York Only)
- Reward: Recruitment to Two Sigma

![Two Sigma Salary](../images/twosigma_salary.png)

.footnote[[Source](https://www.glassdoor.com/Salary/Two-Sigma-Salaries-E241045.htm)]

---

# The Goal

- Predict how interested people will be in this:

![Example RentHop Listing](../images/renthop_listing.png)

---

# Understanding the Data

Training: 49352 Rows

Test: 74659 Rows

- Location Data
- Natural Language Data
- Image Data (78.5 GB compressed)
- ...and everything else you would expect (price, bedrooms, etc.)

![RentHop](../images/twosigma-renthop.png)

---

# Understand the Metric

Multiclass Log Loss (Low, Medium, High Interest)
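
Multiclass log loss is the mean negative log of the probability the model assigned to the true class. A minimal sketch with NumPy follows; the function name `multiclass_log_loss`, the clipping constant `eps`, and the integer class encoding are assumptions made for illustration, not details taken from the competition.

```python
import numpy as np


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    # Clip to avoid log(0), then renormalise so each row sums to 1.
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(probs[rows, np.asarray(y_true)])))


# Hypothetical encoding: 0 = low, 1 = medium, 2 = high interest.
y_true = [0, 2, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.3, 0.5, 0.2],
]
loss = multiclass_log_loss(y_true, probs)
print(round(loss, 4))  # roughly 0.52
```

Only the probability given to the correct class enters the score, so a confident wrong prediction is penalised heavily regardless of which wrong class it picked.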
- Note: The metric isn't ordinal; it treats the three classes as unordered

---

## Manager ID Count
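
A manager ID count is the number of listings posted by each manager. A minimal sketch with pandas, run on made-up toy data: the column names `manager_id` and `price` and the `log1p` transformation are assumptions for illustration, since the slide does not say which transformations were actually used.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are invented).
listings = pd.DataFrame({
    "manager_id": ["a", "b", "a", "c", "a", "b"],
    "price": [2500, 3100, 2700, 1900, 2600, 3300],
})

# Count listings per manager and broadcast the count back to every row.
listings["manager_count"] = (
    listings.groupby("manager_id")["manager_id"].transform("count")
)

# One possible transformation of the count (an assumption, not from the source).
listings["manager_count_log"] = np.log1p(listings["manager_count"])

print(listings)
```
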
Someone used only different transformations of the Manager ID Count and scored in the top 15%.

.footnote[[Source](https://blog.nycdatascience.com/student-works/renthop-kaggle-competition-team-null/)]

---

## Listing ID
- This pattern hinted at a possible data leak...

.footnote[[Source](https://www.kaggle.com/zeroblue/visualizing-listing-id-vs-interest-level)]

---

## Data Leak

The creation times of the image folders were correlated with interest.
- X-Axis: Day
- Y-Axis: Seconds
- .blue[Blue=Low]
- .green[Green=Medium]
- .red[Red=High]

.footnote[[Explanation](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/discussion/32404)]

---

## Feature Engineering

A few interesting ones:

- Grouping by categorical features and finding the count/median/mean/standard deviation of numerical ones (3rd Place)
- Inferring Points of Interest from text descriptions (Supermarket, Subway, etc.) (2nd Place)
- Leveraging duplicate data (Leads and lags on pricing) (11th Place)
- Exclamation marks in description
- Reverse geocoding New York neighbourhoods

---

## Second Place Solution

@Faron

```
- 32 LightGBM models
- 9 Extreme Tree models (sklearn)
- 7 RF models (sklearn)
- 5 Keras models
- 3 XGBoost models
- @KazAnova's StackNet example base-level predictions
```

Best Model: LightGBM (CV: 0.50135 / Test: 0.50557)

Meta-modeled with a 2-layer neural network.

---

## An Aside on LightGBM

![LightGBM](../images/lightgbm.png)

- Faster than XGBoost
- Requires more hyperparameter optimization

---

## Second Place Solution

Grid-Search Bagging

Grid Search: Check cross-validation scores for each hyperparameter at regular intervals, e.g. check the maximum depth of XGBoost from 1 to 10.

Bagging (Bootstrap AGGregating): Sample the data many times, with replacement.

For each of 12 bags:

- Grid search hyperparameters
- If the new hyperparameters are better, blend them into the model

---

## StackNet

Written by Marios Michailidis (KazAnova) for his PhD

A Java-based, flexible meta-modelling network

![StackNet](../images/stacknet_modes.png)

[Source](https://github.com/kaz-Anova/StackNet)

---

# References

[2nd Place Solution](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/discussion/32148)