Data science interviews usually try to judge whether you have enough skill in SQL, R, Python, machine learning, optimization, and related techniques such as Monte Carlo simulation and modeling in general, and how you would approach a problem; that is, how well you think.
The better your thinking, the better your model. The better and simpler your model, the easier it is to set up and solve.
I am now going to ask the questions I would ask someone interviewing with me. My views do not reflect the views of the place where I work, and I do not want to add any bias from places I have worked at before. An interview at the place I work would be completely different, but might overlap a little with the questions below.
Now, straight to the questions:
Just identifying problems:
- To give you an idea of a data scientist’s role: usually one has to first identify a problem. Tell me how you would go about identifying a problem.
- How would you verify that the problem you have identified is one of the most important problems to solve?
- How will you propose a solution to this kind of problem?
- How would you find out which are the top ten money-draining projects that the State of Wisconsin has launched in the last 5 years?
- A more specific question: Imagine you are working for a giant hospital that operates in multiple cities. You have a week to look at all the data the hospital has collected. The data are available on SQL servers, and elaborate SQL queries can be written. How would you go about proposing to the managing director what to focus on?
- I have LinkedIn data available for a geography, say Seattle. Anyone outside this geography has been filtered out. With the remaining 4 million people, I want to find out how connected this network is. How will you formulate this problem?
- Let's say you work for Walmart Labs. Walmart wants to identify which geographies are not doing very well and a few things it can do to improve sales. How will you identify which geographies are not doing well?
- How will you rank them?
- How will you drill this down to the level of ZIP codes?
- How will you identify profitable ZIP codes vs. loss-making ZIP codes?
- How will you identify if this problem is getting worse over time?
- Walmart has a standard floor design. How will you make it better for future stores?
- How will you propose changes to the existing floor plan of Walmart?
- How will you rank 50 such changes?
- How will you test them?
- How will you set up a pilot to test this, and then improve based on the pilot?
- Imagine that Walmart has X thousand cashiers working across all stores in the US. We want to launch self-checkout lanes and also hire cashier associates. How will you find where the optimum point lies for a particular store? How will you scale this to the entire US network?
- We have demographic data available for every state and ZIP code. We have sales numbers available for all states, but not for ZIP codes. How will you figure out approximate sales figures for each ZIP code?
- Now that you have figured out which ZIP codes are most profitable, you have to propose the 10 next-best locations for Walmart stores. How will you identify these locations?
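For the question about estimating ZIP-level sales from state totals, one simple starting point is proportional allocation: split each state's known sales across its ZIP codes in proportion to a demographic weight. The sketch below assumes population is the only weight, which is my simplification and not something the question fixes. The function name `allocate_sales` and the sample figures are made up for illustration; a stronger answer would probably fit the weights from data where ZIP-level sales are known.

```python
from collections import defaultdict

def allocate_sales(state_sales, zip_demographics):
    """Split each state's total sales across its ZIP codes by population share.

    state_sales: {state: total_sales}
    zip_demographics: list of (zip_code, state, population) tuples
    """
    # Total population per state, used as the denominator for each share.
    state_population = defaultdict(float)
    for _, state, population in zip_demographics:
        state_population[state] += population

    estimates = {}
    for zip_code, state, population in zip_demographics:
        share = population / state_population[state]
        estimates[zip_code] = state_sales[state] * share
    return estimates

# Hypothetical illustration data, not real figures.
state_sales = {"WI": 1_000_000.0}
zip_demographics = [
    ("53703", "WI", 30_000),
    ("53211", "WI", 20_000),
    ("54301", "WI", 50_000),
]
zip_estimates = allocate_sales(state_sales, zip_demographics)
```

By construction the ZIP estimates add back up to the state total, which is a useful sanity check to mention in the interview.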
SQL type of questions:
I want to calculate the mean, standard deviation, and count of a metric. Then, for each item in the list, I want to calculate its percentage deviation from the mean. Write a simple SQL query that achieves this. I can set up a very simple table for this and then ask how the candidate is approaching it.
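Here is a minimal sketch of what an answer could look like, run through Python's built-in `sqlite3` module so it is easy to try. The table `items` and its columns `name` and `metric` are hypothetical. SQLite has no built-in standard deviation aggregate, so I assume the population variance is derived as AVG(x*x) - AVG(x)^2; many other databases provide a standard deviation function directly.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, metric REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("a", 10.0), ("b", 20.0), ("c", 30.0)],  # hypothetical sample rows
)

# Count, mean, and population variance in one pass over the table.
stats_sql = """
SELECT COUNT(*) AS n,
       AVG(metric) AS mean,
       AVG(metric * metric) - AVG(metric) * AVG(metric) AS variance
FROM items
"""
n, mean, variance = conn.execute(stats_sql).fetchone()
std_dev = math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding

# Percentage deviation of each item from the overall mean.
deviation_sql = """
SELECT i.name,
       i.metric,
       100.0 * (i.metric - s.mean) / s.mean AS pct_deviation
FROM items AS i
CROSS JOIN (SELECT AVG(metric) AS mean FROM items) AS s
ORDER BY i.name
"""
rows = conn.execute(deviation_sql).fetchall()
```

The subquery computes the mean once and the cross join attaches it to every row, which is the part I would ask the candidate to explain.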
Python, R, Machine learning etc:
- How would you go about identifying which machine learning algorithm works best for a given modeling problem?
- I have a set of data collected over the last year. With a clustering algorithm such as k-means, I want to identify the top 100 clusters and filter out all the remaining clusters. How will you achieve this? What hurdles will you run into? (This is so open-ended that once a candidate begins answering, one can drill down into what they are thinking and how they are approaching the problem.)
- How does the random forest algorithm work?
- Let's say I have data representative of house prices. There are 80 variables that define a house’s price. I want to predict the prices of 1000 houses based on the available data from 2000 houses. How will you come up with a model? Which algorithms can be used?
- Explain how ‘one-hot encoding’ works.
- Explain how linear regression can be improved.
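For the one-hot encoding question, a short hand-rolled sketch is often enough to show you understand the idea: each distinct category gets its own 0/1 column, and each row has exactly one 1. The function name `one_hot_encode` and the sample categories are mine, for illustration only; in practice you would likely use a library routine instead.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row holds a single 1 marking its category."""
    categories = sorted(set(values))  # fixed column order
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical categorical feature, e.g. a house's exterior material.
categories, encoded = one_hot_encode(["brick", "wood", "brick", "stone"])
```

A good follow-up is what happens with a category that appears only at prediction time, since this sketch would have no column for it.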
Check out this one on modeling. I would certainly prepare these and model
Integer programming type of problems:
- I have three variables that can take values between 0 and 1. They are given in three columns. Each is multiplied by X, and the sum of those products is subtracted from Y; the result goes in another column called Col_Sum.
- The sum of the variables must be less than or equal to 2 for each row.
- The sum of Col_Sum must be maximized.
- How will you formulate this problem and how will you solve this?
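The problem as stated leaves some details open, so the sketch below fixes them by assumption: the three variables are binary (0 or 1), each row has a single coefficient X and a constant Y, and Col_Sum for a row is Y minus X times the sum of the three variables. Under those assumptions the rows are independent, so maximizing the total reduces to maximizing each row, and with only eight combinations per row plain enumeration works. The name `solve_row` and the sample rows are hypothetical.

```python
from itertools import product

def solve_row(x_coeff, y):
    """Enumerate binary (a, b, c) with a + b + c <= 2; return best (Col_Sum, choice)."""
    best_value, best_choice = None, None
    for choice in product((0, 1), repeat=3):
        if sum(choice) > 2:
            continue  # violates the per-row constraint
        col_sum = y - x_coeff * sum(choice)
        if best_value is None or col_sum > best_value:
            best_value, best_choice = col_sum, choice
    return best_value, best_choice

# Hypothetical rows of (X, Y).
rows = [(5.0, 100.0), (-3.0, 50.0)]
solutions = [solve_row(x, y) for x, y in rows]
total = sum(value for value, _ in solutions)
```

Enumeration only makes sense at this toy size. With constraints that couple rows together, or many more variables, I would expect the candidate to reach for a mixed-integer programming solver instead.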
Monte Carlo type of problems:
- How will you write a Monte Carlo type of solver for the traveling salesman problem?
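One way to read "Monte Carlo type of solver" is a simulated-annealing style random search: propose a random change to the tour, always accept it if the tour gets shorter, and sometimes accept it if the tour gets longer so the search can escape local minima. The sketch below is one such approach under my own assumptions (segment-reversal moves, a linear cooling schedule, a fixed seed); the names and parameter values are illustrative, not a reference implementation.

```python
import math
import random

def tour_length(tour, cities):
    """Total length of the closed tour visiting cities in the given order."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def monte_carlo_tsp(cities, steps=2000, start_temp=1.0, seed=0):
    rng = random.Random(seed)
    current = list(range(len(cities)))
    rng.shuffle(current)  # random starting tour
    current_len = tour_length(current, cities)
    best, best_len = current[:], current_len
    for step in range(steps):
        temp = start_temp * (1 - step / steps) + 1e-9  # linear cooling
        i, j = sorted(rng.sample(range(len(cities)), 2))
        # Propose a new tour by reversing the segment between i and j.
        candidate = current[:i] + current[i:j + 1][::-1] + current[j + 1:]
        candidate_len = tour_length(candidate, cities)
        delta = candidate_len - current_len
        # Metropolis rule: always accept improvements, sometimes accept worse tours.
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current, current_len = candidate, candidate_len
            if current_len < best_len:
                best, best_len = current[:], current_len
    return best, best_len

# Four corners of a unit square; the shortest closed tour has length 4.
cities = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
best_tour, best_len = monte_carlo_tsp(cities)
```

This gives no optimality guarantee on larger instances, which is exactly the trade-off I would want a candidate to talk through.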
Math and scientific computing, probability etc:
- Explain how Fourier transforms work.
- How does the Naive Bayes algorithm work?
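For the Fourier transform question, it helps to be able to write the discrete version from its definition: each output bin k is the sum of the samples multiplied by a complex exponential rotating at frequency k. The sketch below is the naive O(n^2) form, not the fast algorithm, and the test signal is my own example.

```python
import cmath
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# A cosine that completes 3 cycles over 16 samples.
n = 16
signal = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
magnitudes = [abs(c) for c in dft(signal)]

# Look only at the first half of the spectrum; a real signal mirrors into the second half.
peak_bin = max(range(n // 2), key=lambda k: magnitudes[k])
```

The energy shows up in bin 3 and its mirror image, bin 13, which is a nice way to explain what "frequency content" means.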
I hope some of these pointers help people preparing for interviews like these. Usually the time available in an interview is very short, but once an interview takes a direction, it is easy to keep diving deeper, all the way to slicing and dicing an algorithm.
- One general piece of advice: try to explain things in simple words, with simpler models and simpler techniques. Simplicity is always preferred, and your interview will certainly go more smoothly.
- You should be able to draw charts quickly, so practice this.
- Statistics is the foundation of data science. Interviewers who do not know machine learning often still know statistics well, so it is an easy area for them to test, even if they are not programmers.
- You should be able to tell stories about how you solved problems in your previous jobs, because almost 60–80 percent of the interview can focus on just that.
- That being said, you should certainly review the many other questions I have posted about data science, Python, machine learning, and so on.
- Review “top data-science questions” on various websites.
Stay blessed and stay inspired!