Finding startups — notes on work as a new data scientist in an early-stage VC fund

(Co-authored by Paul Meinshausen)

This past month I joined Montane Ventures, an early-stage Venture Capital fund, as a data scientist in Bangalore. I joined Montane for the opportunity to work on intriguing data and tech problems with several great startups, and because I like the signal-to-noise ratio in early-stage startups building products with small teams.

Part of my time will be spent with startups that Montane has invested in (or is working with towards an investment), helping them scale up their data science efforts and capabilities. I’ll also spend time doing the more traditional work of a VC analyst: helping find and identify startup prospects for investment.

Venture Capital is typically about backing companies that are using technology in ambitious and innovative ways. The team at Montane Ventures believes that the work of VC investing itself could be improved with ambitious and innovative tech. With all of the ways technology and the use of data science have developed over the past few decades, it’d be strange if we identified promising investments in the same way investors did twenty or thirty years ago. That’s a big part of the reason I was hired, and it’s something I’ll be working on closely with the team.

To make data-science software an effective tool for VC is itself ambitious and difficult (and there are several other funds around the world that are more or less openly engaging in similar efforts). At Montane, we are exploring multiple ways to incorporate data science into venture investing. This post is an effort to trace out some of our thinking specifically on what’s involved in using data science to assist with what I identified above as my second major responsibility: Identifying startups for investment.

In the rest of this post I’m going to:

  1. Explain how venture capital can benefit from data science,
  2. Describe why and how the data we use gets generated,
  3. Outline the two major areas where data science can be applied, and
  4. Share some of our specific hypotheses for how data science will add value.

Investin Ain’t Easy

So why do you need data science to find startups for investment? From the outside it might seem like VCs have it easy: investors have capital; most (all?) startups want capital; the investors just have to choose who gets it and then make their investment.

Only it’s not that straightforward. Investors have a fixed amount of time in which to do all of their work. Now subtract the time they need to spend raising new capital, managing their fund, and working with the companies they’ve already invested in. What remains is the time available for finding startups to make new investments in. Putting it this way helps identify our objective: efficient use of time. We can even turn this into a very simple equation:

I = T - (R + P)

I is the time allotted to making new investments in startups, 
T is the total time available, 
R is time needed to manage the fund administratively and raise new capital, and 
P is the time needed to work with and support the fund’s portfolio (startups already invested in).

The fund needs to allocate I across the available population of startups looking for investment. If there were, say, 10 startups ready to raise capital each year, then you could allocate about 10% of that time to each startup (or less), make your decision about which to support, and then focus on the 1 or 2 startups that you choose. But there are a heck of a lot more than 10 startups looking for investment: there are hundreds or thousands. The size of the population of startups is part of what makes this allocation problem difficult.
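To make the allocation problem concrete, here is a minimal sketch of the equation above in Python. Every number in it is a made-up assumption for illustration (a 2,000-hour working year, 600 hours on fundraising and administration, 800 hours on the portfolio); none of them are Montane's actual figures.

```python
def investment_time(total_hours, fund_hours, portfolio_hours):
    """I = T - (R + P): hours left for finding new investments."""
    return total_hours - (fund_hours + portfolio_hours)


def hours_per_startup(investing_hours, n_startups):
    """Spread the available investing hours evenly across n startups."""
    return investing_hours / n_startups


# Hypothetical figures, chosen only to illustrate the arithmetic.
I = investment_time(total_hours=2000, fund_hours=600, portfolio_hours=800)

for n in (10, 100, 1000):
    print(f"{n:>5} startups -> {hours_per_startup(I, n):.2f} hours each")
```

Under these assumptions, ten startups would get 60 hours each, while a thousand startups would get well under an hour each, which is why an even split stops being workable as the population grows.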

Pictures are another helpful tool in clarifying a problem. The picture below conceptually models the investment decision process as a series of sequentially smaller subsets:

[Figure: the investment decision process as a series of sequentially smaller subsets of startups]

Since the set of startups invested in is a subset of the startups you’re aware of, the quality of the eventual investment subset is constrained by the quality of the initial population.

Now to be clear, no one knows the precise size of the total population of startups in an ecosystem (let’s use India as the example, leaving aside the problem of ecosystems themselves being fluid and hard to demarcate). But we can think of the theoretical total population as the dataset for our allocation problem. And the first challenge is collecting our dataset (and keeping it current), which will remain a sample of the total real population and hopefully approach it as closely as possible. We’ll call this part of the problem Scanning.

Continuing with the startup ecosystem in India, one estimate puts the population of tech startups in India at approximately 4,750. We will use this number as a starting point. Startups are an interesting inventory problem because of the flux in the market. They only raise capital at certain points, and if you’re a fund that focuses on a particular stage, then you can really only invest in a startup during a subset (e.g. Series A) of a subset (the intermittent time periods they’re raising) of the time it exists as a startup.

We want as many startups in our dataset as possible since this is the sample that we’re going to choose from. If and when there are startups that we should have considered for investment that weren’t in our initial population dataset, we have a data-collection and sample-bias problem.


The next step is choosing the subset of startups that we’ll spend time closely considering. We’ll call this part of the problem Filtering. We want our filtering function to be as good as possible, avoiding both false positives (startups we let through the filter and then spend too much time considering when they aren’t a good fit for our fund) and false negatives (startups that don’t make it through the filter but which actually had good investment potential).
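One way to keep score on a filter is to count its false positives and false negatives once we know, in hindsight, which startups were worth considering. The sketch below does that for eight invented startups. The function name, the labels, and the idea that a reliable hindsight judgement is available are all assumptions made for illustration.

```python
def filter_errors(passed_filter, good_fit):
    """Count filter outcomes.

    passed_filter[i] is True if startup i made it through the filter;
    good_fit[i] is True if it later turned out to be worth considering.
    """
    counts = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for passed, good in zip(passed_filter, good_fit):
        if passed and good:
            counts["true_pos"] += 1
        elif passed and not good:
            counts["false_pos"] += 1   # wasted attention
        elif not passed and good:
            counts["false_neg"] += 1   # missed opportunity
        else:
            counts["true_neg"] += 1
    return counts


# Hypothetical outcomes for eight startups.
passed_filter = [True, True, True, False, False, False, False, True]
good_fit = [True, False, True, False, True, False, False, False]

counts = filter_errors(passed_filter, good_fit)
precision = counts["true_pos"] / (counts["true_pos"] + counts["false_pos"])
recall = counts["true_pos"] / (counts["true_pos"] + counts["false_neg"])
print(counts, f"precision={precision:.2f}", f"recall={recall:.2f}")
```

Precision here is the share of startups we spent time on that deserved it, and recall is the share of deserving startups that the filter let through. Tightening the filter usually trades one against the other.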

The next two sections sketch out the Scanning and Filtering problems.


Developing data science software for VC is not about automating VC (just to be clear!). Much of what investors do isn’t going to be replaced by machines. Personal relationships cultivated over investors’ long tenures result in inbound referrals that are often a major advantage for established VC funds. Inbound traffic (startups that reach out to an investor rather than the opposite) in general is an important part of a VC pipeline. Besides personal references, inbound traffic is driven by general reputation from online and offline media like stories, interviews, and funding announcements. It is a VC’s online presence that makes a fund discoverable by startups searching for investment.

We don’t want to replace any of the above. We just think that leaving the filtering dataset limited to inbound will leave us with a deficient dataset — a non-inclusive and non-representative sample of the actual population of startups.

In addition to sourcing startups through inbound activities, the internet and online sources are already an important part of any investor’s search process. The limitation is that for many investors the search process is mostly manual. An investor will read startup news, social media, etc., and come across startups that they decide to follow up with for further investigation, or assign an analyst or associate to do the follow-up. Data science can help build a supplemented search process to make our filtering dataset as comprehensive as possible.

Internet sources for data collection can be divided into several categories.

  • Aggregator sites are the most direct and substantial source. They present data in structured format and they try to be as comprehensive and detailed as possible. The two best-known sites on the more open side are Crunchbase and AngelList. Sites like Tracxn are good examples of paid/subscription sites and services.
  • Funding announcements are another strong source. These are less structured than aggregator sites, but often they’re the first public documentation of a funding event. VCCircle in India is one site that has semi-structured funding announcements. YourStory is another.
  • Accelerators/Incubators are a decent source of early stage startups. If there were only one or two of these, you could manually track them fairly easily. However, there are loads of these and they have new cohorts every few months; ideally you don’t want to have to remember to manually check each site every few months to find out about a startup.
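Whatever the mix of sources, the records they yield have to be merged into a single dataset without counting the same startup twice. Here is a minimal sketch of that step, assuming the records have already been collected into plain Python dictionaries. The startup names, the field names, and the crude name-normalization rule are all invented for illustration, and nothing here calls any real site's API.

```python
import re


def normalize_name(name):
    """Crude matching key: lowercase, drop punctuation and common suffixes."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())
    key = re.sub(r"\b(pvt|private|ltd|limited|inc|technologies)\b", "", key)
    return " ".join(key.split())


def merge_sources(*sources):
    """Merge (source_name, records) pairs, keyed on the normalized name.

    Earlier sources win on conflicting fields; every merged record
    remembers which sources mentioned it.
    """
    merged = {}
    for source_name, records in sources:
        for record in records:
            key = normalize_name(record["name"])
            entry = merged.setdefault(key, {"sources": []})
            entry["sources"].append(source_name)
            for field, value in record.items():
                entry.setdefault(field, value)
    return merged


# Hypothetical records; the startup names are invented.
aggregator = [{"name": "Acme Pay Pvt. Ltd.", "city": "Bangalore", "stage": "Seed"}]
announcements = [{"name": "Acme Pay", "last_round_usd": 500000}]
accelerator = [{"name": "Bolt Grid Technologies", "cohort": "2018-1"}]

merged = merge_sources(
    ("aggregator", aggregator),
    ("announcements", announcements),
    ("accelerator", accelerator),
)
for key, entry in merged.items():
    print(key, entry)
```

Exact matching on a normalized name will miss variants such as "AcmePay", so a real pipeline would likely need fuzzier matching or a second key such as the company's website domain.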

Not all information will be available online. However, there are three reasons why startups will share information about themselves online, where we can find it efficiently:

  1. They need to raise capital and they want investors to find them;
  2. They need to hire and they want employees to find them;
  3. They need to sell and they want customers to find them.

Any new company is going to be operating under at least two of these three motivations.


Once we have a dataset of all startups that are broadly viable targets for an investment within the next ~6 (or N) months, the next task is filtering them.
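As a rough sketch of what "viable within the next N months" could look like in code, the snippet below keeps only startups at a target stage whose expected raise date falls inside the window. The `next_stage` and `expected_raise` fields are hypothetical. In practice the raise date would itself be an estimate, and how to produce that estimate is not covered here.

```python
from datetime import date, timedelta


def in_window(startups, today, months=6, stage="Series A"):
    """Keep startups at the target stage expected to raise within `months`."""
    horizon = today + timedelta(days=30 * months)  # approximate month length
    return [
        s for s in startups
        if s["next_stage"] == stage and today <= s["expected_raise"] <= horizon
    ]


# Hypothetical records for three startups.
startups = [
    {"name": "A", "next_stage": "Series A", "expected_raise": date(2018, 9, 1)},
    {"name": "B", "next_stage": "Series A", "expected_raise": date(2019, 6, 1)},
    {"name": "C", "next_stage": "Seed", "expected_raise": date(2018, 8, 1)},
]

targets = in_window(startups, today=date(2018, 6, 1))
print([s["name"] for s in targets])
```

Only startup A passes: B is expected to raise too late, and C is at a stage this hypothetical fund does not invest in.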

Much of filtering and the final decisions about investments have to be done using human investor expertise and insight. We don’t expect to have software make final decisions about an investment. So it’s useful to be more precise in our objectives. We have formulated two, which we still consider hypotheses rather than conclusive objectives:

  1. Be able to make more systematic comparisons between startups rather than zeroed-in yes-or-no decisions on each individual startup on its own.
  2. Be made aware of red flags or areas that need close attention earlier in the investment process.

1. There is value in making more systematic comparisons between startups rather than isolated yes-or-no decisions on an individual startup.

This first objective is one that we’ve developed through reflection on our experience making investments. The investment process for each startup typically occurs in a way that seems sequential and divergent. When you are not deep into an investment due diligence process, you are regularly meeting and evaluating startups. Once you have decided a startup looks interesting, you usually begin to spend more time with them, considering their thesis and business model. As you spend more in-depth time with a startup, you have less time to do intro meetings and exploratory assessment of additional startups in parallel. The longer you spend with a startup, the more it diverges from alternative investments, and comparisons become less frequent and less useful.

This can be modelled as a kind of tunnel process, where the deeper down the tunnel you get, the harder it is to compare the startup with any other, since you’re thinking about all the details and context specific to this one startup.

There’s something deficient about this approach, because an investor isn’t really making a decision about whether to invest or not. They already have raised money and it’s presumed they will invest it. The question is really whether to invest (time and attention as well as eventually capital) in startup E or startup F (or startup G…). It’s always a tradeoff. And while you have to evaluate all the individual reasons for and against investing in startup E, the analysis should to the greatest possible extent involve explicit and systematic comparison between the available alternative investments. And the comparison should happen early and regularly through the investment process.
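One simple way to make such comparisons explicit is to put every candidate on the same scale for each metric, so that startup E is always read relative to F and G rather than in isolation. The sketch below uses z-scores for this. The metrics, their values, and the choice of z-scores are assumptions for illustration, not a description of how Montane actually compares startups.

```python
from statistics import mean, pstdev


def compare(startups, metrics):
    """Express each metric as a z-score relative to the candidate set."""
    table = {name: {} for name in startups}
    for metric in metrics:
        values = [startups[name][metric] for name in startups]
        mu, sigma = mean(values), pstdev(values)
        for name in startups:
            # Guard against a zero spread (all candidates identical).
            z = (startups[name][metric] - mu) / sigma if sigma else 0.0
            table[name][metric] = round(z, 2)
    return table


# Hypothetical candidates and metrics.
candidates = {
    "E": {"monthly_growth_pct": 12, "team_size": 8},
    "F": {"monthly_growth_pct": 6, "team_size": 20},
    "G": {"monthly_growth_pct": 9, "team_size": 14},
}

table = compare(candidates, ["monthly_growth_pct", "team_size"])
for name, row in table.items():
    print(name, row)
```

A positive score means the startup is above the average of the current candidate set on that metric, and a negative score means it is below. Rerunning the comparison whenever a new candidate enters the set keeps the comparison happening early and regularly rather than only at the end.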

2. Becoming aware of red flags or areas that need close attention as early as possible in the investment process.

As we’ve developed our hypotheses and built our prototype processes, we like to identify very real and concrete and specific examples of the value we would add to the investing process. For our second hypothesis, we’ll share one example of a real startup in Bangalore. Just to note, this is not about condemning or criticizing a particular startup. It’s just that sometimes learning comes from explicitly acknowledging and paying attention to unfortunate stories.

The years 2015 and 2016 were a period when many fintech startups were being founded and funded in India. One of the startups from that period was a company in Bangalore called Finomena. They received a lot of attention, were backed by one of the major global VC funds active in India, and for a while were considered one of the best-known startups in the space. By 2018 they had shut down. While the precise reason(s) Finomena closed are not public knowledge, within the community it’s understood that there were problems with leadership and problematic management of capital and finances. This is the kind of outcome that VCs are keen to avoid when they decide to invest in a company.

So how would data science have played a role in gathering and identifying information that could have been analyzed to help detect these problems early on? As one example, Finomena’s employee reviews on Glassdoor are indicative of problems with management and possibly with the company’s culture. Reviews that consistently bring up the same complaints and back-and-forth arguments within the reviews between former employees and management are a warning sign. Once your data pipeline is set up, it’s a straightforward step to automate scanning Glassdoor for any companies within your dataset and quickly filtering against red flags.
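A very rough sketch of that kind of red-flag scan is shown below. It assumes the review texts have already been collected into a list of strings (how they are collected is out of scope here, and no Glassdoor API is assumed). The keyword lists and the 30% threshold are hypothetical, and the review snippets are invented, not taken from any real company's page.

```python
RED_FLAG_TERMS = {  # hypothetical keyword lists, not a validated lexicon
    "management": ["micromanage", "no direction", "toxic"],
    "finances": ["salary delayed", "unpaid", "late salary"],
}


def red_flags(reviews, min_share=0.3):
    """Return the themes raised in at least `min_share` of the reviews."""
    flagged = {}
    for theme, terms in RED_FLAG_TERMS.items():
        # Count reviews that mention at least one term for this theme.
        hits = sum(any(t in review.lower() for t in terms) for review in reviews)
        share = hits / len(reviews) if reviews else 0.0
        if share >= min_share:
            flagged[theme] = round(share, 2)
    return flagged


# Invented review snippets for illustration.
reviews = [
    "Founders micromanage everything.",
    "Salary delayed for two months.",
    "Great learning, but toxic leadership.",
    "Good office and friendly peers.",
]
print(red_flags(reviews))
```

Here the management theme is flagged because half of the reviews raise it, while a single complaint about finances stays below the threshold. Keyword matching like this is only a first pass for surfacing companies to look at more closely; it says nothing on its own about whether the complaints are justified.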

As a caveat, not all red flags will show up in any particular place. Startups can artificially varnish their profile in individual places online. Despite the ability to manipulate a public image, there’s still a good case to be made for systematically and regularly scanning online sources for information and incorporating that information into your decision process.


At Montane we’re still in the early days of developing our work in this space. We’re excited to be a part of it, and to bring the tools and methods of data science to the world of VC. We’re going to keep developing our thinking and sharing examples of our work through posts like these. So if you’re interested in bringing tech and data science to Venture Capital, then stay tuned!