Large Databases for Strategy Research


This case illustrates how to identify and apply existing data sets as an alternative to conducting time-consuming primary research. In particular, with the advent of big data and the explosion in the availability of useful data, more and more existing data will become available to researchers, particularly in the business disciplines. The case covers the application of readily available financial data to research on the intangible assets (knowledge assets) of organizations. A well-known metric, the Tobin’s q, is evaluated, including different forms and applications, providing insight into the decisions to be made about how existing data best fit a particular research study. Extensions are provided, illustrating how data can be enhanced over time and in different applications. In this case, competitive intelligence activity is combined with the knowledge development (Tobin’s q) data, as is information on big data capabilities. Guidance is provided on data combination, the added potential from such combinations, and the limitations. As in all research, one of the key points is to understand the origins of the data, their meaning, and how best to apply them.

Learning Outcomes

By the end of this case, students should be able to

  • Understand how to access and apply existing data sets, including big data
  • Recognize the benefits of using existing data sets and how to use them to their advantage
  • Recognize the disadvantages and limitations of using existing data sets, as well as how to limit their impact on research projects

Background: Increasing Data Availability

Although the research methods covered in this case can be applied to a wide variety of subjects and applications, the specific example comes from strategic use of organizational knowledge and information. One prominent trend affecting that field is the growth in the generation, capture, and use of big data and business intelligence. Much more data, information, and even knowledge are available for use, especially in business applications.

The phenomenon of big data is often explained by reference to the “3 V’s”: volume, velocity, and variety (Laney, 2001). The amount of data collected and processed is exploding (volume); data are more rapidly available to users, in real time in many cases (velocity); and data can be collected in an increasing number of formats, including unstructured such as text, images, and video (variety). The key for researchers is that considerable data are more and more available, perhaps precluding the need to conduct primary research.

In particular, modern information systems collect a lot of observation data. In a business context, this includes supply/distribution logistics, operational performance, marketing relationships (transactions and other touchpoints), and communications, including digital media. This opens up a wide range of opportunities as observation data can be more quickly gathered, are often more accurate (observing behavior rather than asking about it), and enable huge samples or even entire populations. When communication data are also appropriate, these can be more easily matched with observations and also gathered more quickly and in larger numbers, particularly when ongoing relationships (members, loyalty programs) are used as contact lists for sampling.

Consequently, academic researchers will find increasing opportunities to go beyond traditional smaller sample communication methods such as interviews, focus groups, and surveys. Larger sample existing data may provide new opportunities and can be more relevant to research questions, particularly when compared with common techniques such as undergraduate student or Mechanical Turk sampling.

This case does not include big data so much as a large, secondary data set constructed and used for cross-firm and cross-industry comparisons. Supplemented with other repurposed, existing data, the case illustrates the usefulness of such approaches, how they can be applied, and how they could also be employed with even larger, big data sets.

Project Background

This specific case concerns metrics on intellectual capital (IC) or knowledge assets in organizations. Over the past couple of decades, the IC and knowledge management (KM) disciplines have studied the intangible assets of the firm—essentially the unique knowledge possessed by employees. This goes beyond such long-standing and formalized intangible assets as patents and copyrights, bringing in harder-to-identify and harder-to-evaluate knowhow accumulated by employees over time.

The general logic of both fields is that by accurately assessing IC and then better managing the knowledge assets, firms can gain competitive advantage over rivals. In a play on the well-known resource-based view of the firm, this has been referred to within the disciplines as the knowledge-based view, suggesting that the unique knowledge held by an organization’s employees is what differentiates it from competitors.

So a great deal of time has been spent in defining types of knowledge, identifying variables that influence successful KM (nature of the knowledge, nature of the firm), and identifying KM systems most appropriate for different circumstances (e.g., information technology, communities of practice). Much of the academic work has centered on single firms or a small group of firms where this type of micro-level data can be gathered, often through interviews and surveys, supplemented by line items in firm accounts.

Indeed, a taxonomy developed by Sveiby (2010) includes more than 40 published methods for assessing the IC (knowledge assets) of the firm. Organized by whether each method produces a dollar-based metric and by whether it adds up the components of the firm or looks at the organization as a whole, the results show a clear emphasis on component approaches. These require deep access to an organization. Many approaches resemble financial annual reports (Skandia Navigator, Balanced Scorecard), with the attendant detailed collection of data throughout the firm. While quite useful to those planning KM initiatives and/or to scholars, they are hard to conduct across a larger number of firms. In comparing a particular company with competitors, one would be hard-pressed to gain similar information, as not all participants would be willing to provide an outsider such access.

Consequently, using KM or IC as a strategic planning tool is more difficult. Comparing competitors’ levels of IC and/or competitor effectiveness in managing such knowledge assets cannot be done if internal company data are required. Or at least it can be done only in rare circumstances. Conducting such research broadly, across numerous firms and numerous industries, is impossible without a different approach.

The motivation for the research program described here was to explore strategic approaches to developing IC. The early years of the century saw some large investments in massive KM systems. Often based on big information technology systems, the high-profile KM installations sometimes worked but sometimes were also perceived as failures. It raised the question of whether maximum spending on KM was always optimal. Furthermore, our previous research had uncovered a balance between distributing knowledge as widely as possible throughout the organization and its extended network versus the need to protect such valuable intangible assets from the prying eyes of competitors. Again, is maximum KM necessarily the best choice? Or might circumstances demand a different approach?

To investigate whether investment in KM systems might be more strategic than generally believed, we sought to compare the level of IC by firm and by industry. The proposition was that some industries would show high levels of knowledge development (requiring substantial KM investment to keep pace) while others would show lower levels. While managing knowledge better than competitors is always something to which to aspire, what is needed to achieve that objective probably varies, perhaps significantly.

To do so, we turned to the literature. Going back to the Sveiby taxonomy, the Tobin’s q is a dollar-based approach available at the organizational level. The original idea (Tobin & Brainard, 1977) was to compare the value of the firm (market capitalization) with the tangible assets of the firm (replacement cost of assets). Conceptually, any of the firm’s value that could not be explained by hard assets must be due to softer, intangible assets. In KM/IC circles, these intangibles are seen as knowledge assets/IC, or at least a good proxy for them.

In practice, the original Tobin’s q is not much used; the replacement cost figure is hard to get. But annual financial reports do contain the book value of assets (asset cost less depreciation), supporting a variation on Tobin’s q that provides similar information to the original. And, as we will discuss, financial reports also include liabilities. A further variation on Tobin’s q that might be of interest, book value of assets less liabilities, is also easily obtained. Both can be gathered for a large number of firms in a large number of industries, allowing direct comparisons of the estimated value of IC/knowledge assets held in each. From there, we can start to ask and answer strategic questions.
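As a concrete illustration, the two variations can be computed from standard financial-report fields. This is a minimal Python sketch; the field names and dollar figures are hypothetical, not data from the study.

```python
# Two book-value variations on Tobin's q, computed as ratios.
# Figures are hypothetical, for illustration only (in US$ billions).
firm = {
    "market_cap": 3.0,   # market capitalization
    "assets": 1.0,       # book value of assets (cost less depreciation)
    "liabilities": 0.5,  # total liabilities
}

# Variation 1: market cap over book value of assets.
q_assets = firm["market_cap"] / firm["assets"]

# Variation 2: market cap over assets net of liabilities, focusing on
# the tangibles the firm owns free and clear of debt or other claims.
q_net = firm["market_cap"] / (firm["assets"] - firm["liabilities"])

print(q_assets)  # 3.0
print(q_net)     # 6.0
```

The gap between the two ratios widens with leverage, which is why the second variation matters in debt-heavy industries such as financial services.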

Conceptualization and Methodology

The initial project was to construct a framework organizing industries by the level of knowledge asset development required to be competitive versus the threats faced from competitive intelligence (CI) activity. Such a framework could provide strategic guidance to decision makers in terms of how aggressively to invest in knowledge development (and what industry practices looked like). Similarly, industry practices related to CI can guide top management in terms of whether they need assertive intelligence against competitors and whether to install counterintelligence practices to protect their own knowledge and information.

The knowledge development metrics followed the Tobin’s q variations just discussed. Both market capitalization to assets and market cap to assets less liabilities were collected. The former shows what a firm can accomplish beyond its current level of tangible assets. If two organizations each have US$1 billion in tangible assets, but one has a market cap of US$3 billion while the other is only worth US$1.5 billion, something different is happening which can be attributed to the intangibles.

The second Tobin’s q variation, however, provides a similar picture but focuses on the tangibles actually owned by the firm, free and clear of debt or other claims. This can be important in industries such as financial services, where banks and others hold huge levels of tangible financial assets (producing what might be an artificially low market cap to asset number), most of which have been loaned by depositors, investors, or others. Subtracting out those liabilities gives a truer indication of the firm’s capabilities in some industries. We gathered data for both variations and reported both in most applications, and the metrics agreed in most instances. In the few where they did not, they provided deeper information on what was actually going on with firms in those industries, alerting us to unique circumstances such as very high (or low) debt levels.

Another issue is which version of the Tobin’s q figure is actually calculated, as it can be a subtraction (market cap less assets) or a ratio (market cap over assets). The former gives a sense of the size of the intangible assets in a given firm and the average over an industry. But we usually report the latter. It takes size out of the discussion so that a much larger firm does not appear more successful just because of that size. Firms of all sizes are directly comparable when a ratio is used, though drastic size differences can still produce dramatically different results not due to intangibles (e.g., inflated market caps for small “unicorn” firms).

To apply these metrics, data were collected from annual financial reports (I/B/E/S originally, with a later update through Compustat). The actual download took less than an hour; it was only a matter of specifying the fields to be used, such as market cap and assets, as well as the time period. The original study covered all firms traded on North American stock exchanges from 2005 to 2009, replicated later for 2010–2014. Once the data were in hand, they were sorted by revenue, and the data set was truncated to observations with annual revenue greater than US$1 billion. In both date ranges, this left over 7,000 observations and nearly 2,000 firms—not all firms passed the revenue limit in all 5 years, and the data set also had firms entering and exiting due to merger activity and similar events.
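The truncation step might look like the following Python sketch; the firm-year records are invented for illustration.

```python
# Sketch of the truncation step: keep only firm-year observations with
# annual revenue above US$1 billion. Records are hypothetical.
REVENUE_CUTOFF = 1_000_000_000  # US$1 billion

observations = [
    {"firm": "A", "year": 2005, "revenue": 2_500_000_000},
    {"firm": "A", "year": 2006, "revenue": 900_000_000},   # drops out this year
    {"firm": "B", "year": 2005, "revenue": 1_200_000_000},
    {"firm": "C", "year": 2005, "revenue": 400_000_000},   # never passes the cutoff
]

kept = [obs for obs in observations if obs["revenue"] > REVENUE_CUTOFF]
print(len(kept))  # 2
```

Note that the filter is applied per firm-year, which is why a firm can be present in some years and absent in others, as described above.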

The remaining data were then sorted by Standard Industrial Classification (SIC) number, firm, and date. This meant that industry was determined by the self-reports in the financial documents, not a perfect solution but probably the best available. But, as a result, the data were then amenable to calculating the adjusted Tobin’s q metrics for each firm for each year. These could then be averaged for the full 5-year period for each firm and for each industry.
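The sorting and averaging steps can be sketched as follows; the firm names, SIC codes, and figures are hypothetical, and the ratio form of the metric (market cap over assets) is used.

```python
# Sketch: sort firm-year records by SIC, firm, and year; compute the
# adjusted Tobin's q for each firm-year; then average by industry over
# the period. All records and SIC codes are hypothetical.
from collections import defaultdict

records = [
    {"sic": "2834", "firm": "PharmaCo", "year": 2005, "market_cap": 30.0, "assets": 10.0},
    {"sic": "2834", "firm": "PharmaCo", "year": 2006, "market_cap": 36.0, "assets": 12.0},
    {"sic": "8062", "firm": "HospCo", "year": 2005, "market_cap": 5.0, "assets": 8.0},
]

records.sort(key=lambda r: (r["sic"], r["firm"], r["year"]))

industry_qs = defaultdict(list)
for r in records:
    q = r["market_cap"] / r["assets"]  # ratio form of the metric
    industry_qs[r["sic"]].append(q)

# Average over all firm-years in each industry.
industry_means = {sic: sum(qs) / len(qs) for sic, qs in industry_qs.items()}
print(industry_means)  # {'2834': 3.0, '8062': 0.625}
```

The same grouping logic, keyed on firm instead of SIC code, yields the per-firm averages mentioned above.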

The result was a catalog of these metrics assessing knowledge levels by industry (and by firm, if desired). We have used it to compare the full range of industries (Erickson & Rothberg, 2012) and in industry-specific studies such as health care (Erickson & Rothberg, 2017), financial services, retail, and consumer goods. Although cleaning the data, sorting them effectively, and plugging in all the necessary calculations is a time-consuming process, the compilation is then available to answer individual research questions. It is also amenable to longitudinal studies as second and future databases are constructed.

Supplemental Data and Methodology

As mentioned, the research concept included some measure of the competitive intelligence (CI) efforts in each industry as well, capturing not only the development of knowledge but also the vulnerability of the acquired intangible assets and the protection they require. For that, we had no guidance from the literature. In a pilot study, we used the membership rolls of the Society for Competitive Intelligence Professionals (SCIP), organized by firm and by industry. A raw count by industry gave us some data.

But we found something better. Fuld & Company, a CI consultancy, has for many years conducted a survey of workshop attendees and other contacts. Consequently, for the same years as the knowledge asset data set (2005–2009), we were able to construct a CI metric based on responses from almost 1,000 CI professionals worldwide. The Fuld data included not just the company, allowing us to count on a per-company and per-industry basis, but also a self-report question on the maturity/professionalism of the CI operation. This essentially ranged from one-person, startup CI initiatives to highly experienced, sizable operations. Combining numbers and professionalism allowed construction of an index identifying highly active industries and differentiating them from those with no identifiable CI activities. We will discuss the point more later but note that for this metric, we did not tend to use any results for specific firms. The knowledge asset database is a full population of firms fitting the criteria (listed, revenues above the cutoff). The Fuld survey can identify industries with activity, but whether a CI professional from a specific firm shows up is more a matter of chance. It needs to be treated more as a sample or even a proxy for CI activity in an industry, not as the full population of firms.
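The case does not specify the exact construction of the index, so the Python sketch below is only one plausible approach, combining respondent counts with a maturity weighting; the respondent records, maturity scale, and weighting scheme are all invented for illustration.

```python
# Hypothetical CI index: sum a maturity score (1 = one-person startup
# CI initiative, up to 4 = sizable, highly experienced operation)
# across survey respondents in each industry. Respondents are invented.
from collections import defaultdict

respondents = [
    {"industry": "pharmaceuticals", "maturity": 4},
    {"industry": "pharmaceuticals", "maturity": 3},
    {"industry": "drug retail", "maturity": 1},
]

ci_index = defaultdict(int)
for r in respondents:
    ci_index[r["industry"]] += r["maturity"]  # count weighted by maturity

print(dict(ci_index))  # {'pharmaceuticals': 7, 'drug retail': 1}
```

Whatever the weighting, the point is the same: industries with many mature operations score far above those with little or no identifiable activity.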

An illustration of the results is included in Table 1. We sorted results for all industries into a two-by-two matrix, identifying circumstances where knowledge requirements are high or low and CI activity is high or low. Selected health-related industries are presented here. The mean for knowledge development (market cap/assets is used here) is noted. The CI index is not as amenable to a mean, so we tended to treat double figures on the metric as “high” and single figures or zero as “low.” As we have never used the framework to make precise distinctions, this does not diminish the power of the analysis; one can see the clear difference in results in most cases, including here.

Table 1. Illustration of knowledge metric versus CI metric.

                      Knowledge low (<1.02)    Knowledge high (>1.02)
CI activity high      Health insurance         Pharmaceuticals
                      KM = 0.80                KM = 1.94
                      CI = 36                  CI = 64
CI activity low       Hospitals                Drug retail
                      KM = 0.61                KM = 1.23
                      CI = 3                   CI = 1

CI: competitive intelligence; KM: knowledge management.
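The two-by-two sorting can be expressed as a simple classification rule. This Python sketch assumes the 1.02 knowledge cutoff from the table and treats double-digit CI index values as “high”; the function and the industry figures are illustrative only.

```python
# Classify industries into the two-by-two framework: knowledge metric
# (market cap/assets) versus CI index. Cutoffs follow the case text:
# 1.02 for the knowledge metric, double figures = "high" CI activity.
def quadrant(km: float, ci: int) -> str:
    k = "high" if km > 1.02 else "low"
    c = "high" if ci >= 10 else "low"
    return f"knowledge {k} / CI {c}"

industries = {
    "health insurance": (0.80, 36),
    "drug retail": (1.23, 1),
}

for name, (km, ci) in industries.items():
    print(name, "->", quadrant(km, ci))
# health insurance -> knowledge low / CI high
# drug retail -> knowledge high / CI low
```

Because the framework is never used to make precise distinctions, the crisp cutoffs here are a convenience, not a claim of measurement precision.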

So an industry such as pharmaceuticals has a very high knowledge metric, suggesting that knowledge development is important for competing in this sector. Strategic decision makers should invest accordingly. Similarly, CI is rampant, so counterintelligence to protect that valuable knowledge is important, as is one’s own CI operation to use against competitors. Alternatively, hospitals have low scores for knowledge development and CI. This may sound odd, given the number of highly educated participants, but knowledge development is all about being able to effectively share what you know, and doctors are famously reluctant to do so, especially when it comes to participating in systems documenting organizational knowledge. Without valuable knowledge captured by the organization, there is little need for CI.

The harder-to-explain results are in the other two cells. Low knowledge but high CI (insurance) and high knowledge but low CI (drug retail) are indicative of circumstances where other variables come into play, such as the firm specificity of knowledge, the rarity of new knowledge/learning, and other details from the literature. The unexpected results are what make for interesting papers, and it helps to be able to compare and contrast them with the fuller data set. And, if the results are to be of use to decision makers, these clearly indicate where investment should be put into knowledge systems (retail) versus where CI needs to be aggressive (insurance and other financial services).

As a final point to consider, we have more recently incorporated big data and business intelligence into the database. As long as a supplemental database is comparable, more data can be added. In this case, the McKinsey Global Institute constructed a database of big data holdings (terabytes of storage by firm by industry; Manyika et al., 2011). The categories are not as precise (product manufacturing, process manufacturing, retail) but can be applied across the more specific industries in our database (e.g., all retail, all financial services). Consequently, in more recent work, we have been able to suggest not just the importance of investing in knowledge but the difference in investing in big data versus specific types of knowledge (explicit vs. tacit) versus intelligence.

Practical Lessons Learned

A number of conclusions can be drawn from our experience that can guide future research. Although the databases described are not big data by any means—they can be fully captured by an Excel spreadsheet and actually require some line-by-line attention and processing—the lessons are applicable to the increasing potential seen in big data research. First and foremost, the data were readily available and did not require additional collection. No additional samples, surveys, or time-consuming processes are necessary.

When doing so, however, it is important to ensure the data are appropriate to the intended application. In this case, the ability to bring in literature justifying use of this particular metric (Tobin’s q), especially from a Nobel laureate, provides considerable credibility. Similarly, we have never had any issues using the big data number from the McKinsey Global Institute as, again, there is some credibility already built in. In the case of the CI metric, however, we had to invent our own specific measurement. In fact, we have gone through a few variations over the past decade, so it is important to be able to justify the “imported” metric as a valid measure. In this case, CI has not been a widely researched topic; it is more practitioner-oriented, so we had some flexibility and credibility of our own. But it is important to be able to fully justify the metric.

One should also fully understand the metric, both its strengths and its weaknesses. Once again, this is much easier if you can draw insight from existing literature. The Tobin’s q has obvious advantages but needs to be assessed in terms of the level of tangible assets required for an industry, the debt/asset claims issue discussed earlier, and other peculiarities. If one can draw some of those out of existing literature, the means of handling the potential issues can be much more convincing. One way to handle this is to use multiple metrics. As mentioned, regarding Tobin’s q, we often report both market cap/(assets less liabilities) and market cap/assets. In the vast majority of cases, both agree, but any differences between the two metrics generally provide an additional opportunity to take a closer look at the specific case and insights as to why those circumstances might be different. The enriched explanations are often useful in really understanding what is happening in those circumstances and what additional variables might be affecting the results.

In terms of specific advantages of these approaches, the ability to examine an entire population, rather than a sample, is important. It removes the need to conduct significance tests. This is common in big data. When an organization tests a promotional offer, for example, across its full population of customers and determines that a greater percentage act on it, the firm does not need to determine statistical reliability. A 1% increase across all customers is a 1% increase. So if the big data truly cover the full population, the reliability of the database can be assumed.

Another advantage is the ability to attach additional data. Longitudinal repetitions are doable and, if the existing data are collected repeatedly, easy. Or, as noted earlier, if additional data make sense to add to the constructed database, it can be done readily if in a comparable form. This factor can be extremely useful if conditions change, such as the growth of interest in big data as another intangible asset since this research program began. Rather than having to ensure that every last possible question is answered in a particular survey, when administered, data inputs can be enhanced after the fact.

On the downside, one needs to be careful to take the data as they are. The specific metrics were not constructed for the research project at hand. They may fit the research question, but not perfectly. Understanding the idiosyncrasies of specific measures is important, such as whether to use market cap/(assets less liabilities), market cap/assets, or both, as well as understanding the differences between those choices. Similarly, recognizing the precision of the data can make a difference. As pointed out, the CI data set is unique and valuable but may or may not include specific firms. So trying to employ it to discern fine differences between industries is problematic. A difference of an index value of 64 (pharma) versus 0, 1, or 3, as in the example, is convincing. A difference of 1.30 versus 1.25 in the knowledge metric is not quite so clear. Trying to draw conclusions from slight differences is asking for trouble, particularly because statistical significance can be extremely hard to determine.

On a related topic, we noted the fact that databases sometimes include entire populations. While true, that does not mean that using portions of the population is convincing in the same manner. In our work of comparing across different industries, we are not comparing populations of 7,000 observations. We are comparing industries with, perhaps, 25 observations over the 5-year period. Consequently, we have never reported results from the Tobin’s q database for an industry with fewer than 20 total observations. Pharma has dozens, as do the various financial services industries. But when the numbers dip below 20, that industry simply does not have enough observations to support comparisons of any value. As you divide the database into subsets, realize that you lose some of the advantages of large numbers that can come with big data.
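That observation floor can be applied as a simple filter before any industry comparison; the following Python sketch uses invented observation counts.

```python
# Drop industries with fewer than 20 firm-year observations before
# reporting any comparisons. Observation counts are hypothetical.
MIN_OBS = 20

industry_obs = {"pharmaceuticals": 85, "niche industry": 12, "banking": 140}

reportable = {ind: n for ind, n in industry_obs.items() if n >= MIN_OBS}
print(sorted(reportable))  # ['banking', 'pharmaceuticals']
```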


Conclusion

Using existing databases to compare multiple firms across multiple industries has tremendous potential to contribute to research in numerous applications, particularly in addressing strategy issues in business. With the rise in big data collection and availability, researchers will increasingly see existing secondary data as potential sources for interesting research.

That potential is real. Even so, it is important to fully understand how the available data are collected and what they mean. Existing data may fit a research program exactly or they may need some adjustment. Even if they do fit, they may need clear and convincing explanations of the idiosyncrasies and meaning in particular applications.

But with a full understanding of the data and their limitations, using secondary databases creates opportunities for conducting longitudinal studies, for combining data sets, and for building a structure that can be used for multiple studies/publications. The bottom line is that targeted surveys or other methodologies do not necessarily need to be employed in all circumstances. There may be better sources of data without the need to recreate the data collection.

Exercises and Discussion Questions

  • Describe the advantages of using already collected data rather than primary data you might collect yourself.
  • What are the disadvantages of previously collected data versus primary data you collect yourself?
  • As in the case, consider what readily available data (financial or other) you might apply to a research question. What additional data might you like to gather and add to that data for further insight?
  • Major providers of big data include such well-known firms as Amazon, Netflix, Spotify, Google, and Facebook. If you had unfettered access to their data, what kind of research could you do?
  • In specifying data in the previous question, what unique insights might you be able to discover? What might be the limitations?

Further Reading

Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36, 1165–1188.
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, 90, 60–66.


References

Erickson, G. S., & Rothberg, H. N. (2012). Intelligence in action: Strategically managing knowledge assets. London, England: Palgrave Macmillan. doi:10.1057/9781137035325
Erickson, G. S., & Rothberg, H. N. (2017). Healthcare and hospitality: Intangible dynamics for evaluating industry sectors. Service Industries Journal, 9, 589–606. doi:10.1080/02642069.2017.1346628
Laney, D. (2001). 3D data management: Controlling data volume, velocity and variety. Retrieved from (accessed 1 November 2013)
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Hung Byers, A. (2011). Big data: The next frontier for innovation, competition and productivity. New York, NY: McKinsey Global Institute.
Rothberg, H. N., & Erickson, G. S. (2017). Big data systems: Knowledge transfer or intelligence insights. Journal of Knowledge Management, 21, 92–112. doi:10.1108/JKM-07-2015-0300
Sveiby, K.-E. (2010). Methods for measuring intangible assets. Retrieved from (accessed 4 April 2012).
Tobin, J., & Brainard, W. (1977). Asset markets and the cost of capital. In R. Nelson & B. Balassa (Eds.), Economic progress, private values, and public policy: Essays in honor of William Fellner (pp. 235–262). Amsterdam, The Netherlands: North-Holland.