Microsoft PowerPoint is used by all kinds of organizations to share information about plans, projects, strategies, and status updates, among many other kinds of content. Organizations often create standard PowerPoint templates for their people to use when creating that content. PowerPoint is convenient in that someone can enter data and it is immediately available for presentation.
As a result, vast amounts of corporate knowledge lie in PowerPoint documents. While in some cases the documents can be searched easily, comparing data across documents in meaningful ways is tedious and time-consuming. If a company has 50 project plans in PowerPoint documents created from the same template, it is not easy to determine, say, the average duration of a project or how many projects are over budget.
As the lead architect of robotic process automation (RPA) at SAIC, I helped to analyze data in PowerPoint documents that laid out business growth plans over multiple time horizons. The documents came from a common PowerPoint template and included text descriptions as well as numerical information, such as market size, growth rates, expected revenues and margins, and so forth. The numerical information was conveyed in text, tables, and bar charts.
Enter robotic process automation
The objective was to avoid copying and pasting data across the documents. I wanted to find out whether RPA could be used to effectively automate this process. If so, perhaps RPA could be used to help unlock the knowledge contained in other kinds of PowerPoint documents. The proof-of-concept was to use RPA to extract the data to a Microsoft Excel spreadsheet and use that to perform data analytics.
I created an RPA bot in UiPath, an RPA software platform. I discovered that people in our company were pretty consistent in populating our template-based documents correctly and did not modify the templates or add unexpected charts. So, the RPA bot worked well.
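The bot itself was built in UiPath, so the following is not the workflow I used. It is a minimal Python sketch of the same idea, offered for readers who want to see what pulling text out of a template-based deck involves. It assumes only that a .pptx file is a ZIP archive whose slides are stored as XML under ppt/slides/, with text held in DrawingML <a:t> elements; the function name extract_slide_text is hypothetical. Chart data lives in separate parts of the archive and is not handled here.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for paragraphs (<a:p>) and text runs (<a:t>).
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_PATH = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def extract_slide_text(pptx_file):
    """Return {slide_file_number: [paragraph_text, ...]} for a .pptx file.

    `pptx_file` may be a path or a binary file-like object. The number comes
    from the slide's file name, which usually, but not always, matches the
    order in which slides are presented.
    """
    result = {}
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            match = SLIDE_PATH.match(name)
            if not match:
                continue
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            # Join the runs inside each paragraph so that a sentence split
            # across formatting runs comes back as one string. Table cells
            # also use <a:p>/<a:t>, so their text is picked up as well.
            for para in root.iter(A_NS + "p"):
                text = "".join(t.text or "" for t in para.iter(A_NS + "t")).strip()
                if text:
                    paragraphs.append(text)
            result[int(match.group(1))] = paragraphs
    return result
```

Because every document comes from the same template, the position of a paragraph on a given slide can serve as a rough field identifier, which is what makes writing the results out as spreadsheet rows practical.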
I then sent the extracted data to my data scientist colleague, Sergio Rego. Looking through the data, Rego immediately ran into style inconsistencies between the documents. For example, people wrote “$3,000,000,” “$3M,” or “$3,000 (in thousands).” This was not unexpected: PowerPoint doesn’t have data validation, so without guidelines any group of people is likely to be inconsistent when filling in a standard form.
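To make the inconsistency concrete, here is a small Python sketch of how the three dollar formats quoted above might be reduced to a single number. It is an illustration under stated assumptions, not Rego's actual wrangling code: the function name normalize_dollars, the list of recognized suffixes, and the handling of "(in thousands)" notes are all hypothetical choices.

```python
import re

# Multipliers for magnitude suffixes (an assumed, non-exhaustive set).
SUFFIXES = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "million": 1e6,
    "b": 1e9, "billion": 1e9,
}
# A number that starts with a digit, with optional commas and decimals,
# optionally followed by one of the suffixes above.
AMOUNT = re.compile(
    r"\$?\s*(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion|k|m|b)?\b",
    re.IGNORECASE,
)


def normalize_dollars(raw):
    """Convert a free-text dollar amount to a float in whole dollars.

    Returns None when no number can be found in the text.
    """
    match = AMOUNT.search(raw)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    suffix = match.group(2)
    if suffix:
        value *= SUFFIXES[suffix.lower()]
    # A note such as "(in thousands)" scales the whole figure.
    lowered = raw.lower()
    if "in thousands" in lowered:
        value *= 1e3
    elif "in millions" in lowered:
        value *= 1e6
    return value


for sample in ["$3,000,000", "$3M", "$3,000 (in thousands)"]:
    print(sample, "->", normalize_dollars(sample))
```

All three samples resolve to the same value of three million dollars. Real inputs will contain formats this sketch does not anticipate, which is part of why the wrangling took so long.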
The style inconsistencies between documents had to be rectified in the data wrangling process. And as is the case in many data analytics projects, data wrangling was the most time-consuming activity, taking longer than the bot development and execution. Once Rego standardized all the data and used Jupyter Notebook for the analysis, he was able to generate different views of the data, including Monte Carlo analysis to predict ranges of expected financial outcomes for the business projects.
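For readers unfamiliar with Monte Carlo analysis, the sketch below shows the general technique in Python using only the standard library. It is not Rego's model: the triangular distribution for annual growth, the function name simulate_revenue, and every input figure are assumptions chosen purely for illustration.

```python
import random
import statistics


def simulate_revenue(base, growth_low, growth_mode, growth_high,
                     years, trials=10_000, seed=0):
    """Simulate revenue after `years` of uncertain annual growth.

    Returns the 5th, 50th, and 95th percentiles of the simulated outcomes.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    outcomes = []
    for _ in range(trials):
        revenue = base
        for _ in range(years):
            # Draw each year's growth rate from a triangular distribution
            # (an assumed shape; note the argument order: low, high, mode).
            revenue *= 1 + rng.triangular(growth_low, growth_high, growth_mode)
        outcomes.append(revenue)
    # quantiles(n=20) returns 19 cut points: index 0 is the 5th percentile,
    # index 9 the median, and index 18 the 95th percentile.
    cuts = statistics.quantiles(outcomes, n=20)
    return {"p5": cuts[0], "p50": cuts[9], "p95": cuts[18]}


# Hypothetical plan: $3M starting revenue, growth between 2% and 15% a year,
# most likely 8%, over a five-year horizon.
print(simulate_revenue(3_000_000, 0.02, 0.08, 0.15, years=5))
```

The output is a range rather than a single forecast, which is the point of the exercise: it shows how widely a plan's outcome could vary given the uncertainty in its inputs.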
Rego ran into even larger problems with the text data. Given that PowerPoint is a presentation medium, people entered data in whatever way made the most sense to them for presentation. This led to variances in the text data, and Rego would have had to interview the people who made the entries in order to develop a taxonomy or an ontology that standardized the data. This was not done, since it would have made the proof-of-concept more expensive and time-consuming.
Nevertheless, Rego was able to perform basic word frequency analysis on the narrative inputs and construct word clouds to identify themes. Moreover, he used latent Dirichlet allocation (LDA) to segment the data into topics with associated words. This was helpful in identifying general trends and patterns.
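Word frequency analysis is the simplest of these techniques, and a rough Python sketch of it fits in a few lines. The stop-word list, the function name word_frequencies, and the sample sentences below are made up for illustration and are not drawn from the actual documents.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for",
              "is", "on", "with", "our", "we"}


def word_frequencies(texts, top_n=10):
    """Return the `top_n` most common non-stop-words across `texts`."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


narratives = [
    "Grow the cloud services market",
    "Cloud migration services for federal customers",
    "Expand cloud analytics",
]
print(word_frequencies(narratives, top_n=2))
# -> [('cloud', 3), ('services', 2)]
```

Counts like these are the input to a word cloud, where each word is drawn at a size proportional to its frequency.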
The proof-of-concept team got together for a hot wash and determined that RPA could be used to effectively mine knowledge from PowerPoint documents created from the same template, so long as the document creators were consistent in how they populated them. Any organization that wants to attempt this should expect a significant data wrangling effort in order to maximize the value of the extracted data.