Data science processes in the context of machine learning and AI can be divided into four distinct phases. In my experience, the biggest bottlenecks in any machine-learning-based data science process are the data acquisition and model deployment phases, and here are two ways to optimize them:
1. Establish a highly accessible datastore.
In most organizations, data is not stored in one central location. Take customer-related information as an example: you have customer contact information, customer support emails, customer feedback and, if your business is a web application, customer browsing history. All this data is naturally scattered because it serves different purposes. It may reside in different databases; some of it may be fully structured, some unstructured, and some may even be stored as plain text files.
Unfortunately, this fragmentation severely limits data science work, because data is the foundation of every NLP, machine learning and AI problem. Having all of it in one place – the datastore – is therefore paramount to accelerating model development and deployment. Given how crucial this is to every data science process, organizations should hire qualified data engineers to build their datastores. This can start off as a simple data dump into one location and gradually grow into a well-thought-out data repository: fully documented, queryable, and equipped with utility tools to export subsets of the data into different formats for different purposes, as sketched below.
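To make that concrete, here is a minimal sketch of one such export utility, assuming the customer data has already been consolidated into a single SQL database. The database file, table, column names and date filter are all hypothetical, chosen purely for illustration:

```python
# Minimal sketch of a datastore export utility. Assumes customer data
# has been consolidated into one SQLite database; all names are
# illustrative, not prescriptive.
import sqlite3

import pandas as pd


def export_subset(db_path: str, query: str, out_path: str) -> None:
    """Run a query against the central datastore and export the result."""
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql_query(query, conn)
    # Choose the output format based on the requested file extension.
    if out_path.endswith(".json"):
        df.to_json(out_path, orient="records")
    else:
        df.to_csv(out_path, index=False)


# e.g. pull recent customer feedback as input for an NLP experiment:
export_subset(
    "datastore.db",
    "SELECT customer_id, feedback_text FROM customer_feedback "
    "WHERE created_at >= '2024-01-01'",
    "feedback_subset.csv",
)
```

The same utility can then serve every team: data scientists pull CSVs for experiments, while other tools consume the JSON export, all from the same documented repository.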
2. Expose your models as a service for seamless integration.
In addition to enabling access to data, it's also important to be able to integrate the models developed by data scientists into the product. It can be extremely difficult to integrate models developed in Python with a web application that runs on Ruby. Moreover, the models may have many data dependencies that your product cannot easily provide.
One way to deal with this is to set up a strong infrastructure around your model and expose just enough functionality for your product to use the model as a "web service." For example, if your application needs sentiment classification on product reviews, all it should need to do is invoke the web service with the relevant text, and the service returns the appropriate sentiment classification, which the product can use directly. The integration then reduces to a single API call. Decoupling the model from the product that uses it also makes it easy for any new product you build to use the same models with little hassle.
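As a minimal sketch of such a service, assuming a pre-trained scikit-learn-style text classification pipeline has been saved to disk (the model file, route and payload shape below are hypothetical):

```python
# Minimal sketch of exposing a sentiment model as a web service.
# Assumes a pre-trained text classification pipeline pickled to
# "sentiment_model.pkl"; names and routes are illustrative.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model once at startup, not on every request.
with open("sentiment_model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/sentiment", methods=["POST"])
def classify_sentiment():
    # The product sends only the raw text; vectorizers and other
    # data dependencies stay hidden behind the service boundary.
    text = request.get_json().get("text", "")
    label = model.predict([text])[0]
    return jsonify({"sentiment": str(label)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

From the product's side, whether it is written in Ruby or anything else, the integration is then a single HTTP call: POST a JSON body like {"text": "Great product, fast shipping!"} to /sentiment and read the sentiment field from the response.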
Now, setting up the infrastructure around your model is a whole other story, and it requires a heavy initial investment from your engineering teams. But once that infrastructure is in place, building new models becomes a matter of fitting them into it.