Performance Improvement in Ethics Assessment

Human systems confronting ethical issues have been observed to develop through the somewhat ad hoc CRIC cycle (Crisis, Response, Improvement, Complacency). Ethical issues become top of mind for the general public (and the legal profession) when egregious ethical failures come to light. Aggregate and individual progress in ethical performance can be difficult to predict in a CRIC cycle and is often addressed only at points of drastic failure. Ethics is a branch of the humanities, with ethical performance typically assessed by manual human effort using natural language inquiries. Widely accepted standards of ethical behavior can change over time as new norms of behavior become socially accepted. Assessments of ethical conformance by groups of people (e.g., an organization, a profession) are important to establishing and maintaining public confidence in that group. Training individuals to improve the ethical conformance of their behavior is already widespread in many organizations and professions. The deployment of Artificial Intelligence (AI) systems is also driving demand both for an increasing number of ethics assessments (due to the increasing number and variety of AI systems) and for ongoing ethics assessments, since these systems can learn and modify their behavior during operation. The CRIC cycle is ill-suited to these needs for ongoing improvements in ethical performance.

Rather than perpetuating CRIC, more systematic quality improvements can be achieved through continuous improvement quality cycles – e.g., the Plan-Do-Check-Act (PDCA) cycle. The PDCA approach aligns with technological approaches for software performance improvement. Applying it to the ethical performance of software systems requires consideration of the appropriate metrics for ethics, and of the measurement and testing procedures for assessing ethics performance. Much of the existing ethical guidance – whether for AI systems or for legal professionals – is captured in high-level principles (e.g., the principles synthesized by Jobin et al. (Jobin et al., 2019)) rather than in rules tailored to more narrowly defined domains. Some ethics regimes (e.g., those governing lawyers) have associated enforcement mechanisms that interpret their rules in the context of specific controversies. ML systems are domain specific because they learn best on data within a narrow, coherent data domain. Software verification and validation proceeds through mechanized testing based on specific test cases rather than natural language human inquiry. Because AI systems continue to learn during their operational phases, ongoing testing is required to verify and validate proper operation, including ethical constraints. The metrics for those ethical constraints themselves require further elucidation.
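To make the idea concrete, below is a minimal Python sketch of one turn of such a PDCA loop over a set of ethics metrics. The metric structure, thresholds, and the evaluate/adjust hooks are illustrative assumptions, not an established benchmark or an implementation from the paper.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class EthicsMetric:
    name: str                             # e.g., "conflict_of_interest_rate" (hypothetical)
    evaluate: Callable[[object], float]   # scores a system snapshot in [0, 1]
    threshold: float                      # minimum acceptable score (the "benchmark")

def pdca_pass(system, metrics: List[EthicsMetric], adjust) -> tuple:
    """One Plan-Do-Check-Act pass against a set of ethics metrics."""
    # Plan: record the benchmarks we intend to meet.
    plan = {m.name: m.threshold for m in metrics}
    # Do: the system snapshot passed in has already produced the behavior to score.
    # Check: score each metric against its benchmark.
    scores: Dict[str, float] = {m.name: m.evaluate(system) for m in metrics}
    failures = {name: score for name, score in scores.items() if score < plan[name]}
    # Act: feed failures back so the system (or its training data) can be adjusted.
    if failures:
        system = adjust(system, failures)
    return system, scores, failures

Each subsequent pass would raise thresholds or add metrics as benchmarks mature, which is where the "continuous improvement" part of the cycle comes in.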


Guidance of greater detail and specificity may be appropriate for identifying and assessing ethical risks within these domains; alternatively, data domains could be defined around ethical risks to better match ML capabilities. New metrics would seem to be required for the assessment of ethical risks, but which metrics, and how many? Guidance and standards for ethics are still nascent and targeted at principles rather than ethics performance benchmarks. Rather than deriving metrics top-down from broad principles, it may be more practical to develop them in the ethical context of particular tasks or of organizational or professional behavior patterns. From this perspective, development of a set of metrics targeted at a particular ethical context (e.g., the rules of professional responsibility for lawyers) should proceed first. Later categorization of those metrics against a broader framework may provide a perspective on the scope of metric coverage and enable insight from metrics developed in other contexts.
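As a purely hypothetical illustration, a handful of context-specific metrics (here loosely inspired by lawyers' professional responsibility rules) could first be named, and only later tagged against broad principle categories to gauge coverage. The metric names and category labels below are placeholders.

CONTEXT_METRICS = {
    # hypothetical context-specific metric -> broad principle category it maps to
    "client_confidentiality_breaches": "privacy",
    "conflict_of_interest_rate": "justice and fairness",
    "fee_disclosure_completeness": "transparency",
}

def coverage_by_principle(metric_map):
    """Count how many context-specific metrics fall under each broad principle."""
    coverage = {}
    for principle in metric_map.values():
        coverage[principle] = coverage.get(principle, 0) + 1
    return coverage

print(coverage_by_principle(CONTEXT_METRICS))
# e.g., {'privacy': 1, 'justice and fairness': 1, 'transparency': 1}

Even a rough mapping like this makes gaps visible: principles with no associated metric are candidates for further metric development.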

Traditional software is developed by writing down the program logic that governs system behavior. With Machine Learning (ML) software, the rules are inferred from training data. For many software systems, the code base relies on third-party libraries; in the case of ML systems, training may likewise be done with third-party data. Most software development processes stress (to varying degrees) the testing of the software under development. Operation of large-scale, complex software systems typically frontloads functionality testing into acceptance tests and then handles software changes as release upgrades, with some degree of regression testing depending on the operational environment and the software supply chain. ML software systems have fundamentally different operational characteristics because their learning mechanisms can change the behavior of the software outside of traditional software update processes.
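One way to picture the consequence: a mechanized ethics regression suite holds a fixed set of test cases that can be re-run whenever the system's behavior may have shifted, not just at release time. The predict() interface, feature names, and the counterfactual-pair style of test below are assumptions for illustration only.

def decisions_consistent(predict, case_a, case_b):
    """Counterfactual test case: two inputs differing only in a protected
    attribute should receive the same decision."""
    return predict(case_a) == predict(case_b)

ETHICS_TEST_CASES = [
    # (case_a, case_b) pairs; feature names are illustrative placeholders
    ({"income": 50000, "group": "A"}, {"income": 50000, "group": "B"}),
    ({"income": 90000, "group": "A"}, {"income": 90000, "group": "B"}),
]

def run_ethics_regression(predict):
    """Re-run the fixed test cases; return the pairs that now fail."""
    return [pair for pair in ETHICS_TEST_CASES
            if not decisions_consistent(predict, *pair)]

Because the suite is mechanized, it can be scheduled against the live system rather than reserved for release upgrades.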

The testing challenge for ML systems is clearly significant even before considering the specific challenge of testing them for compliance with high-level ethical objectives using new metrics and performance benchmarks. AI software can be expected to undergo revisions and updates just as other software does. Because ML software continues to learn new behavior during operation, ongoing assessments of ethics risks will be required. The data collection processes driving operation of the AI software may change over time, and the AI system may learn new behaviors from new data. In addition, the metrics and measurement techniques for assessing ethics performance can be expected to evolve. The number of changing components reinforces the need for a continuous improvement mechanism for assessing the ethical performance of ML systems. A PDCA cycle focused on assessment of ethics risks could help maintain and improve the ethical performance of these software systems.
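A minimal sketch of that ongoing "Check" step might compare fresh metric scores both against their benchmarks and against the previous assessment, flagging drift for review. The metric names implied by the score dictionaries and the 0.05 drift tolerance are assumptions, not recommended values.

def check_for_ethics_drift(previous_scores, current_scores, benchmarks, tolerance=0.05):
    """Flag metrics that fall below their benchmark or degrade materially
    relative to the previous assessment."""
    alerts = []
    for name, current in current_scores.items():
        previous = previous_scores.get(name, current)
        if current < benchmarks.get(name, 0.0):
            alerts.append(f"{name}: below benchmark ({current:.2f})")
        elif previous - current > tolerance:
            alerts.append(f"{name}: degraded from {previous:.2f} to {current:.2f}")
    return alerts

Any alert would feed the "Act" step of the cycle: retraining, data review, or tightening of the benchmark itself.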

Ethical guidance for the developers and operators of AI and autonomous systems is slowly emerging. Software technology approaches for verification and validation of ethical software will challenge the specificity of existing professional guidelines for ethical conformance. The exponential growth in data, and the consequent changes in business practices, demand responses from the professions, government, and the public. The volume of data strains traditional natural language methods for ethics inquiries. New metrics and automated measurement approaches may be tractable if suitable ethical performance benchmarks can be established. Continuous improvement approaches (e.g., the PDCA cycle) could then be applied to raise the performance benchmarks for ethical AI software over time.

If you need help with an AI ethics issue, contact me.

An extended treatment of this topic is available in a paper presented at the IEEE 4th International Workshop on Applications of Artificial Intelligence in the Legal Industry (part of the IEEE Big Data Conference 2020).

References

(Jobin et al., 2019) Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389-399.