
Volume 7, Issue 5, May – 2022 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Cancer Prediction using Machine Learning


Sawan Verma, Saloni Chaudhary, Sunil Kumar, Pranav Singh Rana
Department of Computer Science & Engineering
Meerut Institute of Engineering and Technology, Meerut

Abstract:- Breast cancer is a very common and dominant cancer among women worldwide. It is increasing in developing countries, where most cases are diagnosed in the late stages. Several strategies have already been proposed that compare ML algorithms using various approaches, such as the ensemble approach, blood analysis or data mining algorithms. In this research paper, two machine learning algorithms, Random Forest (RF) and Decision Tree (DT), are compared. The data set was divided into two stages, a training stage and a testing stage. The algorithm that gives the best results will be used in this application, and the model will then classify the cancer as malignant or benign.

Keywords:- Machine Learning, Breast Cancer, Identification, Classification, Prediction, Random Forest, Decision Tree, Malignant, Benign.

I. INTRODUCTION

Breast cancer is a very common and dominant cancer among women worldwide[4]. Global statistics show that it accounts for a preponderance of new cancer cases and cancer-related deaths, which makes it a serious public health issue[2]. Early diagnosis of breast cancer can substantially improve the prognosis and the chance of survival, as it promotes timely clinical treatment of patients. In addition, accurate classification of benign tumors can prevent patients from undergoing unnecessary treatments. The correct diagnosis of breast cancer and the classification of patients into the malignant or benign group is therefore the subject of much research[1]. Because of its advantages in detecting critical features from breast cancer datasets, ML is widely accepted as the technique of choice in breast cancer classification. Classification and data mining methods are effective ways to classify data, especially in the medical field, where they are extensively used in examination and diagnosis to reach a conclusion. The analysis aims to detect the features that are most helpful in predicting malignant or benign cancer and to see the general trends that may aid us in model selection and hyperparameter selection. The main goal is to classify whether the breast cancer is benign or malignant. To achieve this, we have used machine learning classification methods to fit a function that can predict the discrete class of new inputs.

II. BACKGROUND / LITERATURE REVIEW

“On Breast Cancer Detection: An Application of Machine Learning Algorithms on the Wisconsin Diagnostic Dataset” by Abien Fred M. Agarap. Six machine learning algorithms are used for the detection of cancer in this paper. GRU-SVM, Softmax Regression, K-NN, Linear Regression (LR), Multilayer Perceptron and SVM are compared on the Wisconsin Diagnostic Breast Cancer dataset by measuring their classification test accuracy and their sensitivity and specificity values. The dataset consists of features computed from digitized images of FNA tests on a breast mass. To implement the machine learning algorithms, the dataset was partitioned as follows: 70 percent for the training stage and 30 percent for the testing stage. The results were that all the presented machine learning algorithms showed high performance on the binary classification of tumors, i.e. determining whether a tumor is benign or malignant. Hence, the statistical measures on the classification problem were also satisfactory. To further substantiate the results of this research, a CV technique such as k-fold cross-validation should be used. Applying such a technique will not only provide a more accurate measure of model prediction performance, but will also help in determining the most suitable hyperparameters for the machine learning algorithms[3].

“Analysis of Machine Learning Techniques for Breast Cancer Prediction” by Priyanka Gupta and Prof. Shalini L of VIT University, Vellore. In this research paper, ML techniques are explored in order to increase the accuracy of diagnosis. Methods such as CART, KNN and Random Forest (RF) are compared. The dataset used is obtained from the UC Irvine ML Repository. It is found that the KNN algorithm performs much better than the other techniques used in the comparison. The most accurate model was K-Nearest Neighbour. Classification models such as RF and Boosted Trees (BT) showed similar accuracy. Hence, the most accurate classifier can be used to detect the cancer so that treatment can begin at an early stage[4].

“Breast Cancer Diagnosis by Different Machine Learning Methods Using Blood Analysis Data” by Muhammet Fatih Aslan, Yunus Celik, Kadir Sabanci and Akif Durdu addresses early tumor diagnosis. In this paper, four different ML algorithms were used for the early detection of tumors. The aim of the work is to process the results of routine blood analysis with different machine learning methods.

The approaches used are Extreme Learning Machine (ELM), ANN, k-Nearest Neighbor and Support Vector Machine. The dataset is provided by the UCI library. It consists of the attributes age, BMI, glucose, insulin, HOMA, leptin, adiponectin, resistin and chemokine monocyte chemoattractant protein 1 (MCP1), all of which can be obtained from routine blood analysis. The importance of these data in breast cancer detection was investigated with four different ML methods, and the parameters giving the best accuracy values were determined. The KNN and SVM methods were tuned using a hyperparameter optimization technique. The highest accuracy and the lowest training time were given by ELM, at 80 percent and 0.42 sec respectively[5].

“Performance Evaluation of Machine Learning Methods for Breast Cancer Prediction” by Yixuan Li and Zixuan Chen used two datasets. The study first collects the data of the BCCD dataset, which includes 116 patients with nine features, and the data of the WBCD dataset, which includes 699 patients and eleven features. The authors then preprocessed the raw data of the WBCD dataset and obtained data that contains 683 patients with 9 features and the index indicating whether or not the patient has malignant cancer. After comparing the accuracy, F-measure metric and ROC curve of five classification models, the result showed that Random Forest is selected as the primary classification model in this study. Hence, the results of this study provide a reference for experts to distinguish the nature of carcinoma. In this study there are still some limitations that have to be addressed in further work. Because there also exist some indices that have not been discovered yet, this study only collected data on ten attributes. The limited data has an effect on the accuracy of the results. In addition, Random Forest can also be combined with other data mining techniques to get more accurate and efficient results in future work[6].

The aim of the research paper “Breast Cancer Prediction and Detection Using Data Mining Classification Algorithms: A Comparative Study” by Mumine Kaya Keles was to predict and detect breast cancer early, even if the tumor size is small, with non-invasive and painless methods that use data mining classification algorithms. The comparison of data mining classification algorithms was made with the Weka tool. In this research paper, the Weka data mining software was applied to an antenna dataset in order to examine the efficacy of data mining methods in the diagnosis of breast cancer. The dataset that was created had 6006 values, 5405 of which were used as the training dataset, while 601 were used as the test data set. The dataset was then converted to the arff format, which is the file type used by the Weka tool. Ten-fold cross-validation was then used to obtain the most reliable results using the Knowledge Extraction based on Evolutionary Learning (KEEL) data mining software tool. RF performed best during the ten-fold cross-validation, giving an average accuracy of 92.2%[7].

“Breast Cancer Prediction Using Data Mining Method” by Haifeng Wang and Sang Won Yoon tests the effect of feature space reduction. A hybrid between principal component analysis (PCA) and related data mining models is proposed, which applies the principal component analysis method to reduce the feature space. To evaluate the performance of these models, two widely used test data sets are used, Wisconsin Breast Cancer Database (1991) and Wisconsin Diagnostic Breast Cancer (1995). 10-fold cross-validation is applied to measure the test error of each model. PCs-SVM is best for the WBC data, at 97.47 percent, and PCi-ANN is the best in terms of accuracy for the WDBC data, at 99.63%. The reason for the better results from PCA preprocessing is that the principal components only represent a large part of the data within the entire data space, which to some extent reduces data noise; as a result, the feature space is refined[8].

“Machine Learning with Applications in Breast Cancer Diagnosis and Prognosis” by Wenbin Yue and Zidong Wang. In this paper, the authors provide explanations of various ML techniques and their application in BC diagnosis and prognosis, as used to analyze the data in the benchmark database WBCD. ML techniques have shown their remarkable ability to improve classification and prediction accuracy. Although many algorithms have achieved very high accuracy on WBCD, the development of improved algorithms remains necessary. Classification accuracy is an important evaluation criterion, but it is not the only one. Different algorithms consider different aspects and have different mechanisms. Although artificial neural networks have dominated BC diagnosis and prognosis for several decades, it is clear that more recently alternative machine learning techniques are being applied to intelligent healthcare systems to provide a variety of options to medical practitioners[9].

A. Breast Cancer Classification
Breast cancer classification divides the carcinoma into classes depending on how far it has spread. Classification algorithms predict one or more discrete variables based on the other features in a dataset. Data processing software is required to run the classification algorithms. The purpose of the analysis is to select the best treatment. Analysis lets scientists find, group and properly name organisms through a uniform system, which is why it is necessary. Two widely used techniques in data processing are classification and clustering. Clustering is used to extract information from the data set in order to obtain groups or clusters and describe the data set itself. Classification is also known as
supervised learning in ML; it is used to categorise unknown situations based on learning the existing patterns and classes in the data set, and then to predict future situations. The training set, which is used to build the classifying structure, and the test set, which is used to evaluate the classifier, are commonly mentioned in classification tasks. Classification can be a fairly complex optimization problem. Many machine learning techniques have been applied by researchers to solve this classification problem. The artificial neural network, random forest, support vector machine and so on are the most well-known algorithms used for breast cancer classification or prediction. Scientists try to find the best algorithm to achieve the most accurate classification result; however, data of variable quality will also affect the classification result. Further, the scarcity of data will also affect the range of applicable algorithms. If carcinoma is detected early, there are more treatment options and a much better chance of survival. Women whose cancer is identified at an initial stage have a 93 percent or higher survival rate in the first five years. Getting checked regularly can put your mind at ease. Finding cancer at an early stage can also save lives[1].

B. Machine learning algorithms
ML is an application of AI that gives a model the ability to automatically learn and improve from experience without being programmed manually. ML focuses on the development of computer programs that can access data and use that data to learn. The learning process starts with data sets, examples, training and rules, so that the model can then find a pattern and make improvements in the near future, if necessary.

III. PROPOSED METHODOLOGY

Here we propose a methodology that contains the following steps: data preprocessing, data preparation, feature selection, feature projection, feature scaling, model selection and prediction. The first step is data preprocessing, which is a very important step in data mining. It means that the data is manipulated or modified before its use in the model to make the process easier. The second step is data preparation, which refers to cleaning the data and ensuring that the given data is accurate. The third step is feature selection, which refers to variable or attribute selection. It can be described as a technique to reduce the number of input variables during the development of a predictive model, or as the selection of the most relevant features from a given set of data. The next step is feature projection, also known as feature extraction. Feature projection is used to reduce the dimensionality of the feature space: it converts the higher-dimensional space into a lower-dimensional space. The next step is feature scaling, which refers to the standardization of the independent features present in the data to a fixed range. After feature scaling, the next step is model selection. Model selection is defined as the method of selecting a model from the candidate models for the training dataset. The last step is prediction. Prediction refers to predicting the output after the model has been trained on the previous dataset and applied to the new dataset[1].

Fig. 1: Proposed Methodology

A. Random forest
RF is a learning algorithm that belongs to the supervised learning category. It is used for both regression and classification. Random forests, also referred to as random decision forests (RDF), build a large number of trees and obtain their output through ensemble learning methods for classification and regression. Two techniques, bagging and feature randomness, are used to construct those trees. RF is better than the DT because it is less prone to overfitting the data[1].

B. Decision Tree
DT is also a supervised learning algorithm. The aim of using this algorithm is to build a training model that can be used to predict the value of a target variable. It applies a top-down approach to the data: given a data set, it tries to group and label observations that are similar to each other, and looks for the best rules that split the observations that are dissimilar, until a certain degree of similarity is reached. It uses a process known as layered splitting, in which at every layer it tries to split the data into two or more groups, so that the data falling under the same group are most similar to each other, and the groups are as different as possible from each other[1].
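The layered splitting described for the Decision Tree can be illustrated with a minimal sketch of one split on one feature. The paper does not name a splitting criterion, so Gini impurity is an assumption here, and the function names and the toy radius values are hypothetical; only the label encoding (0 = malignant, 1 = benign) follows the paper.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # start from the impurity of the unsplit group
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: one feature and its label (0 = malignant, 1 = benign).
radius = [20.1, 18.4, 17.9, 13.2, 12.5, 11.8]
label = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(radius, label)
print(threshold, impurity)
```

A full tree would repeat this search over every feature and then recurse into each resulting group; a random forest would additionally train each tree on a bootstrap sample and a random subset of features.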

IV. RESULT AND DISCUSSION

We discuss the result through different diagrams that represent how many people belong to the malignant or the benign group. There are four representations of the result in the form of diagrams: barplot, matrix, histogram and pairplot.

Fig. 2 contains graphs that show whether the cancer is malignant or benign, that is, how many people belong to the malignant group and how many belong to the benign group. In this figure there are two target values, 0.0 and 1.0. Here 0.0 means the person belongs to the malignant group and 1.0 means the person belongs to the benign group.

Fig. 2: Pairplot of Cancer Dataframe

In Fig. 3, the result is represented in the form of a correlation barplot, which plots the correlation coefficient of each feature as a bar. In Fig. 3 there are two groups, positive and negative. Most bars show a negative correlation for their features, but some show a positive correlation. If we remove the features mean fractal dimension, texture error and symmetry error, the accuracy of the result will increase, because these features show little correlation with the target value.

Fig. 3: Correlation Barplot

Fig. 4 shows how many people are suffering from cancer and how many are not. We can see that 212 people belong to the malignant group and 357 people belong to the benign group. We found that the Random Forest algorithm gives the best result compared to the DT algorithm. The DT algorithm gives an accuracy of about 84%, and the Random Forest algorithm gives an accuracy of about 98%, which is higher than the accuracy of the Decision Tree.

Fig. 4: Data Visualization Countplot of Cancer and non cancer

In Fig. 5, the result is represented in the form of a correlation matrix, which shows the correlation between each pair of variables in the form of a table. In Fig. 5, 1 represents that the person belongs to the benign group and 0 represents that the person belongs to the malignant group. Here malignant means the person is suffering from cancer and benign means the person is not suffering from cancer.

Fig. 5: Correlation Matrix
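The correlation values discussed for the barplot and the matrix can be sketched as follows. This is a minimal illustration, not the authors' code: the toy feature values, the `pearson` helper and the 0.1 cutoff for a "weak" feature are all assumptions, and only the target encoding (0 = malignant, 1 = benign) and the feature name texture error come from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy rows; target uses the paper's encoding 0 = malignant, 1 = benign.
target = [0, 0, 0, 1, 1, 1]
features = {
    "mean radius": [20.1, 18.4, 17.9, 13.2, 12.5, 11.8],
    "texture error": [1.2, 0.9, 1.5, 1.5, 0.9, 1.2],
}
correlations = {name: pearson(col, target) for name, col in features.items()}
# Assumed cutoff: flag features whose absolute correlation is below 0.1.
weak = [name for name, r in correlations.items() if abs(r) < 0.1]
print(correlations, weak)
```

In this toy data the first feature is strongly negatively correlated with the target and the second is not correlated at all, which mirrors the kind of feature the text proposes to drop. Whether dropping such features raises accuracy on the real dataset would still need to be checked on held-out data.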

V. CONCLUSION

If breast cancer is detected at an early stage, it will help thousands of people to save their lives. This project will help real-world patients and doctors to gather as much information as they can. The study of 9 research papers has helped us to gather information for the project we proposed. We are able to classify and predict the cancer as benign or malignant by using ML algorithms. ML algorithms can be used for medically oriented research; they speed up the process and reduce human and manual errors. This will be very helpful for people, because diagnosis at an earlier stage of the cancer can save lives.

REFERENCES

[1.] “Ultrasound characterization of breast masses” by S. Gokhale, The Indian Journal of Radiology and Imaging, Vol. 19, pp. 242-249, 2009.
[2.] “Breast Cancer Prediction Using Genetic Algorithm Based Ensemble Approach” by Pragya Chauhan and Amit Swami, 18 October 2018.
[3.] “On Breast Cancer Detection: An Application of Machine Learning Algorithms on the Wisconsin Diagnostic Dataset” by Abien Fred M. Agarap, 7 February 2019.
[4.] “Analysis of Machine Learning Techniques for Breast Cancer Prediction” by Priyanka Gupta and Prof. Shalini L of VIT University, Vellore, 5 May 2018.
[5.] “Breast Cancer Diagnosis by Different Machine Learning Methods Using Blood Analysis Data” by Muhammet Fatih Aslan, Yunus Celik, Kadir Sabanci and Akif Durdu, 31 December 2018.
[6.] “Performance Evaluation of Machine Learning Methods for Breast Cancer Prediction” by Yixuan Li and Zixuan Chen, 18 October 2018.
[7.] “Breast Cancer Prediction and Detection Using Data Mining Classification Algorithms: A Comparative Study” by Mumine Kaya Keles, February 2019.
[8.] “Breast Cancer Prediction Using Data Mining Method” by Haifeng Wang and Sang Won Yoon, Department of Systems Science and Industrial Engineering, State University of New York at Binghamton, Binghamton, May 2015.
[9.] “Machine Learning with Applications in Breast Cancer Diagnosis and Prognosis” by Wenbin Yue and Zidong Wang, 9 May 2018.
